A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Presented by: Ahmed Abd Elhafeez (AAST, Computer Engineering), 06/06/2022



AGENDA: Scientific and Medical Background
1. What is cancer?
2. Breast cancer
3. History and background
4. Pattern recognition system decomposition
5. About data mining
6. Data mining tools
7. Classification techniques

AGENDA (Cont.): Paper contents
1. Introduction
2. Related work
3. Classification techniques
4. Experiments and results
5. Conclusion
6. References

What Is Cancer?

• Cancer is a term used for diseases in which abnormal cells divide without control and are able to invade other tissues. Cancer cells can spread to other parts of the body through the blood and lymph systems.
• Cancer is not just one disease but many diseases; there are more than 100 different types of cancer.
• Most cancers are named for the organ or type of cell in which they start.
• There are two general types of cancer tumours, namely:
  - benign
  - malignant

Types of cancer include: skin cancer, breast cancer, colon cancer, lung cancer, pancreatic cancer, liver cancer, bladder cancer, prostate cancer, kidney cancer, thyroid cancer, leukemia, endometrial cancer, rectal cancer, non-Hodgkin lymphoma, cervical cancer, and oral cancer.

Breast Cancer

• The second leading cause of death among women is breast cancer, coming directly after lung cancer.
• Breast cancer is considered the most common invasive cancer in women, with more than one million cases and nearly 600,000 deaths occurring worldwide annually.
• Breast cancer comes at the top of the cancer list in Egypt, with 42 cases per 100 thousand of the population. However, 80% of the cases of breast cancer in Egypt are of the benign kind.

History and Background

Medical prognosis is the estimation of:
• cure
• complication
• disease recurrence
• survival
for a patient or group of patients after treatment.

Breast Cancer Classification

• Round, well-defined, larger groups are more likely benign.
• Tight clusters of tiny, irregularly shaped groups may indicate cancer (malignant).
• Suspicious pixel groups show up as white spots on a mammogram.

Breast cancer's features

• MRI: cancer can have a unique appearance; features that turned out to be cancer are used for diagnosis and prognosis of each cell nucleus.

[Figure: a magnetic resonance image goes through feature extraction, producing features F1, F2, F3, ..., Fn.]

Diagnosis or prognosis

[Figure: breast cancer is classified as either benign or malignant.]

Computer-Aided Diagnosis

• Mammography allows for efficient diagnosis of breast cancers at an earlier stage.
• Radiologists misdiagnose 10-30% of the malignant cases.
• Of the cases sent for surgical biopsy, only 10-20% are actually malignant.

Computational Intelligence

Computational Intelligence = Data + Knowledge + Artificial Intelligence.

[Diagram: related fields include expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, and neural networks.]

What do these methods do?

• Provide non-parametric models of data.
• Allow classifying new data into pre-defined categories, supporting diagnosis and prognosis.
• Allow discovering new categories.
• Allow understanding the data by creating fuzzy or crisp logical rules.
• Help to visualize multi-dimensional relationships among data samples.

Pattern recognition system decomposition

Dataset → Data preprocessing → Feature selection → Selecting data mining tool → Classification algorithm (SMO, IBK, BF Tree) → Results and evaluations

Performance evaluation cycle:

Dataset → Data preprocessing → Feature selection → Classification → Selection of data mining tool → Performance evaluation → Results


Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit data to a model
  - descriptive
  - predictive

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

Data mining tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka

• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Quality measures: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization: part of data reduction, of particular importance for numerical data.
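As a concrete illustration of the data-cleaning step, here is a minimal Python sketch (not from the paper; the tiny dataset and attribute name are made up) that fills missing values with the column mean:

```python
def impute_mean(rows, col):
    """Replace None entries in `col` with the mean of the known values."""
    known = [r[col] for r in rows if r[col] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[col] is None:
            r[col] = mean
    return rows

patients = [{"clump_thickness": 5}, {"clump_thickness": None}, {"clump_thickness": 3}]
impute_mean(patients, "clump_thickness")
print(patients[1]["clump_thickness"])  # mean of 5 and 3 -> 4.0
```

Smoothing, outlier removal, and normalization follow the same shape: a pass over the rows that rewrites one attribute at a time.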


Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• improving the prediction performance of the predictors
• providing faster and more cost-effective predictors
• providing a better understanding of the underlying process that generated the data

Feature Selection

• Transforming a dataset by removing some of its columns: (A1, A2, A3, A4, C) → (A2, A4, C)
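The column-removal view of feature selection can be sketched in a few lines of Python; the attribute names follow the slide's toy example, and the values are made up:

```python
def select_features(rows, keep):
    """Keep only the named attributes (columns) of each row."""
    return [{k: r[k] for k in keep} for r in rows]

dataset = [
    {"A1": 1, "A2": 7, "A3": 0, "A4": 2, "C": "benign"},
    {"A1": 4, "A2": 9, "A3": 5, "A4": 8, "C": "malignant"},
]
reduced = select_features(dataset, ["A2", "A4", "C"])
print(reduced[0])  # {'A2': 7, 'A4': 2, 'C': 'benign'}
```

The interesting part of feature selection is choosing *which* columns to keep (e.g. by chi-squared or information gain, as in the results section); the transformation itself is just this projection.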


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on a model built from the training set of known categories.

Classification (recognition), i.e. supervised classification: e.g., assigning new points to category "A" or category "B".

Classification

• Every day, all the time, we classify things.
• E.g., crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not?

Classification vs. Prediction

• Classification:
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
• Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects
• Estimate the accuracy of the model:
  - The known label of each test sample is compared with the classified result from the model.
  - The accuracy rate is the percentage of test-set samples that are correctly classified by the model.
  - The test set is independent of the training set, otherwise over-fitting will occur.
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

A classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
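The learned rule above can be applied directly in code. This small Python sketch (illustrative, not part of the paper) evaluates the rule on the unseen tuple (Jeff, Professor, 4):

```python
def predict_tenured(rank, years):
    """The model from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

print(predict_tenured("Professor", 4))       # Jeff -> yes (rank matches)
print(predict_tenured("Assistant Prof", 2))  # Tom  -> no
```

Since Jeff's rank is Professor, the first condition fires and the model predicts tenured = 'yes'.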

Classification

• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is used to classify new objects.


Quality of a classifier

• Quality will be calculated with respect to the lowest computing time.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus, the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

Classification Techniques

• Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
• The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification techniques include:
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example

[Figure: scatter plot of Humidity vs. Temperature; points marked "play tennis" and "do not play tennis".]

Linear classifiers: which hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution:
  - It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
• The line ax + by - c = 0 represents the decision boundary.
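To make the boundary concrete: a point (x, y) is classified by the sign of ax + by - c. A minimal Python sketch, with illustrative coefficients not taken from the slides:

```python
def classify(a, b, c, x, y):
    """Classify a 2-D point by which side of the line ax + by - c = 0 it falls on."""
    return "positive side" if a * x + b * y - c > 0 else "negative side"

a, b, c = 1.0, 1.0, 10.0            # boundary: x + y = 10
print(classify(a, b, c, 8, 7))      # 8 + 7 - 10 > 0 -> positive side
print(classify(a, b, c, 2, 3))      # 2 + 3 - 10 < 0 -> negative side
```

All linear classifiers share this prediction rule; they differ only in how they choose a, b, and c, which is exactly where the SVM's margin-maximization comes in.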

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data
(ii) place the hyper-plane "far" from the data

SVM: Support Vector Machines

[Figure: support vectors shown for a small-margin and a large-margin separating hyperplane.]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: support vectors lie on the margin; the maximal margin is contrasted with a narrower one.]

Non-Separable Case


The Lagrangian trick

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: points labeled "response" and "no response"; the new point's class: response.]

Distance Between Neighbors

• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum over i = 1..n of (xi - yi)^2 )

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

Distance(John, Rachel) = sqrt[(35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2]
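The example distance can be checked with a few lines of Python (income measured in units of $1K, as on the slide):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)    # age, income in $1K, number of credit cards
rachel = (41, 215, 2)
print(round(euclidean(john, rachel), 2))  # sqrt(36 + 14400 + 1) -> 120.15
```

Note how income dominates the distance because of its larger scale; this is why normalization (from the preprocessing section) matters for KNN.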

Instance-Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: points labeled "response" and "no response"; the new point's class: response.]

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35         | 3         | No       | sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel   | 22  | 50         | 2         | Yes      | sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah   | 63  | 200        | 1         | No       | sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom      | 59  | 170        | 1         | No       | sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie   | 25  | 40         | 4         | Yes      | sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74

The three nearest neighbors of David are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is Yes.
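A minimal Python sketch reproducing this 3-NN prediction, using the table's values:

```python
import math
from collections import Counter

training = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def knn_predict(query, examples, k=3):
    """Majority class among the k training examples closest to `query`."""
    dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    nearest = sorted(examples, key=lambda e: dist(query, e[1]))[:k]
    return Counter(label for _, _, label in nearest).most_common(1)[0][0]

print(knn_predict(david, training))  # Rachel, John, Nellie are nearest -> "Yes"
```

This is the whole of instance-based learning: no training step, just distance computations at prediction time.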

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(small box office) = 0.3
  - P(medium box office) = 0.6
  - P(large box office) = 0.1

Jenny Lind: Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with movie company | $200,000         | $1,000,000        | $3,000,000
Sign with TV network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
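The expected-return arithmetic can be verified with a short Python sketch using the payoff table's numbers:

```python
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payouts):
    """Probability-weighted sum of payouts over the states of nature."""
    return sum(probs[s] * payouts[s] for s in probs)

ev_movie = expected_value(movie)  # $960,000
ev_tv = expected_value(tv)        # $900,000
best = "movie" if ev_movie > ev_tv else "tv"
print(round(ev_movie), round(ev_tv), best)
```

This is exactly the computation performed at each chance node when the decision tree is solved from right to left.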

Decision Trees

• Three types of "nodes":
  - decision nodes, represented by squares
  - chance nodes, represented by circles
  - terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Figure: a decision node (square) branching to Decision 1 and Decision 2; Decision 1 leads to a chance node (circle) with Event 1, Event 2, and Event 3.]

Jenny Lind Decision Tree

[Figure: a decision node with two branches. "Sign with Movie Co." leads to a chance node with outcomes small ($200,000), medium ($1,000,000), and large ($3,000,000) box office; "Sign with TV Network" leads to a chance node paying $900,000 for every outcome.]

Jenny Lind Decision Tree (with probabilities)

[Figure: the same tree with probabilities 0.3 (small), 0.6 (medium), and 0.1 (large) on the chance-node branches, and an ER (expected return) label at each chance node.]

Jenny Lind Decision Tree - Solved

[Figure: the tree with expected returns filled in: ER = $960,000 for signing with the movie company and ER = $900,000 for the TV network; the best ER is $960,000, so the movie branch is chosen.]


Evaluation Metrics

                 | Predicted as healthy | Predicted as unhealthy
Actual healthy   | TP                   | FN
Actual unhealthy | FP                   | TN

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  - split the data into 10 equal-sized pieces
  - train on 9 pieces and test on the remainder
  - do this for all possibilities and average
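A stdlib-only Python sketch of the 10-fold split described above (Weka performs this internally; this is just an illustration of the partitioning):

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]  # k disjoint pieces
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(150))
print(len(splits), len(splits[0][1]))  # 10 folds, 15 test instances each
```

Each instance appears in exactly one test fold, so every example is used for testing once and for training nine times; the reported accuracy is the average over the 10 runs.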

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• are usually not harmful
• rarely invade the tissues around them
• don't spread to other parts of the body
• can be removed, and usually don't grow back

Malignant tumors:
• may be a threat to life
• can invade nearby organs and tissues (such as the chest wall)
• can spread to other parts of the body
• often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND

• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years, and the other for those patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in that work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the right ones are benign 444 (65%) and malignant 239 (35%).
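The removal of instances with missing values (encoded as "?" in the UCI file) can be sketched as below; the sample rows are illustrative, in the file's comma-separated format:

```python
def drop_missing(rows):
    """Keep only the CSV rows that contain no '?' placeholder."""
    return [r for r in rows if "?" not in r]

raw = [
    "1000025,5,1,1,1,2,1,3,1,1,2",
    "1057013,8,4,5,1,2,?,7,3,1,4",   # missing Bare Nuclei -> removed
    "1016277,6,8,8,1,3,4,3,7,1,2",
]
clean = [row.split(",") for row in drop_missing(raw)]
print(len(clean))  # 2 rows survive
```

Applied to the full file, this takes the 699 instances down to the 683 used in the experiments.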

Attribute                   | Domain
Sample Code Number          | id number
Clump Thickness             | 1-10
Uniformity of Cell Size     | 1-10
Uniformity of Cell Shape    | 1-10
Marginal Adhesion           | 1-10
Single Epithelial Cell Size | 1-10
Bare Nuclei                 | 1-10
Bland Chromatin             | 1-10
Normal Nucleoli             | 1-10
Mitoses                     | 1-10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open-source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables: distribution of attribute values (counts per value, 1-10)

Attribute                   | 1   | 2   | 3   | 4  | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness             | 139 | 50  | 104 | 79 | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size     | 373 | 45  | 52  | 38 | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape    | 346 | 58  | 53  | 43 | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion           | 393 | 58  | 58  | 33 | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size | 44  | 376 | 71  | 48 | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                 | 402 | 30  | 28  | 19 | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin             | 150 | 160 | 161 | 39 | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli             | 432 | 36  | 42  | 18 | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                     | 563 | 35  | 33  | 12 | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                         | 2843| 850 | 605 | 333| 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria               | BF Tree | IBK   | SMO
Time to build model (in sec.)     | 0.97    | 0.02  | 0.33
Correctly classified instances    | 652     | 655   | 657
Incorrectly classified instances  | 31      | 28    | 26
Accuracy (%)                      | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
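Applying these definitions to SMO's confusion matrix from the results tables (taking benign as the positive class) reproduces the reported numbers:

```python
# SMO confusion matrix from the results: 431 benign correct, 13 benign
# misclassified, 13 malignant misclassified, 226 malignant correct.
tp, fn = 431, 13   # actual benign
fp, tn = 13, 226   # actual malignant

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3), round(specificity, 3), round(accuracy * 100, 2))
# 0.971 0.946 96.19 -- matching SMO's TP rates and reported accuracy
```

The same three formulas recover the BF Tree and IBK figures from their confusion matrices.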

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.96      | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.98    | 0.079   | 0.958     | 0.98   | Benign
IBK        | 0.921   | 0.02    | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Classified as Benign | Classified as Malignant | Actual Class
BF Tree    | 431                  | 13                      | Benign
BF Tree    | 18                   | 221                     | Malignant
IBK        | 435                  | 9                       | Benign
IBK        | 19                   | 220                     | Malignant
SMO        | 431                  | 13                      | Benign
SMO        | 13                   | 226                     | Malignant

Importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Importance Rank
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on the paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996


[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.


[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you



• Breast cancer tops the list of cancers in Egypt, with 42 cases per 100,000 of the population. However, 80% of the breast cancer cases in Egypt are of the benign kind.

History and Background

Medical prognosis is the estimation of:
• cure
• complication
• disease recurrence
• survival
for a patient or group of patients after treatment.

Breast Cancer Classification

• Round, well-defined, larger groups are more likely benign.
• A tight cluster of tiny, irregularly shaped groups may indicate cancer (malignant).
• Suspicious pixel groups show up as white spots on a mammogram.

Breast cancer's features
• MRI – cancer can have a unique appearance; the features that turned out to indicate cancer are used for diagnosis and prognosis of each cell nucleus.
[figure: feature extraction from a magnetic resonance image into features F1, F2, F3, …, Fn]

Diagnosis or prognosis
Breast cancer is classified as either benign or malignant.

Computer-Aided Diagnosis

• Mammography allows for efficient diagnosis of breast cancers at an earlier stage.
• Radiologists misdiagnose 10–30% of the malignant cases.
• Of the cases sent for surgical biopsy, only 10–20% are actually malignant.

Computational Intelligence
Computational intelligence = data + knowledge. It draws on artificial intelligence: expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, and neural networks.

What do these methods do?
• Provide non-parametric models of data.
• Allow classifying new data into pre-defined categories, supporting diagnosis and prognosis.
• Allow discovering new categories.
• Allow understanding the data by creating fuzzy or crisp logical rules.
• Help visualize multi-dimensional relationships among data samples.

Pattern recognition system decomposition
dataset → data preprocessing → feature selection → selection of data mining tool → classification algorithm (SMO, IBK, BF Tree) → results and evaluation

Performance evaluation cycle: dataset → data preprocessing → feature selection → classification (with the selected data mining tool) → results.

data sets


Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit the data to a model, either descriptive or predictive.

Predictive & descriptive data mining
• Predictive: the process of automatically creating a classification model from a set of examples (the training set), each belonging to one of a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks
[figure: taxonomy of data mining models and tasks]

Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
 – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data;
 – noisy: containing errors or outliers;
 – inconsistent: containing discrepancies in codes or names.
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning – fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration – integration of multiple databases, data cubes, or files.
• Data transformation – normalization and aggregation.
• Data reduction – obtains a representation reduced in volume that produces the same or similar analytical results.
• Data discretization – part of data reduction, of particular importance for numerical data.
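As one concrete instance of the data-transformation step, here is a minimal min-max normalization sketch (illustrative only, not code from the paper):

```python
# Illustrative sketch: min-max normalization, one common form of the
# "data transformation" (normalization) step listed above.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero for constant columns
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Example: attribute values on the 1-10 scale used by the Wisconsin dataset;
# the first value maps to 0.0 and the last to 1.0 (up to float rounding).
print(min_max_normalize([1, 5, 10]))
```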


Feature selection
Finding a feature subset that has the most discriminative information from the original feature space. The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.

Feature Selection

• Transforming a dataset by removing some of its columns:
A1 A2 A3 A4 C → A2 A4 C
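The column-removal view of feature selection above can be sketched in a few lines (illustrative; the column names A1…A4, C are the slide's placeholders):

```python
# Illustrative sketch: feature selection as dropping columns, keeping only
# the selected attributes (A2, A4) plus the class column C.
rows = [
    {"A1": 5, "A2": 1, "A3": 2, "A4": 7, "C": "benign"},
    {"A1": 3, "A2": 9, "A3": 4, "A4": 2, "C": "malignant"},
]
selected = ["A2", "A4", "C"]
reduced = [{k: row[k] for k in selected} for row in rows]
print(reduced[0])  # only A2, A4 and C survive
```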


Supervised learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set with known categories, e.g. category "A" vs. category "B" (classification/recognition = supervised classification).

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
 – Is there a car coming?
 – At what speed?
 – How far is it to the other side?
 – Classification: safe to walk or not.


Classification vs. prediction
• Classification: predicts categorical class labels (discrete or nominal); classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
• Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.


Classification – a two-step process
1. Model construction: describing a set of predetermined classes.
 – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
 – The set of tuples used for model construction is the training set.
 – The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage: classifying future or unknown objects.
 – Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples correctly classified by the model.
 – The test set must be independent of the training set, otherwise over-fitting will occur.
 – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.


Classification process (1): model construction
Training data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no
The classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'


Classification process (2): use the model in prediction
Testing data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes
Unseen data: (Jeff, Professor, 4) → tenured?
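The induced rule from the model-construction slide can be applied to the unseen example as a tiny sketch:

```python
def predict_tenured(rank, years):
    """The rule induced on the training data in the slide:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Applying the model to the unseen example (Jeff, Professor, 4):
print(predict_tenured("Professor", 4))  # -> yes
```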

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set in which all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier
• Quality is calculated with respect to the lowest computing time.
• The quality of a model can be described by its confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

Classification techniques include Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.

Classification model: Support Vector Machine (SVM) classifier (V. Vapnik).

Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications. Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example
[figure: examples plotted by humidity and temperature, labeled "play tennis" vs. "do not play tennis"]

Linear classifiers: which hyperplane?
• There are many possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
 – it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
 – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
[figure: the line ax + by − c = 0 represents the decision boundary]

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane 'far' from the data.

SVM ndash Support Vector Machines

[figure: support vectors define the margin; a large margin is preferred over a small one]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[figure: support vectors sitting on the margin; a narrower margin gives a less robust classifier]
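For intuition, the kind of linear decision function an SVM ultimately produces can be sketched as below. The weights here are hypothetical, chosen by hand for illustration; a trained SVM would instead choose w and b to maximize the margin over the support vectors:

```python
# Minimal sketch of a linear decision function f(x) = w.x + b.
# w and b are assumed/hand-picked here, NOT learned by an SVM.
def linear_classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = (1.0, -1.0), 0.0  # hypothetical separating hyperplane x1 - x2 = 0
print(linear_classify(w, b, (3.0, 1.0)))  # -> 1
print(linear_classify(w, b, (1.0, 3.0)))  # -> -1
```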

Non-separable case
[figure: overlapping classes, handled via the Lagrangian trick]

SVM
• A relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.

Classification model: K-Nearest Neighbor classifier.

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor algorithm
To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set.
2. Select the K examples nearest to E in the training set.
3. Assign E to the most common class among its K nearest neighbors.
[figure: E surrounded by "response" / "no response" neighbors; class = response]

Distance between neighbors
• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as
D(X, Y) = sqrt( Σ i=1..n (xi − yi)² )
• Example: John (age 35, income 95K, 3 credit cards) and Rachel (age 41, income 215K, 2 credit cards):
Distance(John, Rachel) = sqrt[(35 − 41)² + (95K − 215K)² + (3 − 2)²]
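The distance computation above, as a runnable sketch (income expressed in thousands, matching the slide):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# John: (age 35, income 95K, 3 cards); Rachel: (age 41, income 215K, 2 cards)
john, rachel = (35, 95, 3), (41, 215, 2)
print(round(euclidean(john, rachel), 2))  # -> 120.15
```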

Instance-based learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[figure: class = respond]

Example: 3-nearest neighbors
Customer | Age | Income | No. of credit cards | Response
John | 35 | 35K | 3 | No
Rachel | 22 | 50K | 2 | Yes
Hannah | 63 | 200K | 1 | No
Tom | 59 | 170K | 1 | No
Nellie | 25 | 40K | 4 | Yes
David | 37 | 50K | 2 | ?
Distances from David:
John: sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²] ≈ 15.16
Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15
Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] ≈ 152.23
Tom: sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] ≈ 122
Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²] ≈ 15.74
The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes, so David's predicted response is Yes.
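The 3-NN vote above can be reproduced with a short sketch:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote of its k nearest training examples."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# (age, income in K, no. of credit cards) -> response, from the table above
train = [((35, 35, 3), "No"), ((22, 50, 2), "Yes"), ((63, 200, 1), "No"),
         ((59, 170, 1), "No"), ((25, 40, 4), "Yes")]
print(knn_predict(train, (37, 50, 2)))  # -> Yes (votes: Yes, No, Yes)
```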

Strengths and weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (it must calculate and compare the distance from the new example to all other examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and probabilities
• Movie company payouts:
 – small box office: $200,000
 – medium box office: $1,000,000
 – large box office: $3,000,000
• TV network payout:
 – flat rate: $900,000
• Probabilities:
 – P(small box office) = 0.3
 – P(medium box office) = 0.6
 – P(large box office) = 0.1

Jenny Lind – payoff table
Decision | Small box office | Medium box office | Large box office
Sign with movie company | $200,000 | $1,000,000 | $3,000,000
Sign with TV network | $900,000 | $900,000 | $900,000
Prior probabilities | 0.3 | 0.6 | 0.1

Using expected return criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
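The expected-value arithmetic can be checked with a few lines:

```python
def expected_value(payoffs, probs):
    """Expected monetary value of a decision over the states of nature."""
    return sum(p * v for p, v in zip(probs, payoffs))

probs = [0.3, 0.6, 0.1]  # small / medium / large box office
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)
ev_tv = expected_value([900_000, 900_000, 900_000], probs)
print(round(ev_movie), round(ev_tv))  # -> 960000 900000
```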

Decision trees
• Three types of nodes:
 – decision nodes, represented by squares;
 – chance nodes, represented by circles;
 – terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example decision tree
[figure: a decision node with branches Decision 1 and Decision 2, and a chance node with branches Event 1, Event 2, and Event 3]

Jenny Lind decision tree
[figure: a decision node branches to "sign with movie co." and "sign with TV network"; the movie branch leads to a chance node with payoffs $200,000 (small), $1,000,000 (medium), and $3,000,000 (large box office), while the TV branch pays $900,000 in every case]

Jenny Lind decision tree (with probabilities)
[figure: the same tree annotated with branch probabilities 0.3, 0.6, and 0.1 and expected-return (ER) placeholders at each chance node]

Jenny Lind decision tree – solved
[figure: the solved tree; the movie-contract chance node has ER = $960,000, the TV branch ER = $900,000, so the best expected return is $960,000]


Evaluation metrics
 | Predicted as healthy | Predicted as unhealthy
Actual healthy | TP | FN
Actual not healthy | FP | TN

Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
 – split the data into 10 equal-sized pieces;
 – train on 9 pieces and test on the remainder;
 – do this for all possibilities and average.
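The fold-splitting step of the procedure above can be sketched as follows (illustrative; Weka performs this internally). The 150-instance count matches the slide's example:

```python
# Sketch of 10-fold cross-validation splitting: divide the data into 10
# (nearly) equal pieces; each piece serves once as the test set.
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k nearly equal-sized folds."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(150)       # 150 instances, as in the slide's example
print(len(folds), len(folds[0]))  # -> 10 15
```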

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed and usually don't grow back.
Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict survivability for breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

• Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict survivability for heart disease patients.

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and a decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of the multilayer perceptron and sequential minimal optimization.

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
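The missing-value cleaning step described above can be sketched as follows. The two sample rows are illustrative, written in the dataset's column layout (id, 9 attributes, class), where missing values appear as "?":

```python
# Illustrative sketch (not the authors' code): drop instances containing
# missing values, marked "?" in the raw breast-cancer-wisconsin file.
def drop_missing(rows):
    """Keep only rows in which no attribute value is '?'."""
    return [row for row in rows if "?" not in row]

# Tiny illustrative sample: second row has a missing Bare Nuclei value.
sample = [
    ["1000025", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["1057013", "8", "4", "5", "1", "2", "?", "7", "3", "1", "4"],
]
clean = drop_missing(sample)
print(len(clean))  # -> 1
```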

Attribute | Domain
Sample code number | ID number
Clump thickness | 1–10
Uniformity of cell size | 1–10
Uniformity of cell shape | 1–10
Marginal adhesion | 1–10
Single epithelial cell size | 1–10
Bare nuclei | 1–10
Bland chromatin | 1–10
Normal nucleoli | 1–10
Mitoses | 1–10
Class | 2 for benign, 4 for malignant

EVALUATION METHODS
• We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS
[figures: Weka output for the compared classifiers]

Importance of the input variables
Distribution of attribute values (each attribute's domain is 1–10; every row sums to 683):
Domain | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum
Clump thickness | 139 | 50 | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69 | 683
Uniformity of cell size | 373 | 45 | 52 | 38 | 30 | 25 | 19 | 28 | 6 | 67 | 683
Uniformity of cell shape | 346 | 58 | 53 | 43 | 32 | 29 | 30 | 27 | 7 | 58 | 683
Marginal adhesion | 393 | 58 | 58 | 33 | 23 | 21 | 13 | 25 | 4 | 55 | 683
Single epithelial cell size | 44 | 376 | 71 | 48 | 39 | 40 | 11 | 21 | 2 | 31 | 683
Bare nuclei | 402 | 30 | 28 | 19 | 30 | 4 | 8 | 21 | 9 | 132 | 683
Bland chromatin | 150 | 160 | 161 | 39 | 34 | 9 | 71 | 28 | 11 | 20 | 683
Normal nucleoli | 432 | 36 | 42 | 18 | 19 | 22 | 16 | 23 | 15 | 60 | 683
Mitoses | 563 | 35 | 33 | 12 | 6 | 3 | 9 | 8 | 0 | 14 | 683
Sum | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS
Evaluation criteria | BF Tree | IBK | SMO
Time to build model (sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46 | 95.90 | 96.19

EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); and the accuracy is defined by (TP + TN) / (TP + FP + TN + FN), where:
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted as negative.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
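These definitions can be checked directly in code. A minimal sketch (plain Python, not Weka output), using the SMO counts from the confusion matrix reported for this experiment, with benign taken as the positive class:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy
    from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# SMO counts: 431 benign correct, 13 benign missed,
# 13 malignant predicted benign, 226 malignant correct.
sens, spec, acc = confusion_metrics(tp=431, fn=13, fp=13, tn=226)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.971 0.946 0.9619
```

The three values agree with the per-class rates and the 96.19% accuracy reported for SMO.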

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075      0.960     0.971    Benign
              0.925     0.029      0.944     0.925    Malignant
IBK           0.980     0.079      0.958     0.980    Benign
              0.921     0.020      0.961     0.921    Malignant
SMO           0.971     0.054      0.971     0.971    Benign
              0.946     0.029      0.946     0.946    Malignant

EXPERIMENTAL RESULTS
Confusion matrices (rows: actual class, columns: predicted class):

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree            431                  13             Benign
                    18                 221             Malignant
IBK                435                   9             Benign
                    19                 220             Malignant
SMO                431                  13             Benign
                    13                 226             Malignant
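The accuracy figures in the earlier table follow directly from these confusion matrices. A quick sketch (plain Python, counts copied from the matrices) that recomputes the accuracy column:

```python
def accuracy(confusion):
    """Accuracy from a 2x2 confusion matrix [[TP, FN], [FP, TN]]
    (rows: actual class, columns: predicted class)."""
    (tp, fn), (fp, tn) = confusion
    return (tp + tn) / (tp + fn + fp + tn)

# Confusion matrices as reported (benign row first).
matrices = {
    "BF Tree": [[431, 13], [18, 221]],
    "IBK":     [[435, 9],  [19, 220]],
    "SMO":     [[431, 13], [13, 226]],
}
for name, m in matrices.items():
    print(name, round(100 * accuracy(m), 2))
```

The loop reproduces the reported 95.46, 95.90, and 96.19 percent for BF Tree, IBK, and SMO respectively.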

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio    Average     Rank
Clump Thickness                 378.08158       0.464       0.152    126.232526     8
Uniformity of Cell Size         539.79308       0.702       0.300    180.265026     1
Uniformity of Cell Shape        523.07097       0.677       0.272    174.673323     2
Marginal Adhesion               390.0595        0.464       0.210    130.2445       7
Single Epithelial Cell Size     447.86118       0.534       0.233    149.542726     5
Bare Nuclei                     489.00953       0.603       0.303    163.305176     3
Bland Chromatin                 453.20971       0.555       0.201    151.321903     4
Normal Nucleoli                 416.63061       0.487       0.237    139.118203     6
Mitoses                         191.9682        0.212       0.212     64.122733     9
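The Average column is simply the mean of the three selection scores, and the importance rank follows from sorting on it. A quick sketch (plain Python, scores copied from the table) confirming that Uniformity of Cell Size comes out on top:

```python
# (chi-squared, info gain, gain ratio) per attribute, as tabulated
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.0595,  0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.9682,  0.212, 0.212),
}
average = {attr: sum(vals) / 3 for attr, vals in scores.items()}
ranking = sorted(average, key=average.get, reverse=True)  # best first
print(ranking[0])   # Uniformity of Cell Size
print(ranking[-1])  # Mitoses
```

Averaging raw chi-squared values with the two entropy-based scores weights the chi-squared term heavily; that is the scheme implied by the table, not a general recommendation.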


CONCLUSION
The accuracy of the classification techniques was evaluated for each selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree. SMO shows the highest performance compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
Using an updated version of Weka; using another data mining tool; using alternative algorithms and techniques.

Notes on the paper
Spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.

Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea, making a fusion between classifiers.

References


[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on Three Different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


Page 3: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng3

AGENDA (Cont) Paper contents1 Introduction2 Related Work3 Classification Techniques4 Experiments and Results5 Conclusion6 References

04072023

What Is Cancer Cancer is a term used for diseases in which

abnormal cells divide without control and are able to invade other tissues Cancer cells can spread to other parts of the body through the blood and lymph systems

Cancer is not just one disease but many diseases There are more than 100 different types of cancer

Most cancers are named for the organ or type of cell in which they start

There are two general types of cancer tumours namelybull benignbull malignant

4 AAST-Comp eng

Skin cancer

Breast cancerColon cancer

Lung cancer

Pancreatic cancer

Liver cancer

Bladder cancer

Prostate Cancer

Kidney cancerThyroid Cancer

Leukemia Cancer

Edometrial Cancer

Rectal Cancer

Non-Hodgkin LymphomaCervical cancer

Thyroid Cancer

Oral cancer

AAST-Comp eng 504072023

Breast Cancer

6

bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer

bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually

bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind

AAST-Comp eng04072023

History and Background

Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment

7AAST-Comp eng04072023

Breast Cancer Classification

8AAST-Comp eng

Round well-defined larger groups are more likely benign

Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant

Suspicious pixels groups show up as white spots on a mammogram

04072023

Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash

features that turned out to be cancer used for diagnosis prognosis of each cell nucleus

9AAST-Comp eng

F2Magnetic Resonance Image

F1

F3

Fn

Feature

Extraction

04072023

Diagnosis or prognosis

Brest CancerBenign

Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage

bull Radiologists misdiagnose 10-30 of the malignant cases

bull Of the cases sent for surgical biopsy only 10-20 are actually malignant

Computational Intelligence

Computational IntelligenceData + Knowledge

Artificial Intelligence

Expert systems

Fuzzylogic

PatternRecognition

Machinelearning

Probabilistic methods

Multivariatestatistics

Visuali-zation

Evolutionaryalgorithms

Neuralnetworks

04072023 AAST-Comp eng 12

What do these methods do

bull Provide non-parametric models of databull Allow to classify new data to pre-defined

categories supporting diagnosis amp prognosis

bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy

or crisp logical rulesbull Help to visualize multi-dimensional

relationships among data samples 04072023 AAST-Comp eng 13

14

Feature selection

Data Preprocessing

Selecting Data mining tool dataset

Classification algorithm

SMO IBK BF TREE

Results and evaluationsAAST-Comp eng

Pattern recognition system decomposition

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

bull Data Mining is set of techniques used in various domains to give meaning to the available data

bull Objective Fit data to a modelndashDescriptivendashPredictive

04072023

Predictive amp descriptive data mining

bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples

bull Descriptive Is to describe the general or special features of a set of data in a concise manner

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for

data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code

bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes

bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization


BACKGROUND
Kaewchinporn C.'s group presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.

B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.


BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository; the data come from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).


Attribute                    Domain
Sample Code Number           Id number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant
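The cleaning step described in the summary (dropping records whose missing values are encoded as "?" in the breast-cancer-wisconsin.data format, and decoding the 2/4 class codes) can be sketched in Python. This is an illustration, not the deck's actual Weka workflow, and the sample rows below are stand-ins for the real 699-instance file:

```python
# Records follow the breast-cancer-wisconsin.data layout:
# id, 9 attributes in 1-10, class code (2 = benign, 4 = malignant).
# Missing values appear as "?". Sample rows are illustrative only.
raw_records = [
    "1000025,5,1,1,1,2,1,3,1,1,2",   # benign
    "1002945,5,4,4,5,7,10,3,2,1,2",  # benign
    "1057013,8,4,5,1,2,?,7,3,1,4",   # malignant, Bare Nuclei missing
    "1018561,2,1,2,1,2,1,3,1,1,2",   # benign
]

CLASS_NAMES = {"2": "benign", "4": "malignant"}

def clean(records):
    """Drop records containing '?' and map the class code to a label."""
    cleaned = []
    for line in records:
        fields = line.split(",")
        if "?" in fields:  # mimics removing the 16 incomplete instances
            continue
        *values, cls = fields
        cleaned.append((values[0], [int(v) for v in values[1:]], CLASS_NAMES[cls]))
    return cleaned

dataset = clean(raw_records)
print(len(dataset))  # 3 of the 4 sample rows survive
```

On the real file the same filter removes exactly the 16 instances mentioned above, leaving 683.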


EVALUATION METHODS
• We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables


Attribute                    1     2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50   104  79   128  33   23   44   14  69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28   6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25   4   55   683
Single Epithelial Cell Size  44    376  71   48   39   40   11   21   2   31   683
Bare Nuclei                  402   30   28   19   30   4    8    21   9   132  683
Bland Chromatin              150   160  161  39   34   9    71   28   11  20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15  60   683
Mitoses                      563   35   33   12   6    3    9    8    0   14   683
Sum                          2843  850  605  333  346  192  207  233  77  516

EXPERIMENTAL RESULTS


Evaluation Criteria               BF Tree  IBK    SMO
Time to build model (sec)         0.97     0.02   0.33
Correctly classified instances    652      655    657
Incorrectly classified instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
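These definitions translate directly into code. As a sanity check, the sketch below (Python, for illustration; the deck itself used Weka) plugs in the BF Tree confusion matrix from the results slides, treating benign as the positive class (TP = 431, FN = 13, FP = 18, TN = 221), and reproduces the reported 0.971 sensitivity and 95.46% accuracy:

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

# BF Tree confusion matrix, benign taken as the positive class.
tp, fn, fp, tn = 431, 13, 18, 221
print(round(sensitivity(tp, fn), 3))             # 0.971
print(round(specificity(tn, fp), 3))             # 0.925
print(round(100 * accuracy(tp, tn, fp, fn), 2))  # 95.46
```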

EXPERIMENTAL RESULTS

Classifier  Class      TP Rate  FP Rate  Precision  Recall
BF Tree     Benign     0.971    0.075    0.960      0.971
            Malignant  0.925    0.029    0.944      0.925
IBK         Benign     0.980    0.079    0.958      0.980
            Malignant  0.921    0.020    0.961      0.921
SMO         Benign     0.971    0.054    0.971      0.971
            Malignant  0.946    0.029    0.946      0.946


EXPERIMENTAL RESULTS

Classifier  Predicted Benign  Predicted Malignant  Actual Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant
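As a consistency check, the accuracies in the earlier comparison table follow directly from these confusion matrices (683 instances in every case). A small sketch, again in Python rather than Weka:

```python
# Confusion matrices from the table above: rows are actual (benign, malignant),
# columns are predicted (benign, malignant).
confusion = {
    "BF Tree": [[431, 13], [18, 221]],
    "IBK":     [[435, 9],  [19, 220]],
    "SMO":     [[431, 13], [13, 226]],
}

for name, ((tp, fn), (fp, tn)) in confusion.items():
    correct = tp + tn                       # diagonal = correctly classified
    total = tp + fn + fp + tn               # 683 for every classifier
    print(name, correct, round(100 * correct / total, 2))
# BF Tree 652 95.46
# IBK 655 95.9
# SMO 657 96.19
```

This matches the reported 652/655/657 correctly classified instances and the 95.46% / 95.90% / 96.19% accuracies.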


importance of the input variables


Variable                     Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance
Clump Thickness              378.08158    0.464      0.152       126.232527    8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026    1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323    2
Marginal Adhesion            390.05950    0.464      0.210       130.244500    7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
Bland Chromatin              453.20971    0.555      0.201       151.321903    4
Normal Nucleoli              416.63061    0.487      0.237       139.118203    6
Mitoses                      191.96820    0.212      0.212       64.122733     9
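The "Average Rank" column is the mean of the three scores for each attribute, and the importance ordering follows by sorting on it. A minimal sketch, with the scores copied from the table above (assumed exact to the precision shown):

```python
# (Chi-squared, Info Gain, Gain Ratio) per attribute, from the table above.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

# Average of the three criteria, then rank attributes from highest to lowest.
average = {name: sum(vals) / len(vals) for name, vals in scores.items()}
ranking = sorted(average, key=average.get, reverse=True)

print(ranking[0])   # most important attribute
print(ranking[-1])  # least important attribute
```

Sorting reproduces the Importance column, with Uniformity of Cell Size first, which is the attribute the conclusion singles out.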


CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques


Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions


Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.


References


[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.


[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.


[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.



[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


Page 4: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023

What Is Cancer Cancer is a term used for diseases in which

abnormal cells divide without control and are able to invade other tissues Cancer cells can spread to other parts of the body through the blood and lymph systems

Cancer is not just one disease but many diseases There are more than 100 different types of cancer

Most cancers are named for the organ or type of cell in which they start

There are two general types of cancer tumours namelybull benignbull malignant

4 AAST-Comp eng

Skin cancer

Breast cancerColon cancer

Lung cancer

Pancreatic cancer

Liver cancer

Bladder cancer

Prostate Cancer

Kidney cancerThyroid Cancer

Leukemia Cancer

Edometrial Cancer

Rectal Cancer

Non-Hodgkin LymphomaCervical cancer

Thyroid Cancer

Oral cancer

AAST-Comp eng 504072023

Breast Cancer

6

bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer

bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually

bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind

AAST-Comp eng04072023

History and Background

Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment

7AAST-Comp eng04072023

Breast Cancer Classification

8AAST-Comp eng

Round well-defined larger groups are more likely benign

Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant

Suspicious pixels groups show up as white spots on a mammogram

04072023

Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash

features that turned out to be cancer used for diagnosis prognosis of each cell nucleus

9AAST-Comp eng

F2Magnetic Resonance Image

F1

F3

Fn

Feature

Extraction

04072023

Diagnosis or prognosis

Brest CancerBenign

Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage

bull Radiologists misdiagnose 10-30 of the malignant cases

bull Of the cases sent for surgical biopsy only 10-20 are actually malignant

Computational Intelligence

Computational IntelligenceData + Knowledge

Artificial Intelligence

Expert systems

Fuzzylogic

PatternRecognition

Machinelearning

Probabilistic methods

Multivariatestatistics

Visuali-zation

Evolutionaryalgorithms

Neuralnetworks

04072023 AAST-Comp eng 12

What do these methods do

bull Provide non-parametric models of databull Allow to classify new data to pre-defined

categories supporting diagnosis amp prognosis

bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy

or crisp logical rulesbull Help to visualize multi-dimensional

relationships among data samples 04072023 AAST-Comp eng 13

14

Feature selection

Data Preprocessing

Selecting Data mining tool dataset

Classification algorithm

SMO IBK BF TREE

Results and evaluationsAAST-Comp eng

Pattern recognition system decomposition

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

bull Data Mining is set of techniques used in various domains to give meaning to the available data

bull Objective Fit data to a modelndashDescriptivendashPredictive

04072023

Predictive amp descriptive data mining

bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples

bull Descriptive Is to describe the general or special features of a set of data in a concise manner

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for

data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code

bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes

bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  - Are usually not harmful
  - Rarely invade the tissues around them
  - Don't spread to other parts of the body
  - Can be removed and usually don't grow back
• Malignant tumors:
  - May be a threat to life
  - Can invade nearby organs and tissues (such as the chest wall)
  - Can spread to other parts of the body
  - Often can be removed but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (2)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND

• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, the percentages above no longer apply; the corrected distribution is benign 444 (65%) and malignant 239 (35%).
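The corrected class percentages are easy to verify from the counts given in the slides (an illustrative check, not part of the original deck):

```python
# Original 699 instances: 458 benign, 241 malignant.
# Removing the 16 instances with missing values drops 14 benign and 2 malignant.
benign, malignant = 458 - 14, 241 - 2
total = benign + malignant  # 683 instances remain

print(f"benign: {benign} ({benign / total:.1%})")           # 444 (65.0%)
print(f"malignant: {malignant} ({malignant / total:.1%})")  # 239 (35.0%)
```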


Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS

• We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Figures: charts of the experimental results.]

importance of the input variables

Domain                         1    2    3   4   5   6   7   8   9   10  Sum
Clump Thickness              139   50  104  79 128  33  23  44  14   69  683
Uniformity of Cell Size      373   45   52  38  30  25  19  28   6   67  683
Uniformity of Cell Shape     346   58   53  43  32  29  30  27   7   58  683
Marginal Adhesion            393   58   58  33  23  21  13  25   4   55  683
Single Epithelial Cell Size   44  376   71  48  39  40  11  21   2   31  683
Bare Nuclei                  402   30   28  19  30   4   8  21   9  132  683
Bland Chromatin              150  160  161  39  34   9  71  28  11   20  683
Normal Nucleoli              432   36   42  18  19  22  16  23  15   60  683
Mitoses                      563   35   33  12   6   3   9   8   0   14  683
Sum                         2843  850  605 333 346 192 207 233  77  516

EXPERIMENTAL RESULTS


Evaluation Criteria                    BF Tree    IBK     SMO
Time to build model (in sec)             0.97     0.02    0.33
Correctly classified instances            652      655     657
Incorrectly classified instances           31       28      26
Accuracy (%)                            95.46    95.90   96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
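Applying these definitions to the SMO confusion matrix reported below (TP = 431, FN = 13, FP = 13, TN = 226, with benign taken as the positive class) reproduces the reported 96.19% accuracy. This Python sketch is illustrative, not the Weka output:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR), and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO on the 683-instance dataset, benign treated as the positive class.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.4f}")
# sensitivity=0.971 specificity=0.946 accuracy=0.9619
```

The sensitivity and specificity match the per-class TP rates (0.971 benign, 0.946 malignant) in the table below.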

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant

importance of the input variables


Variable                      Chi-squared   Info Gain   Gain Ratio   Average       Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526    8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323    2
Marginal Adhesion             390.05950     0.464       0.210        130.244500    7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726    5
Bare Nuclei                   489.00953     0.603       0.303        163.305176    3
Bland Chromatin               453.20971     0.555       0.201        151.321903    4
Normal Nucleoli               416.63061     0.487       0.237        139.118203    6
Mitoses                       191.96820     0.212       0.212        64.122733     9
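The average column is simply the mean of the three scores for each attribute, and the importance rank orders attributes by that mean. A short Python sketch (scores as given in the slide's table) reproduces the ordering:

```python
# (chi-squared, info gain, gain ratio) per attribute, as given in the table.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

average = {name: sum(vals) / 3 for name, vals in scores.items()}
ranked = sorted(average, key=average.get, reverse=True)
for rank, name in enumerate(ranked, start=1):
    print(rank, name, round(average[name], 6))
# Uniformity of Cell Size comes out first and Mitoses last, as in the table.
```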


CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka
• Use another data mining tool
• Use alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (ISSN 2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, fusing multiple classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan, and L.V. Nandakishore (2011). "Knowledge based analysis of various statistical tools in detecting breast cancer."
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods." International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Set." 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. "Transductive inference for text classification using support vector machines." Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates." Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. "Nuclear feature extraction for breast tumor diagnosis." Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree." Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 5: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Skin cancer

Breast cancerColon cancer

Lung cancer

Pancreatic cancer

Liver cancer

Bladder cancer

Prostate Cancer

Kidney cancerThyroid Cancer

Leukemia Cancer

Edometrial Cancer

Rectal Cancer

Non-Hodgkin LymphomaCervical cancer

Thyroid Cancer

Oral cancer

AAST-Comp eng 504072023

Breast Cancer

6

bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer

bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually

bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind

AAST-Comp eng04072023

History and Background

Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment

7AAST-Comp eng04072023

Breast Cancer Classification

8AAST-Comp eng

Round well-defined larger groups are more likely benign

Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant

Suspicious pixels groups show up as white spots on a mammogram

04072023

Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash

features that turned out to be cancer used for diagnosis prognosis of each cell nucleus

9AAST-Comp eng

F2Magnetic Resonance Image

F1

F3

Fn

Feature

Extraction

04072023

Diagnosis or prognosis

Brest CancerBenign

Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage

bull Radiologists misdiagnose 10-30 of the malignant cases

bull Of the cases sent for surgical biopsy only 10-20 are actually malignant

Computational Intelligence

Computational IntelligenceData + Knowledge

Artificial Intelligence

Expert systems

Fuzzylogic

PatternRecognition

Machinelearning

Probabilistic methods

Multivariatestatistics

Visuali-zation

Evolutionaryalgorithms

Neuralnetworks

04072023 AAST-Comp eng 12

What do these methods do

bull Provide non-parametric models of databull Allow to classify new data to pre-defined

categories supporting diagnosis amp prognosis

bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy

or crisp logical rulesbull Help to visualize multi-dimensional

relationships among data samples 04072023 AAST-Comp eng 13

14

Feature selection

Data Preprocessing

Selecting Data mining tool dataset

Classification algorithm

SMO IBK BF TREE

Results and evaluationsAAST-Comp eng

Pattern recognition system decomposition

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

bull Data Mining is set of techniques used in various domains to give meaning to the available data

bull Objective Fit data to a modelndashDescriptivendashPredictive

04072023

Predictive amp descriptive data mining

bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples

bull Descriptive Is to describe the general or special features of a set of data in a concise manner

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for

data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code

bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes

bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
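The expected-return calculation above is easy to sketch in code; the payoffs and probabilities are the slide's figures, and the dictionary/function names are my own.

```python
# Expected monetary value of each of Jenny Lind's decisions.
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium":   900_000, "large":   900_000},
}
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}

def expected_value(decision):
    # Weight each state-of-nature payoff by its prior probability and sum.
    return sum(probs[s] * payoffs[decision][s] for s in probs)

best = max(payoffs, key=expected_value)
print(expected_value("movie"), expected_value("tv"), best)
# 960000.0 900000.0 movie
```

This is the same "prune all but the best decision" step that solving the decision tree performs at the root node.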

04072023 AAST-Comp eng 65

Decision Trees
• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

(Figure: a generic decision tree - a square decision node with branches Decision 1 and Decision 2, and a circular chance node with branches Event 1, Event 2 and Event 3.)

Jenny Lind Decision Tree

(Figure: a decision node branching to "Sign with Movie Co." and "Sign with TV Network"; the movie chance node leads to Small/Medium/Large box office payoffs of $200,000, $1,000,000 and $3,000,000, while the TV branch pays $900,000 in every case.)

Jenny Lind Decision Tree (with probabilities)

(Figure: the same tree annotated with the outcome probabilities 0.3, 0.6 and 0.1 on each chance branch, and an expected-return (ER) label at each chance node.)

Jenny Lind Decision Tree - Solved

(Figure: the solved tree - the expected return at the movie chance node is $960,000 and at the TV branch $900,000, so the TV branch is pruned and the movie contract is chosen, ER = $960,000.)

Results

(Figure: performance evaluation cycle - dataset -> data preprocessing -> feature selection -> data mining tool selection -> classification -> performance evaluation.)

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation
• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
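The three steps above can be sketched in a few lines of stdlib Python. This is an illustrative assumption, not the paper's Weka setup: the fold-splitting scheme and the toy majority-class "model" are mine, chosen only to make the loop runnable.

```python
import random

def cross_validate(examples, train_fn, predict_fn, k=10, seed=0):
    # Shuffle, split into k roughly equal folds, train on k-1 and test on the rest.
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        test_fold = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_set)
        for x, label in test_fold:
            correct += predict_fn(model, x) == label
            total += 1
    return correct / total  # accuracy averaged over all k folds

# Toy demonstration: a majority-class classifier on a made-up labelled set.
data = [((i,), "benign" if i % 3 else "malignant") for i in range(150)]

def train_majority(train_set):
    labels = [label for _, label in train_set]
    return max(set(labels), key=labels.count)

acc = cross_validate(data, train_majority, lambda model, x: model)
print(f"{acc:.2%}")  # 66.67% - the majority class covers 100 of the 150 examples
```

In the paper itself this loop is handled internally by Weka's default 10-fold evaluation.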

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the results are reported.
• Sequential Minimal Optimization (SMO) shows higher prediction accuracy than the IBK and BF Tree methods.

0407202376

Introduction
Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

04072023AAST-Comp eng

Risk factors (continued)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into only two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiocography1 and cardiocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository: data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
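The cleaning step described above (dropping the 16 records whose "Bare Nuclei" field is the missing-value marker "?") can be sketched as follows. The column layout follows the UCI breast-cancer-wisconsin.data file, but the four rows below are a hand-made illustrative sample, not the real dataset.

```python
# Each row: id, 9 attribute values, class ("2" = benign, "4" = malignant).
rows = [
    ["1000001", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["1000002", "5", "4", "4", "5", "7", "10", "3", "2", "1", "2"],
    ["1000003", "8", "4", "5", "1", "2", "?", "7", "3", "1", "4"],  # has a missing value
    ["1000004", "8", "7", "5", "10", "7", "9", "5", "5", "4", "4"],
]

clean = [r for r in rows if "?" not in r]     # on the full file, 683 of 699 survive
benign = sum(r[-1] == "2" for r in clean)     # class 2 = benign
malignant = sum(r[-1] == "4" for r in clean)  # class 4 = malignant
print(len(clean), benign, malignant)  # 3 2 1
```

On the real file this yields the corrected distribution quoted above: 444 benign and 239 malignant out of 683.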

Attribute                   | Domain
Sample Code Number          | Id Number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for Benign, 4 for Malignant

0407202387

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

(Figures: slides 88-89 present the experimental results as charts.)

Importance of the input variables (value frequencies per attribute)

Attribute                   |    1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             |  139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     |  373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    |  346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           |  393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |   44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 |  402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             |  150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             |  432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     |  563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (sec)        | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
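Plugging the BF Tree confusion matrix from the slides (431/13/18/221, treating "benign" as the positive class) into the definitions above reproduces the reported figures:

```python
# Counts from the BF Tree confusion matrix, with "benign" as the positive class.
tp, fn = 431, 13   # benign correctly / wrongly predicted
fp, tn = 18, 221   # malignant wrongly predicted as benign / correctly predicted

sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.4f}")
# 0.971 0.925 0.9546
```

The 0.971 and 0.925 match the per-class recall values in the precision/recall table, and 0.9546 matches the 95.46% accuracy reported for BF Tree.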

EXPERIMENTAL RESULTS

Classifier | TP    | FP    | Precision | Recall | Class
BF Tree    | 0.971 | 0.075 | 0.960     | 0.971  | Benign
           | 0.925 | 0.029 | 0.944     | 0.925  | Malignant
IBK        | 0.980 | 0.079 | 0.958     | 0.980  | Benign
           | 0.921 | 0.020 | 0.961     | 0.921  | Malignant
SMO        | 0.971 | 0.054 | 0.971     | 0.971  | Benign
           | 0.946 | 0.029 | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Predicted Benign | Predicted Malignant | Actual class
BF Tree    | 431              | 13                  | Benign
           | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
           | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
           | 13               | 226                 | Malignant

Importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Rank
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.05950   | 0.464     | 0.210      | 130.244500 | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.96820   | 0.212     | 0.212      | 64.122733  | 9
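One of the ranking criteria in the table above, information gain, can be computed directly from its definition. This is a minimal stdlib sketch on a tiny made-up split, not Weka's implementation; the attribute and row names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p_i * log2(p_i) over the class proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label):
    # Entropy of the labels minus the weighted entropy after splitting on attr.
    labels = [r[label] for r in rows]
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [r[label] for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

# Tiny illustration: an attribute that separates the classes perfectly
# has maximal information gain (all uncertainty removed).
rows = [{"size": 1, "cls": "benign"}, {"size": 1, "cls": "benign"},
        {"size": 10, "cls": "malignant"}, {"size": 10, "cls": "malignant"}]
print(info_gain(rows, "size", "cls"))  # 1.0
```

Applied per attribute over the full dataset, this is the quantity in the "Info Gain" column; Uniformity of Cell Size scores highest (0.702), consistent with its rank of 1.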

0407202397

CONCLUSION
• The accuracy of the classification techniques was evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

0407202398

Future work
• Use an updated version of Weka
• Use another data mining tool
• Use alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 6: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Breast Cancer

6

bull The second leading cause of death among women is breast cancer as it comes directly after lung cancer

bull Breast cancer considered the most common invasive cancer in women with more than one million cases and nearly 600000 deaths occurring worldwide annually

bull Breast cancer comes in the top of cancer list in Egypt by 42 cases per 100 thousand of the population However 80 of the cases of breast cancer in Egypt are of the benign kind

AAST-Comp eng04072023

History and Background

Medical Prognosis is the estimation of bull Curebull Complicationbull disease recurrencebull Survival for a patient or group of patients after treatment

7AAST-Comp eng04072023

Breast Cancer Classification

8AAST-Comp eng

Round well-defined larger groups are more likely benign

Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant

Suspicious pixels groups show up as white spots on a mammogram

04072023

Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash

features that turned out to be cancer used for diagnosis prognosis of each cell nucleus

9AAST-Comp eng

F2Magnetic Resonance Image

F1

F3

Fn

Feature

Extraction

04072023

Diagnosis or prognosis

Brest CancerBenign

Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage

bull Radiologists misdiagnose 10-30 of the malignant cases

bull Of the cases sent for surgical biopsy only 10-20 are actually malignant

Computational Intelligence

Computational IntelligenceData + Knowledge

Artificial Intelligence

Expert systems

Fuzzylogic

PatternRecognition

Machinelearning

Probabilistic methods

Multivariatestatistics

Visuali-zation

Evolutionaryalgorithms

Neuralnetworks

04072023 AAST-Comp eng 12

What do these methods do

bull Provide non-parametric models of databull Allow to classify new data to pre-defined

categories supporting diagnosis amp prognosis

bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy

or crisp logical rulesbull Help to visualize multi-dimensional

relationships among data samples 04072023 AAST-Comp eng 13

14

Feature selection

Data Preprocessing

Selecting Data mining tool dataset

Classification algorithm

SMO IBK BF TREE

Results and evaluationsAAST-Comp eng

Pattern recognition system decomposition

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

bull Data Mining is set of techniques used in various domains to give meaning to the available data

bull Objective Fit data to a modelndashDescriptivendashPredictive

04072023

Predictive amp descriptive data mining

bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples

bull Descriptive Is to describe the general or special features of a set of data in a concise manner

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for

data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code

bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes

bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
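The three steps above can be sketched directly in code. This is a minimal illustration of the algorithm as described, not the Weka IBK implementation:

```python
import math
from collections import Counter

def knn_classify(new_example, training_set, k):
    """Classify new_example by majority vote among its k nearest neighbors.

    training_set is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    # Step 1-2: sort the training set by distance to E and keep the k nearest.
    nearest = sorted(training_set, key=lambda pair: math.dist(new_example, pair[0]))[:k]
    # Step 3: majority vote among the k nearest labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```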

(Figure: points labeled "response" / "no response"; the new example is assigned to class "response", the majority class among its nearest neighbors.)

Distance Between Neighbors

Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2]

04072023 AAST-Comp eng 55
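The John/Rachel distance above can be computed directly (income expressed in $K, as in the slide):

```python
import math

# John and Rachel as (age, income in $K, number of credit cards)
john = (35, 95, 3)
rachel = (41, 215, 2)

distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(john, rachel)))
# sqrt((-6)^2 + (-120)^2 + 1^2) = sqrt(14437), roughly 120.15
```

Note that the income difference dominates the result, which is why distance-based methods usually normalize attributes first.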

Instance-Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: stored instances labeled "response" / "no response"; the new instance is classified as "respond".)

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74

David's 3 nearest neighbors are Rachel, John, and Nellie, so his predicted response is Yes.
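The 3-NN worked example above can be reproduced end to end: compute the five distances, take the three nearest, and vote.

```python
import math
from collections import Counter

# (age, income in $K, number of credit cards) and the known response
customers = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

dist = {name: math.dist(david, feats) for name, (feats, _) in customers.items()}
nearest3 = sorted(dist, key=dist.get)[:3]          # Rachel, John, Nellie
prediction = Counter(customers[n][1] for n in nearest3).most_common(1)[0][0]
```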

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all the examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions                  States of Nature
                           Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    $200,000           $1,000,000          $3,000,000
Sign with TV Network       $900,000           $900,000            $900,000
Prior probabilities        0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
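The expected-return calculation above is a one-line weighted sum:

```python
# Payoffs and probabilities from the Jenny Lind example
payoffs = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}

ev_movie = sum(probs[s] * payoffs[s] for s in probs)   # expected return of the movie deal
ev_tv = 900_000                                        # flat rate, same in every state
decision = "movie" if ev_movie > ev_tv else "tv"
```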

Decision Trees

• Three types of "nodes":
  – Decision nodes – represented by squares
  – Chance nodes – represented by circles
  – Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

(Figure: a decision node branching into Decision 1 and Decision 2, with a chance node branching into Events 1, 2, and 3.)

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

(Figure: "Sign with Movie Co." leads to a chance node with payoffs $200,000 / $1,000,000 / $3,000,000 for small / medium / large box office; "Sign with TV Network" pays $900,000 in every state.)

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

(Figure: the same tree with the branch probabilities 0.3, 0.6, and 0.1 marked on the chance nodes, and the expected returns (ER) still to be computed.)

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

(Figure: the solved tree; the movie chance node has ER = $960,000, the TV branch has ER = $900,000, so the decision node takes the movie contract with ER = $960,000.)

04072023 AAST-Comp eng 70

Results

(Figure: the performance-evaluation cycle – dataset → data preprocessing → feature selection → classification with the selected data mining tool → performance evaluation.)

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

AAST-Comp eng 7204072023

Cross-validation

• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces.
  – Train on 9 pieces and test on the remainder.
  – Do this for all possibilities and average.

04072023 AAST-Comp eng 73
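The 10-fold splitting described above can be sketched as plain index bookkeeping (a minimal sketch, without the stratification Weka applies by default):

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k folds and yield (train, test) index lists."""
    indices = list(range(n))
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)   # spread any remainder over the first folds
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]                   # train on 9 folds, test on the remainder
```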

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed and usually don't grow back.

Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed, but sometimes grow back.

AAST-Comp eng

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

78

0407202379

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND

• Bellaachia et al. used naïve Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years, and the other for those patients who died before 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in their work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistics classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages apply to the full dataset; after removal, the distribution is benign 444 (65%) and malignant 239 (35%).

04072023AAST-Comp eng85
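The missing-value removal described above can be sketched as a simple filter. In the UCI file format, a "?" marks a missing attribute value; the rows below are illustrative stand-ins, not the actual dataset:

```python
# Rows in the comma-separated breast-cancer-wisconsin.data format
# (values illustrative); "?" marks a missing attribute value.
rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",
    "1057013,8,4,5,1,2,?,7,3,1,4",     # missing value -> excluded
    "1096800,6,6,6,9,6,?,7,8,1,2",     # missing value -> excluded
    "1017122,8,10,10,8,7,10,9,7,1,4",
]
clean = [r.split(",") for r in rows if "?" not in r]
```

Applying the same filter to the full 699-instance file would leave the 683 complete instances used in the paper.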

04072023 AAST-Comp eng 86

Attribute                    Domain
Sample Code Number           ID number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant

0407202387

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

(Figures: Weka output screenshots for the experiments.)

Importance of the input variables

Value counts per attribute (domain values 1-10):

Attribute                    1     2    3    4    5    6    7    8    9    10   Sum
Clump Thickness              139   50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7    58   683
Marginal Adhesion            393   58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size  44    376  71   48   39   40   11   21   2    31   683
Bare Nuclei                  402   30   28   19   30   4    8    21   9    132  683
Bland Chromatin              150   160  161  39   34   9    71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12   6    3    9    8    0    14   683
Sum                          2843  850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree  IBK    SMO
Time to build model (in sec)       0.97     0.02   0.33
Correctly classified instances     652      655    657
Incorrectly classified instances   31       28     26
Accuracy (%)                       95.46    95.90  96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
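The definitions above can be checked against the paper's own numbers. Reading the SMO confusion matrix with benign as the positive class gives TP = 431, FN = 13, FP = 13, TN = 226:

```python
# Counts read off the paper's SMO confusion matrix (benign as the positive class)
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                  # TP / (TP + FN)
specificity = tn / (tn + fp)                  # TN / (TN + FP)
accuracy = (tp + tn) / (tp + fp + tn + fn)    # reproduces the 96.19% reported for SMO
```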

EXPERIMENTAL RESULTS

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.96       0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.98   0.079  0.958      0.98    Benign
            0.921  0.02   0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows = actual class):

Classifier  Benign  Malignant  Class
BF Tree     431     13         Benign
            18      221        Malignant
IBK         435     9          Benign
            19      220        Malignant
SMO         431     13         Benign
            13      226        Malignant

94 04072023AAST-Comp eng
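The precision and recall figures in the earlier table follow directly from these confusion-matrix counts. For BF Tree with benign as the positive class:

```python
# BF Tree confusion matrix counts, benign as the positive class:
# 431 benign predicted benign, 13 benign predicted malignant,
# 18 malignant predicted benign.
tp, fn, fp = 431, 13, 18

precision = tp / (tp + fp)   # 431/449, about 0.96  (matches the table)
recall = tp / (tp + fn)      # 431/444, about 0.971 (matches the table)
```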

Importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Importance (rank)
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.3         180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.0595     0.464      0.21        130.2445    7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.9682     0.212      0.212       64.122733   9

04072023AAST-Comp eng96

0407202397

CONCLUSION

• The accuracy of the classification techniques was evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

AAST-Comp eng

Notes on the paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

04072023AAST-Comp eng99

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).

[2] S. Aruna, Dr. S.P. Rajagopalan, and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

AAST-Comp eng 102

[5] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.

04072023

[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023

04072023105

Thank you

AAST-Comp eng

Page 7: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

History and Background

Medical prognosis is the estimation of:
• Cure
• Complication
• Disease recurrence
• Survival
for a patient or group of patients after treatment.

7AAST-Comp eng04072023

Breast Cancer Classification

• Round, well-defined, larger groups are more likely benign.
• A tight cluster of tiny, irregularly shaped groups may indicate cancer (malignant).
• Suspicious pixel groups show up as white spots on a mammogram.

04072023

Breast cancer's features

• MRI – cancer can have a unique appearance; features that turned out to be cancer are used for diagnosis and prognosis of each cell nucleus.

(Figure: features F1, F2, F3, …, Fn are extracted from the magnetic resonance image.)

04072023

Diagnosis or prognosis

(Figure: breast cancer is classified as benign or malignant.)

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

• Mammography allows for efficient diagnosis of breast cancers at an earlier stage.
• Radiologists misdiagnose 10-30% of the malignant cases.
• Of the cases sent for surgical biopsy, only 10-20% are actually malignant.

Computational Intelligence

(Figure: computational intelligence = data + knowledge, drawing on artificial intelligence, expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, and neural networks.)

04072023 AAST-Comp eng 12

What do these methods do?

• Provide non-parametric models of data.
• Allow to classify new data into pre-defined categories, supporting diagnosis and prognosis.
• Allow to discover new categories.
• Allow to understand the data, creating fuzzy or crisp logical rules.
• Help to visualize multi-dimensional relationships among data samples.

Pattern recognition system decomposition

(Figure: dataset → data preprocessing → feature selection → selection of the data mining tool → classification algorithm (SMO, IBK, BF Tree) → results and evaluations.)

04072023


data sets

AAST-Comp eng 1604072023


AAST-Comp eng 18

Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit the data to a model.
  – Descriptive
  – Predictive

04072023

Predictive and descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining tools

Many advanced tools for data mining are available, either as open-source or as commercial software.

21AAST-Comp eng04072023

Weka

• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

04072023 AAST-Comp eng 22


Data Preprocessing

• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data;
  – noisy: containing errors or outliers;
  – inconsistent: containing discrepancies in codes or names.
• Quality decisions must be based on quality data; measures: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

AAST-Comp eng 2404072023

Preprocessing techniques

• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration
  – Integration of multiple databases, data cubes, or files.
• Data transformation
  – Normalization and aggregation.
• Data reduction
  – Obtains a reduced representation in volume, but produces the same or similar analytical results.
• Data discretization
  – Part of data reduction, but with particular importance, especially for numerical data.

AAST-Comp eng 2504072023
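Normalization, mentioned above under data transformation, is easy to sketch; a minimal min-max rescaling of one numeric column:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a numeric column into [new_min, new_max] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: map everything to new_min
        return [new_min] * len(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]
```

After this step, attributes with very different ranges (such as age and income) contribute comparably to distance-based classifiers.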


Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors.
• Providing faster and more cost-effective predictors.
• Providing a better understanding of the underlying process that generated the data.

AAST-Comp eng 2704072023

Feature Selection

• Transforming a dataset by removing some of its columns.

(Example: a table with columns A1, A2, A3, A4, C is reduced to columns A2, A4, C.)

04072023 AAST-Comp eng 28
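The column-removal example above can be sketched as plain index selection (the table values are illustrative):

```python
# Toy table with attributes A1..A4 and class column C (values illustrative)
header = ["A1", "A2", "A3", "A4", "C"]
rows = [
    [1, 7, 0, 3, "yes"],
    [2, 5, 9, 1, "no"],
]

keep = ["A2", "A4", "C"]                        # the selected feature subset
idx = [header.index(col) for col in keep]
reduced = [[row[i] for i in idx] for row in rows]
```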


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set.

(Figure: known categories "A" and "B"; classification (recognition), i.e. supervised classification.)

Classification

• Every day, all the time, we classify things.
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.

04072023 AAST-Comp eng 32

Classification vs. Prediction

• Classification:
  – predicts categorical class labels (discrete or nominal);
  – classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data.
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values.

04072023 AAST-Comp eng 33

Classification: A Two-Step Process

• Model construction: describing a set of predetermined classes.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: for classifying future or unknown objects.
  – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model; the test set is independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

04072023 AAST-Comp eng 34

Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

04072023 AAST-Comp eng 35

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Classification

• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

AAST-Comp eng 3604072023

Classification

• Predicts categorical class labels (discrete or nominal).
• Constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data.

AAST-Comp eng 3704072023

Quality of a classifier

• Quality will be calculated with respect to the lowest computing time.
• The quality of a certain model can be described by a confusion matrix.
• The confusion matrix shows a new entry's properties: the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

AAST-Comp eng 3804072023

Classification Techniques

• Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
• The ultimate reason for doing classification is to increase understanding of the domain, or to improve predictions compared to unclassified data.

04072023AAST-Comp eng39

Classification Techniques

(Figure: classification techniques include Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.)

Classification Model: Support Vector Machine Classifier (V. Vapnik)

04072023 AAST-Comp eng 41

Support Vector Machine (SVM)

• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc.,

due to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.

Tennis example

Humidity

Temperature

Legend: play tennis / do not play tennis

04072023 AAST-Comp eng 44

Linear Classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

45

This line represents the decision boundary: ax + by − c = 0

Ch. 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane 'far' from the data.

04072023 AAST-Comp eng 46
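The "place the hyper-plane far from the data" intuition can be made concrete: among candidate lines ax + by − c = 0 that separate the data, prefer the one whose closest point is furthest away. Below is a hand-rolled sketch on made-up 2-D points (not an actual SVM solver, which would search over all hyperplanes via quadratic programming):

```python
import math

def margin(a, b, c, points):
    """Smallest distance from any point to the line ax + by - c = 0."""
    return min(abs(a * x + b * y - c) / math.hypot(a, b) for x, y in points)

# Two classes: (0,0), (1,0) vs. (3,3), (4,3)
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 3.0), (4.0, 3.0)]

# Two candidate separating lines: x + y - 2 = 0 and x + y - 3.5 = 0.
# Both separate the classes; the second sits 'farther from the data'.
candidates = [(1.0, 1.0, 2.0), (1.0, 1.0, 3.5)]
best = max(candidates, key=lambda h: margin(*h, points))
```

The maximum-margin line passes midway between the two classes, which is exactly what the SVM optimum looks like in the separable case.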

SVM ndash Support Vector Machines

Support Vectors: Small Margin vs. Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.

• The decision function is fully specified by a subset of training samples, the support vectors.

• Solving SVMs is a quadratic programming problem.

• Seen by many as the most successful current text classification method.

48

Support vectors

Maximizes margin

Sec. 15.1

Narrower margin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
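A kernel is just a similarity function applied where the algorithm would otherwise take a dot product. For instance, the widely used RBF (Gaussian) kernel can be sketched in plain Python (gamma is a free parameter, chosen arbitrarily here):

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2): 1.0 for identical points, near 0 far apart."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> 1.0
k_far  = rbf_kernel([1.0, 2.0], [5.0, 6.0])  # distant points -> close to 0
```

Replacing dot products with such a kernel lets a linear-margin method learn non-linear decision boundaries.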

04072023 AAST-Comp eng 51

Classification Model: K-Nearest Neighbor Classifier

04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

(Figure: points labeled Response / No response; the new example is assigned the class Response.)

04072023 AAST-Comp eng 54

Distance Between Neighbors

• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]

04072023 AAST-Comp eng 55
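The distance computation on this slide translates directly into code (a plain-Python sketch; income is measured in thousands, as above):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john   = (35, 95, 3)   # age, income (K), number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)  # sqrt(36 + 14400 + 1)
```

Note how the income term dominates: without normalization, attributes on larger scales dominate the distance.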

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: points labeled Response / No response; the new example is assigned the class Response.)

04072023 AAST-Comp eng 56

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3 | No
Rachel   | 22  | 50K    | 2 | Yes
Hannah   | 63  | 200K   | 1 | No
Tom      | 59  | 170K   | 1 | No
Nellie   | 25  | 40K    | 4 | Yes
David    | 37  | 50K    | 2 | ?

04072023 AAST-Comp eng 57

Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35  | 3 | No  | sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel   | 22  | 50  | 2 | Yes | sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah   | 63  | 200 | 1 | No  | sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom      | 59  | 170 | 1 | No  | sqrt[(59−37)² + (170−50)² + (1−2)²] = 122.02
Nellie   | 25  | 40  | 4 | Yes | sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74
David    | 37  | 50  | 2 | Yes (predicted by 3-NN) |

04072023 AAST-Comp eng 58
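The 3-NN prediction for David can be reproduced in a few lines (a plain-Python sketch of the slide's arithmetic, not Weka's IBK implementation; names and numbers come from the table above):

```python
import math

training = [
    ("John",   (35,  35, 3), "No"),
    ("Rachel", (22,  50, 2), "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25,  40, 4), "Yes"),
]

def knn_predict(query, examples, k=3):
    dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # Sort the training set by distance to the query and keep the k nearest.
    nearest = sorted(examples, key=lambda ex: dist(query, ex[1]))[:k]
    labels = [label for _, _, label in nearest]
    # Majority vote among the k nearest neighbors.
    return max(set(labels), key=labels.count)

david = (37, 50, 2)
prediction = knn_predict(david, training)  # Rachel, John, Nellie are nearest
```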

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging k-nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small box office) = 0.3
– P(Medium box office) = 0.6
– P(Large box office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind – Payoff Table

Decisions | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000 | $1,000,000 | $3,000,000
Sign with TV Network | $900,000 | $900,000 | $900,000
Prior probabilities | 0.3 | 0.6 | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
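The expected-return arithmetic above can be checked in code (probabilities and payouts taken straight from the payoff table):

```python
payoffs = {"movie": [200_000, 1_000_000, 3_000_000],
           "tv":    [900_000,   900_000,   900_000]}
probs = [0.3, 0.6, 0.1]  # P(small), P(medium), P(large box office)

# Expected value of each decision: sum of probability-weighted payouts.
expected = {d: sum(p * v for p, v in zip(probs, vals))
            for d, vals in payoffs.items()}
best = max(expected, key=expected.get)  # decision with the highest EV
```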

Decision Trees
• Three types of "nodes":
– Decision nodes – represented by squares
– Chance nodes – represented by circles (Ο)
– Terminal nodes – represented by triangles (optional)

• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.

• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

0.3

0.6

0.1

0.3

0.6

0.1

ER = ?

ER = ?

ER = ?

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

0.3

0.6

0.1

0.3

0.6

0.1

ER = $900,000

ER = $960,000

ER = $960,000

04072023 AAST-Comp eng 70
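Solving the tree from right to left ("rolling back") can be sketched recursively: chance nodes average their branches by probability, decision nodes keep the best branch. The tuple encoding below is invented to mirror the Jenny Lind tree:

```python
def rollback(node):
    """Return (value, best_choice) for a node encoded as
    ('leaf', payoff), ('chance', [(prob, child), ...]) or
    ('decision', [(name, child), ...])."""
    kind = node[0]
    if kind == "leaf":
        return node[1], None
    if kind == "chance":
        # Expected value over the probability-weighted branches.
        return sum(p * rollback(child)[0] for p, child in node[1]), None
    # Decision node: keep only the branch with the highest rolled-back value.
    values = {name: rollback(child)[0] for name, child in node[1]}
    best = max(values, key=values.get)
    return values[best], best

box_office = [(0.3, ("leaf", 200_000)),
              (0.6, ("leaf", 1_000_000)),
              (0.1, ("leaf", 3_000_000))]
tree = ("decision", [("movie", ("chance", box_office)),
                     ("tv", ("leaf", 900_000))])
value, choice = rollback(tree)
```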

(Performance evaluation cycle: dataset → data preprocessing → feature selection → selection of data mining tool → classification → results)

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | TP | FN
Actual not healthy | FP | TN

AAST-Comp eng 7204072023

Cross-validation

• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
– Split the data into 10 equal-sized pieces.
– Train on 9 pieces and test on the remainder.
– Do this for all possibilities and average.

04072023 AAST-Comp eng 73
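The 10-fold split described above can be sketched without any library; only the index bookkeeping is shown here, with the classifier itself abstracted away (Weka does this internally when you select 10-fold cross-validation):

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists: each instance lands in exactly one test fold."""
    folds = [list(range(n))[i::k] for i in range(k)]  # k interleaved folds
    for i in range(k):
        test = folds[i]
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

splits = list(k_fold_indices(150, 10))
# Every instance appears in exactly one test fold across the 10 splits.
all_test = sorted(i for _, test in splits for i in test)
```

In each round the model is trained on the 135 training indices and scored on the 15 held-out ones; the 10 scores are then averaged.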

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction
Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

AAST-Comp eng

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

78

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.

• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND
• Bellaachi et al. used naïve Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.

• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.

• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.

• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.

• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository: data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages no longer apply; for the 683-instance dataset the distribution is Benign: 444 (65%) and Malignant: 239 (35%).

04072023AAST-Comp eng85
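The cleaning step (dropping rows whose values are missing, marked "?" in the UCI file) can be sketched on a toy excerpt. The three inline rows below are an invented sample in the dataset's comma-separated format (id, 9 attributes, class), not the real 699-row file:

```python
rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",
    "1057013,8,4,5,1,2,?,7,3,1,4",    # '?' = missing Bare Nuclei value: dropped
    "1017122,8,10,10,8,7,10,9,7,1,4",
]

# Keep only complete rows, then recount the class distribution
# (last field: 2 = benign, 4 = malignant).
clean = [r.split(",") for r in rows if "?" not in r]
benign    = sum(1 for r in clean if r[-1] == "2")
malignant = sum(1 for r in clean if r[-1] == "4")
```

Applied to the full file, this is the step that takes 699 instances down to 683.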

04072023 AAST-Comp eng 86

Attribute | Domain
Sample Code Number | ID number
Clump Thickness | 1–10
Uniformity of Cell Size | 1–10
Uniformity of Cell Shape | 1–10
Marginal Adhesion | 1–10
Single Epithelial Cell Size | 1–10
Bare Nuclei | 1–10
Bland Chromatin | 1–10
Normal Nucleoli | 1–10
Mitoses | 1–10
Class | 2 for benign, 4 for malignant

0407202387

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

Distribution of the input variables' values (count of instances per value, 1–10)

04072023AAST-Comp eng90

Attribute | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum
Clump Thickness | 139 | 50 | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69 | 683
Uniformity of Cell Size | 373 | 45 | 52 | 38 | 30 | 25 | 19 | 28 | 6 | 67 | 683
Uniformity of Cell Shape | 346 | 58 | 53 | 43 | 32 | 29 | 30 | 27 | 7 | 58 | 683
Marginal Adhesion | 393 | 58 | 58 | 33 | 23 | 21 | 13 | 25 | 4 | 55 | 683
Single Epithelial Cell Size | 44 | 376 | 71 | 48 | 39 | 40 | 11 | 21 | 2 | 31 | 683
Bare Nuclei | 402 | 30 | 28 | 19 | 30 | 4 | 8 | 21 | 9 | 132 | 683
Bland Chromatin | 150 | 160 | 161 | 39 | 34 | 9 | 71 | 28 | 11 | 20 | 683
Normal Nucleoli | 432 | 36 | 42 | 18 | 19 | 22 | 16 | 23 | 15 | 60 | 683
Mitoses | 563 | 35 | 33 | 12 | 6 | 3 | 9 | 8 | 0 | 14 | 683
Sum | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria | BF Tree | IBK | SMO
Time to build model (in sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46 | 95.90 | 96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

92 04072023 AAST-Comp eng

EXPERIMENTAL RESULTS

Classifier | TP rate | FP rate | Precision | Recall | Class
BF Tree | 0.971 | 0.075 | 0.960 | 0.971 | Benign
BF Tree | 0.925 | 0.029 | 0.944 | 0.925 | Malignant
IBK | 0.980 | 0.079 | 0.958 | 0.980 | Benign
IBK | 0.921 | 0.020 | 0.961 | 0.921 | Malignant
SMO | 0.971 | 0.054 | 0.971 | 0.971 | Benign
SMO | 0.946 | 0.029 | 0.946 | 0.946 | Malignant

EXPERIMENTAL RESULTS

Classifier | Classified as Benign | Classified as Malignant | Actual class
BF Tree | 431 | 13 | Benign
BF Tree | 18 | 221 | Malignant
IBK | 435 | 9 | Benign
IBK | 19 | 220 | Malignant
SMO | 431 | 13 | Benign
SMO | 13 | 226 | Malignant

94 04072023AAST-Comp eng
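Plugging the SMO confusion matrix above into the standard definitions (treating benign as the positive class) reproduces the reported 96.19% accuracy:

```python
# SMO confusion matrix, benign taken as the positive class.
tp, fn = 431, 13   # benign instances correctly / wrongly classified
fp, tn = 13, 226   # malignant instances wrongly / correctly classified

sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
accuracy    = (tp + tn) / (tp + fp + tn + fn)  # (431 + 226) / 683
```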

importance of the input variables

04072023AAST-Comp eng95

Variable | Chi-squared | Info Gain | Gain Ratio | Average | Importance (rank)
Clump Thickness | 378.08158 | 0.464 | 0.152 | 126.2325 | 8
Uniformity of Cell Size | 539.79308 | 0.702 | 0.300 | 180.2650 | 1
Uniformity of Cell Shape | 523.07097 | 0.677 | 0.272 | 174.6733 | 2
Marginal Adhesion | 390.05950 | 0.464 | 0.210 | 130.2445 | 7
Single Epithelial Cell Size | 447.86118 | 0.534 | 0.233 | 149.5427 | 5
Bare Nuclei | 489.00953 | 0.603 | 0.303 | 163.3052 | 3
Bland Chromatin | 453.20971 | 0.555 | 0.201 | 151.3219 | 4
Normal Nucleoli | 416.63061 | 0.487 | 0.237 | 139.1182 | 6
Mitoses | 191.96820 | 0.212 | 0.212 | 64.1227 | 9

04072023AAST-Comp eng96
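Averaging the three scores per attribute and sorting reproduces the importance ranking, with Uniformity of Cell Size first (the score values are copied from the table above):

```python
scores = {  # attribute: (chi-squared, info gain, gain ratio)
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

# Rank attributes by the average of their three scores, best first.
ranked = sorted(scores, key=lambda a: sum(scores[a]) / 3, reverse=True)
```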

0407202397

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

AAST-Comp eng

Notes on the paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

04072023AAST-Comp eng99

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers",

• published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.

• That paper introduced a more advanced idea, making a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).

[4] S. Aruna, Dr. S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

AAST-Comp eng 102

[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.

04072023

[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.

AAST-Comp eng 10304072023

AAST-Comp eng 104

[15] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory. 1st ed., Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA. 185.

04072023

04072023105

Thank you

AAST-Comp eng

Page 8: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Breast Cancer Classification

8AAST-Comp eng

Round well-defined larger groups are more likely benign

Tight cluster of tiny irregularly shaped groups may indicate cancer Malignant

Suspicious pixels groups show up as white spots on a mammogram

04072023

Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash

features that turned out to be cancer used for diagnosis prognosis of each cell nucleus

9AAST-Comp eng

F2Magnetic Resonance Image

F1

F3

Fn

Feature

Extraction

04072023

Diagnosis or prognosis

Brest CancerBenign

Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage

bull Radiologists misdiagnose 10-30 of the malignant cases

bull Of the cases sent for surgical biopsy only 10-20 are actually malignant

Computational Intelligence

Computational IntelligenceData + Knowledge

Artificial Intelligence

Expert systems

Fuzzylogic

PatternRecognition

Machinelearning

Probabilistic methods

Multivariatestatistics

Visuali-zation

Evolutionaryalgorithms

Neuralnetworks

04072023 AAST-Comp eng 12

What do these methods do

bull Provide non-parametric models of databull Allow to classify new data to pre-defined

categories supporting diagnosis amp prognosis

bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy

or crisp logical rulesbull Help to visualize multi-dimensional

relationships among data samples 04072023 AAST-Comp eng 13

14

Feature selection

Data Preprocessing

Selecting Data mining tool dataset

Classification algorithm

SMO IBK BF TREE

Results and evaluationsAAST-Comp eng

Pattern recognition system decomposition

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

bull Data Mining is set of techniques used in various domains to give meaning to the available data

bull Objective Fit data to a modelndashDescriptivendashPredictive

04072023

Predictive amp descriptive data mining

bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples

bull Descriptive Is to describe the general or special features of a set of data in a concise manner

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for

data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code

bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes

bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classification
- Classification is a data mining (machine learning) technique used to predict group membership for data instances.
- Classification analysis is the organization of data into given classes.
- These approaches normally use a training set where all objects are already associated with known class labels.
- The classification algorithm learns from the training set and builds a model.
- Many classification models are used to classify new objects.

Classification
- Predicts categorical class labels (discrete or nominal).
- Constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify unseen data.

Quality of a classifier
- Quality can be measured with respect to computing time (the lower, the better).
- The quality of a model can be described by a confusion matrix.
- The confusion matrix shows, for each new entry, the predictive ability of the method.
- Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
- Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.
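The row/column convention above can be made concrete with a small helper. A sketch with made-up class names and labels (only the predicted-rows / actual-columns layout comes from the slide):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows = predicted class, columns = actual class, as on the slide."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in classes] for p in classes]

# Illustrative labels for five instances.
actual    = ["benign", "benign", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign"]

m = confusion_matrix(actual, predicted, ["benign", "malignant"])
# Diagonal entries are correct classifications; off-diagonal are errors.
print(m)  # [[2, 0], [1, 2]]
```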

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data


Classification Techniques

- Naïve Bayes
- SVM (Support Vector Machine)
- C4.5
- KNN (K-Nearest Neighbor)
- BF Tree
- IBK

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

Support Vector Machine (SVM)
- SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
- Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example (figure): training examples plotted by Humidity vs. Temperature; one symbol marks "play tennis", the other "do not play tennis".

Linear classifiers: Which Hyperplane?
- There are lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one.
- Support Vector Machine (SVM) finds an optimal solution:
  - It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
(figure: the line ax + by - c = 0 represents the decision boundary)
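The boundary ax + by - c = 0 classifies a point by the sign of ax + by - c, and a point's distance to the boundary (the quantity the SVM maximizes for the closest points) is |ax + by - c| / sqrt(a^2 + b^2). A small sketch with an illustrative boundary x + y - 1 = 0 (the coefficients are assumptions for the example, not from the slides):

```python
import math

def signed_distance(a, b, c, x, y):
    """Signed distance from (x, y) to the line a*x + b*y - c = 0."""
    return (a * x + b * y - c) / math.hypot(a, b)

def classify(a, b, c, x, y):
    """Class +1 on one side of the boundary, -1 on the other."""
    return 1 if a * x + b * y - c >= 0 else -1

# Illustrative boundary: x + y - 1 = 0
a, b, c = 1.0, 1.0, 1.0
print(classify(a, b, c, 2.0, 2.0))                   # 1
print(classify(a, b, c, 0.0, 0.0))                   # -1
print(round(signed_distance(a, b, c, 2.0, 2.0), 4))  # 2.1213 (= 3/sqrt(2))
```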

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane "far" from the data.

SVM - Support Vector Machines
(figure: support vectors shown for a small-margin separator vs. a large-margin separator)

Support Vector Machine (SVM)

- SVMs maximize the margin around the separating hyperplane.
- The decision function is fully specified by a subset of the training samples, the support vectors.
- Solving SVMs is a quadratic programming problem.
- Seen by many as the most successful current text classification method.
(figure: the support vectors lie on the margin; the chosen hyperplane maximizes the margin, while alternatives give a narrower margin)

Non-Separable Case


The Lagrangian trick

SVM
- A relatively new concept.
- Nice generalization properties.
- Hard to learn: learned in batch mode using quadratic programming techniques.
- Using kernels, SVMs can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
- Calculate the distance between E and all examples in the training set.
- Select the K examples nearest to E in the training set.
- Assign E to the most common class among its K nearest neighbors.
(figure: points labeled "Response" / "No response"; the new example falls in class "Response")

Distance Between Neighbors
- Each example is represented by a set of numerical attributes.
- "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

  D(X, Y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )

- Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95 - 215)^2 + (3 - 2)^2 ]

Instance Based Learning
- No model is built: store all training examples.
- Any processing is delayed until a new instance must be classified.
(figure: points labeled "Response" / "No response"; the new instance is assigned the class "Response")

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David     37   50          2          Yes (predicted: the 3 nearest neighbors are Rachel, John, and Nellie)
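The 3-NN vote for David can be reproduced in a few lines. A sketch using the table's values (the code itself is illustrative):

```python
import math
from collections import Counter

# (age, income in $K, number of credit cards) -> response
training = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Rank the training examples by distance from David and take the 3 nearest.
nearest = sorted(training.items(),
                 key=lambda kv: euclidean(kv[1][0], david))[:3]
print([name for name, _ in nearest])  # ['Rachel', 'John', 'Nellie']

# Majority vote among the 3 nearest neighbors.
vote = Counter(label for _, (_, label) in nearest).most_common(1)[0][0]
print(vote)                           # Yes
```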

Strengths and Weaknesses

Strengths:
- Simple to implement and use.
- Comprehensible: easy to explain the prediction.
- Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
- Needs a lot of space to store all examples.
- Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples).

Decision Tree


- Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
- The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network she will receive a single lump sum, but if she signs with the movie company the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
- Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
- TV network payout:
  - Flat rate: $900,000
- Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decision                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior probabilities      0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV_UII, or EV_Best
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
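The expected-return computation can be checked in code. A sketch using the payoff table's values:

```python
# Expected return for each decision, from the payoff table.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoff = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

def expected_value(decision):
    return sum(probs[s] * payoff[decision][s] for s in probs)

ev = {d: expected_value(d) for d in payoff}
best = max(ev, key=ev.get)
print(round(ev["movie"]))  # 960000
print(round(ev["tv"]))     # 900000
print(best)                # movie
```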

Decision Trees
- Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
- Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
- Create the tree from left to right; solve the tree from right to left.

Example Decision Tree (figure): a decision node (square) branches into Decision 1 and Decision 2; a chance node (circle) branches into Event 1, Event 2, and Event 3.

Jenny Lind Decision Tree (figure): "Sign with Movie Co." leads to a chance node with Small, Medium, and Large Box Office outcomes paying $200,000, $1,000,000, and $3,000,000; "Sign with TV Network" leads to a chance node paying $900,000 in all three outcomes.

Jenny Lind Decision Tree (figure): the same tree with probabilities 0.3, 0.6, and 0.1 attached to the Small, Medium, and Large Box Office branches, and expected return (ER) values to be computed at each chance node.

Jenny Lind Decision Tree - Solved (figure): ER = $960,000 for the movie branch and ER = $900,000 for the TV branch, so the decision node takes ER = $960,000 and Jenny signs with the movie company.

Results

Performance evaluation cycle: Dataset -> Data preprocessing -> Feature selection -> Selection of data mining tool -> Classification -> Performance evaluation.

Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn

Cross-validation
- Correctly Classified Instances: 143 (95.3%)
- Incorrectly Classified Instances: 7 (4.67%)
- Default is 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
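The fold bookkeeping can be sketched without any library; `evaluate` below is a hypothetical stand-in for training on 9 folds and testing on the held-out fold:

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k (nearly) equal-sized folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, evaluate):
    """Train on k-1 folds, test on the held-out fold, and average."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_fold in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(evaluate(train, test_fold))
    return sum(scores) / k

folds = k_fold_indices(150, 10)
print([len(f) for f in folds])  # ten folds of 15 indices each
```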


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
- The aim of this paper is to investigate the performance of different classification techniques.
- The goal is to develop accurate prediction models for breast cancer using data mining techniques.
- Three classification techniques are compared in the Weka software and the comparison results are reported.
- Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
- Are usually not harmful
- Rarely invade the tissues around them
- Don't spread to other parts of the body
- Can be removed and usually don't grow back

Malignant tumors:
- May be a threat to life
- Can invade nearby organs and tissues (such as the chest wall)
- Can spread to other parts of the body
- Often can be removed, but sometimes grow back

Risk factors
- Gender
- Age
- Genetic risk factors
- Family history
- Personal history of breast cancer
- Race: white or black
- Dense breast tissue: denser breast tissue carries a higher risk
- Certain benign (non-cancerous) breast problems
- Lobular carcinoma in situ
- Menstrual periods

Risk factors (cont.)
- Breast radiation early in life
- Treatment with the drug DES (diethylstilbestrol) during pregnancy
- Not having children, or having them later in life
- Certain kinds of birth control
- Using hormone therapy after menopause
- Not breastfeeding
- Alcohol
- Being overweight or obese

BACKGROUND
- Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
- Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
- Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
- Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
- Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND
- Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
- Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
- Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
- Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

BACKGROUND
- Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
- B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
- From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
- 2 classes (malignant and benign) and 9 integer-valued attributes.
- breast-cancer-wisconsin has 699 instances.
- We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
- Class distribution (as stated): Benign: 458 (65.5%), Malignant: 241 (34.5%).
- Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is Benign: 444 (65%), Malignant: 239 (35%).

Attribute                    Domain
Sample Code Number           ID number
Clump Thickness              1-10
Uniformity of Cell Size      1-10
Uniformity of Cell Shape     1-10
Marginal Adhesion            1-10
Single Epithelial Cell Size  1-10
Bare Nuclei                  1-10
Bland Chromatin              1-10
Normal Nucleoli              1-10
Mitoses                      1-10
Class                        2 for benign, 4 for malignant

EVALUATION METHODS
- We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
- WEKA is a collection of machine learning algorithms for data mining tasks.
- The algorithms can either be applied directly to a dataset or called from your own Java code.
- WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
- It is also well suited for developing new machine learning schemes.
- WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS
(figures: charts of the experimental results)

Importance of the input variables

Attribute                    1    2    3    4    5    6    7    8    9    10   Sum
Clump Thickness              139  50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size      373  45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape     346  58   53   43   32   29   30   27   7    58   683
Marginal Adhesion            393  58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size  44   376  71   48   39   40   11   21   2    31   683
Bare Nuclei                  402  30   28   19   30   4    8    21   9    132  683
Bland Chromatin              150  160  161  39   34   9    71   28   11   20   683
Normal Nucleoli              432  36   42   18   19   22   16   23   15   60   683
Mitoses                      563  35   33   12   6    3    9    8    0    14   683
Sum                          2843 850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

Evaluation Criteria               BF Tree  IBK    SMO
Time to build model (sec)         0.97     0.02   0.33
Correctly classified instances    652      655    657
Incorrectly classified instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS
- The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
- The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
- The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
- True positive (TP) = number of positive samples correctly predicted.
- False negative (FN) = number of positive samples wrongly predicted.
- False positive (FP) = number of negative samples wrongly predicted as positive.
- True negative (TN) = number of negative samples correctly predicted.
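Plugging the SMO confusion matrix from the results (431 benign and 226 malignant correctly classified, 13 misclassified each way; malignant taken as the positive class) into these definitions reproduces the reported accuracy. A sketch:

```python
# SMO confusion matrix from the results section,
# with malignant as the positive class:
tp, fn = 226, 13   # malignant correctly / wrongly predicted
tn, fp = 431, 13   # benign correctly / wrongly predicted

sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3))    # 0.946
print(round(specificity, 3))    # 0.971
print(round(accuracy * 100, 2)) # 96.19, matching the SMO row in the table
```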

EXPERIMENTAL RESULTS

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.96       0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.98   0.079  0.958      0.98    Benign
            0.921  0.02   0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier  Benign  Malignant  Class
BF Tree     431     13         Benign
            18      221        Malignant
IBK         435     9          Benign
            19      220        Malignant
SMO         431     13         Benign
            13      226        Malignant

Importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Rank (importance)
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.3         180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.0595     0.464      0.21        130.2445    7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.9682     0.212      0.212       64.122733   9

CONCLUSION
- The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
- We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
- The performance of SMO is high compared with the other classifiers.
- The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
- Use an updated version of Weka.
- Use another data mining tool.
- Use alternative algorithms and techniques.

Notes on paper
- Spelling mistakes
- No point of contact (e-mail)
- Wrong percentage calculation
- Copying from old papers
- Charts not clear
- No contributions

Comparison
- "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
- That paper introduced a more advanced idea and performs a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


Page 9: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Breast cancerrsquos Featuresbull MRI - Cancer can have a unique appearance ndash

features that turned out to be cancer used for diagnosis prognosis of each cell nucleus

9AAST-Comp eng

F2Magnetic Resonance Image

F1

F3

Fn

Feature

Extraction

04072023

Diagnosis or prognosis

Brest CancerBenign

Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

bull Mammography allows for efficient diagnosis of breast cancers at an earlier stage

bull Radiologists misdiagnose 10-30 of the malignant cases

bull Of the cases sent for surgical biopsy only 10-20 are actually malignant

Computational Intelligence

Computational IntelligenceData + Knowledge

Artificial Intelligence

Expert systems

Fuzzylogic

PatternRecognition

Machinelearning

Probabilistic methods

Multivariatestatistics

Visuali-zation

Evolutionaryalgorithms

Neuralnetworks

04072023 AAST-Comp eng 12

What do these methods do

bull Provide non-parametric models of databull Allow to classify new data to pre-defined

categories supporting diagnosis amp prognosis

bull Allow to discover new categoriesbull Allow to understand the data creating fuzzy

or crisp logical rulesbull Help to visualize multi-dimensional

relationships among data samples 04072023 AAST-Comp eng 13

14

Feature selection

Data Preprocessing

Selecting Data mining tool dataset

Classification algorithm

SMO IBK BF TREE

Results and evaluationsAAST-Comp eng

Pattern recognition system decomposition

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

bull Data Mining is set of techniques used in various domains to give meaning to the available data

bull Objective Fit data to a modelndashDescriptivendashPredictive

04072023

Predictive amp descriptive data mining

bull Predictive Is the process of automatically creating a classification model from a set of examples called the training set which belongs to a set of classes Once a model is created it can be used to automatically predict the class of other unclassified examples

bull Descriptive Is to describe the general or special features of a set of data in a concise manner

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

wekabull Waikato environment for knowledge analysisbull Weka is a collection of machine learning algorithms for

data mining tasks The algorithms can either be applied directly to a dataset or called from your own Java code

bull Weka contains tools for data pre-processing classification regression clustering association rules and visualization It is also well-suited for developing new machine learning schemes

bull Found only on the islands of New Zealand the Weka is a flightless bird with an inquisitive nature

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Prediction:
• models continuous-valued functions, i.e., predicts unknown or missing values

04072023 AAST-Comp eng 33

Classification — A Two-Step Process

1. Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae

2. Model usage: for classifying future or unknown objects
• Estimate the accuracy of the model:
– The known label of each test sample is compared with the classified result from the model
– The accuracy rate is the percentage of test set samples that are correctly classified by the model
– The test set is independent of the training set, otherwise over-fitting will occur
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1): Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms →

Classifier (Model):
IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'

04072023 AAST-Comp eng 35

Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
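Applying the learned rule to the unseen tuple can be sketched in a few lines (the `classify` helper is hypothetical, just encoding the rule from the slide):

```python
# Classifier (model) learned from the training data:
# IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Unseen data from the slide: (Jeff, Professor, 4)
prediction = classify("Professor", 4)
```

Note that on the testing data the rule errs on Merlisa (Associate Prof, 7 years, not tenured), which is exactly the kind of error the accuracy estimate on an independent test set measures.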

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances
• Classification analysis is the organization of data into given classes
• These approaches normally use a training set where all objects are already associated with known class labels
• The classification algorithm learns from the training set and builds a model
• Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

• predicts categorical class labels (discrete or nominal)

• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifier
• Quality is calculated with respect to lowest computing time
• The quality of a model can be described by a confusion matrix
• The confusion matrix shows the predictive ability of the method for new entries
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class
• Thus the diagonal elements represent correctly classified compounds
• The off-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naïve Bayes

SVM

C4.5

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification Model: Support Vector Machine Classifier

(V. Vapnik)

04072023 AAST-Comp eng 41

Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43


Tennis example

(Figure: points plotted by Humidity vs. Temperature; one class = play tennis, the other = do not play tennis)

04072023 AAST-Comp eng 44

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c
• Some methods find a separating hyperplane, but not the optimal one
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

This line represents the decision boundary: ax + by − c = 0

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane "far" from the data

04072023 AAST-Comp eng 46
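The "place the hyper-plane far from the data" intuition can be sketched numerically: among candidate separating lines, prefer the one whose smallest point-to-line distance (the margin) is largest. The toy points and candidate lines below are hypothetical:

```python
import math

def margin(w, c, points):
    """Smallest distance from any point to the line w[0]*x + w[1]*y - c = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y - c) / norm for x, y in points)

points = [(0, 0), (1, 0), (3, 3), (4, 2)]   # two separable toy classes

margin_a = margin((1, 1), 3, points)   # candidate line x + y - 3 = 0
margin_b = margin((1, 0), 2, points)   # candidate line x - 2 = 0
# An SVM would prefer candidate A: its margin is larger.
```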

SVM – Support Vector Machines

(Figure: support vectors shown for two candidate hyperplanes — one with a small margin, one with a large margin)

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane
• The decision function is fully specified by a subset of the training samples, the support vectors
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification method

(Figure: support vectors define the maximized margin; a narrower margin is also shown)

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM
• Relatively new concept
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

04072023 AAST-Comp eng 51

Classification Model: K-Nearest Neighbor Classifier

04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm
 To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set
 Select the K nearest examples to E in the training set
 Assign E to the most common class among its K nearest neighbors

(Figure: points labeled "Response" / "No response"; the new example is assigned to class "Response".)

Distance Between Neighbors

 Each example is represented with a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

   D(X, Y) = sqrt( Σ i=1..n (xi − yi)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)² + (95K − 215K)² + (3 − 2)²]

04072023 AAST-Comp eng 55
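The distance above is straightforward to compute (income expressed in thousands, per the slide):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two examples given as attribute tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95, 3)      # age, income (K), number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)   # sqrt[(35-41)^2 + (95-215)^2 + (3-2)^2]
```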

Instance Based Learning
 No model is built: store all training examples
 Any processing is delayed until a new instance must be classified


Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distances from David (income in K):
John:   sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²] = 15.16
Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15
Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.23
Tom:    sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122
Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is Yes.

04072023 AAST-Comp eng 58
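The 3-NN vote for David can be sketched as:

```python
import math

# Training customers from the slide (age, income in K, number of credit cards).
train = [("John",   (35, 35, 3),  "No"),
         ("Rachel", (22, 50, 2),  "Yes"),
         ("Hannah", (63, 200, 1), "No"),
         ("Tom",    (59, 170, 1), "No"),
         ("Nellie", (25, 40, 4),  "Yes")]
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Select the 3 nearest examples and take the most common class among them.
nearest = sorted(train, key=lambda rec: dist(rec[1], david))[:3]
labels = [label for _, _, label in nearest]
prediction = max(set(labels), key=labels.count)   # -> "Yes"
```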

Strengths and Weaknesses

Strengths:
 Simple to implement and use
 Comprehensible – easy to explain the prediction
 Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
 Needs a lot of space to store all examples
 Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and Probabilities

• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind – Payoff Table

                          States of Nature
Decision                  Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior Probabilities       0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
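The expected-return arithmetic can be checked in a few lines:

```python
# Payoffs and prior probabilities from the payoff table.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv    = {"small": 900_000, "medium": 900_000, "large": 900_000}

ev_movie = sum(probs[s] * movie[s] for s in probs)   # $960,000 (up to float rounding)
ev_tv    = sum(probs[s] * tv[s] for s in probs)      # $900,000 (up to float rounding)
best = "movie" if ev_movie > ev_tv else "tv"
```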

Decision Trees
• Three types of "nodes":
– Decision nodes – represented by squares (□)
– Chance nodes – represented by circles (○)
– Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes
• Create the tree from left to right
• Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

(Figure: a decision node (square) with branches Decision 1 and Decision 2; a chance node (circle) with branches Event 1, Event 2, and Event 3.)

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

(Figure: a decision node chooses between "Sign with Movie Co." and "Sign with TV Network"; each choice leads to a chance node with branches Small, Medium, and Large Box Office. Movie payoffs: $200,000 / $1,000,000 / $3,000,000; TV payoffs: $900,000 / $900,000 / $900,000.)

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

(Figure: the same tree with probabilities 0.3 / 0.6 / 0.1 attached to the Small / Medium / Large Box Office branches, and an expected return (ER) to be computed at each chance node.)

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree – Solved

(Figure: the solved tree. ER(movie) = 0.3·200,000 + 0.6·1,000,000 + 0.1·3,000,000 = $960,000; ER(TV) = $900,000; the decision node takes the better branch, ER = $960,000.)

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

                     Predicted as healthy  Predicted as unhealthy
Actual healthy       TP                    FN
Actual not healthy   FP                    TN

AAST-Comp eng 7204072023

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all possibilities and average

04072023 AAST-Comp eng 73
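The 10-fold scheme can be sketched as follows (a split-only sketch; a real run would fit a classifier on each training portion and average the scores):

```python
def ten_fold_splits(data, k=10):
    """Yield (train, test) pairs; each item is held out exactly once."""
    for i in range(k):
        test = data[i::k]
        train = [x for j, x in enumerate(data) if j % k != i]
        yield train, test

data = list(range(150))          # 150 instances, as in the Weka run above
sizes = []
for train, test in ten_fold_splits(data):
    assert len(train) + len(test) == len(data)
    sizes.append(len(test))
```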

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The aim is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software, and the comparison results show that
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

AAST-Comp eng

04072023

Risk factors

 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: denser breast tissue carries a higher risk
 Certain benign (non-cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese

78

0407202379

BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND
 Bellaachi et al. used naïve Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
 Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND
 Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. Classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 From the UC Irvine machine learning repository: data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, so that percentage is wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute                     Domain
Sample Code Number            Id number
Clump Thickness               1 – 10
Uniformity of Cell Size       1 – 10
Uniformity of Cell Shape      1 – 10
Marginal Adhesion             1 – 10
Single Epithelial Cell Size   1 – 10
Bare Nuclei                   1 – 10
Bland Chromatin               1 – 10
Normal Nucleoli               1 – 10
Mitoses                       1 – 10
Class                         2 for benign, 4 for malignant

0407202387

EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software, issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain                        1     2    3    4    5    6    7    8    9    10   Sum
Clump Thickness               139   50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27   7    58   683
Marginal Adhesion             393   58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size   44    376  71   48   39   40   11   21   2    31   683
Bare Nuclei                   402   30   28   19   30   4    8    21   9    132  683
Bland Chromatin               150   160  161  39   34   9    71   28   11   20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15   60   683
Mitoses                       563   35   33   12   6    3    9    8    0    14   683
Sum                           2843  850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria                Classifiers
                                   BF Tree  IBK    SMO
Time to build model (in sec)       0.97     0.02   0.33
Correctly classified instances     652      655    657
Incorrectly classified instances   31       28     26
Accuracy (%)                       95.46    95.90  96.19

EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
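These definitions can be checked against the SMO confusion matrix from the results (431 benign correct, 13 misclassified each way, 226 malignant correct), treating benign as the positive class:

```python
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO on the 683-instance dataset, benign as the positive class.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
# round(acc, 4) -> 0.9619, matching the reported 96.19% accuracy
```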

EXPERIMENTAL RESULTS

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.96       0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.98   0.079  0.958      0.98    Benign
            0.921  0.02   0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

Classifier  Benign  Malignant  ← classified as
BF Tree     431     13         Benign
            18      221        Malignant
IBK         435     9          Benign
            19      220        Malignant
SMO         431     13         Benign
            13      226        Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

Variable                      Chi-squared  Info Gain  Gain Ratio  Average     Rank (Importance)
Clump Thickness               378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size       539.79308    0.702      0.3         180.265026  1
Uniformity of Cell Shape      523.07097    0.677      0.272       174.673323  2
Marginal Adhesion             390.0595     0.464      0.21        130.2445    7
Single Epithelial Cell Size   447.86118    0.534      0.233       149.542726  5
Bare Nuclei                   489.00953    0.603      0.303       163.305176  3
Bland Chromatin               453.20971    0.555      0.201       151.321903  4
Normal Nucleoli               416.63061    0.487      0.237       139.118203  6
Mitoses                       191.9682     0.212      0.212       64.122733   9

04072023AAST-Comp eng96

0407202397

CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions

04072023AAST-Comp eng99

Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers",
 International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea, making a fusion between classifiers.

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, in Egypt (30 March – 1 April 2005).

[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

AAST-Comp eng 102

[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.

04072023

[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023

04072023105

Thank you

AAST-Comp eng

Page 10: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Diagnosis or prognosis

Breast Cancer: Benign / Malignant

AAST-Comp eng 1004072023

04072023 AAST-Comp eng 11

Computer-Aided Diagnosis

• Mammography allows for efficient diagnosis of breast cancers at an earlier stage
• Radiologists misdiagnose 10–30% of the malignant cases
• Of the cases sent for surgical biopsy, only 10–20% are actually malignant

Computational Intelligence

(Figure: Computational Intelligence = Data + Knowledge, drawing on artificial intelligence, expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, and neural networks.)

04072023 AAST-Comp eng 12

What do these methods do?

• Provide non-parametric models of data
• Allow to classify new data into pre-defined categories, supporting diagnosis & prognosis
• Allow to discover new categories
• Allow to understand the data, creating fuzzy or crisp logical rules
• Help to visualize multi-dimensional relationships among data samples

Pattern recognition system decomposition

(Diagram: dataset → data preprocessing → feature selection → selecting the data mining tool → classification algorithms (SMO, IBK, BF Tree) → results and evaluations)

04072023

Results

Data preprocessing

Feature selectionClassification

Selection tool data mining

Performance evaluation Cycle

Dataset

data sets

AAST-Comp eng 1604072023

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

• Data Mining is a set of techniques used in various domains to give meaning to the available data
• Objective: fit data to a model
– Descriptive
– Predictive

04072023

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

weka
• Waikato Environment for Knowledge Analysis
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

04072023 AAST-Comp eng 22

Results

Data preprocessing

Feature selection Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Data Preprocessing

bull Data in the real world is ndash incomplete lacking attribute values lacking certain attributes

of interest or containing only aggregate datandash noisy containing errors or outliersndash inconsistent containing discrepancies in codes or names

bull Quality decisions must be based on quality data measures

Accuracy Completeness Consistency Timeliness Believability Value added and Accessibility

AAST-Comp eng 2404072023

Preprocessing techniques

• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration
  – Integration of multiple databases, data cubes, or files.
• Data transformation
  – Normalization and aggregation.
• Data reduction
  – Obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization
  – Part of data reduction, of particular importance for numerical data.
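As an illustration of the first cleaning step ("fill in missing values"), here is a minimal Python sketch. The helper name and the column-mean strategy are our own choices for illustration, not something prescribed by the slides or by Weka's filters:

```python
import statistics

def fill_missing_with_mean(rows, missing=None):
    """Replace missing entries (None) in each column with that column's mean.

    `rows` is a list of equal-length numeric records; a minimal sketch of the
    'fill in missing values' cleaning step.
    """
    columns = list(zip(*rows))
    # compute each column mean over the non-missing values only
    means = [statistics.mean(v for v in col if v is not missing) for col in columns]
    return [[means[j] if v is missing else v for j, v in enumerate(row)]
            for row in rows]

cleaned = fill_missing_with_mean([[1, 4], [None, 6], [3, None]])
# gaps are filled with column means: mean(1, 3) = 2 and mean(4, 6) = 5
```

Median or most-frequent-value imputation follows the same pattern; which strategy is appropriate depends on the attribute.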

[Figure: methodology cycle — dataset → data preprocessing → feature selection → classification (selected data mining tool) → performance evaluation → results]

Feature selection

Finding a feature subset that carries the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data


Feature Selection

• Transforming a dataset by removing some of its columns:

  (A1, A2, A3, A4, C)  →  (A2, A4, C)
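The column-removal transformation (A1, A2, A3, A4, C) → (A2, A4, C) can be sketched in a few lines; `select_features` is a hypothetical helper, independent of Weka's attribute-selection filters:

```python
def select_features(rows, header, keep):
    """Project a dataset onto a subset of its columns (feature selection
    by removing the columns that are not kept)."""
    idx = [header.index(name) for name in keep]
    return [[row[i] for i in idx] for row in rows]

header = ["A1", "A2", "A3", "A4", "C"]
rows = [[1, 2, 3, 4, "yes"],
        [5, 6, 7, 8, "no"]]
reduced = select_features(rows, header, keep=["A2", "A4", "C"])
# reduced == [[2, 4, "yes"], [6, 8, "no"]]
```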

[Figure: methodology cycle — dataset → data preprocessing → feature selection → classification (selected data mining tool) → performance evaluation → results]

Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on a model built from the training set of known categories.

[Figure: Classification (Recognition) (Supervised Classification) — examples assigned to Category "A" or Category "B"]

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk, or not?

Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.

Prediction:
• models continuous-valued functions, i.e., predicts unknown or missing values.

Classification — A Two-Step Process

1. Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.

2. Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model: the known label of each test sample is compared with the result produced by the model, and the accuracy rate is the percentage of test-set samples correctly classified by the model.
  – The test set must be independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? The rule predicts 'yes'.
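Since the learned model on this slide is a single rule, "using the model" amounts to evaluating that rule on unseen data — a small sketch:

```python
def tenured(rank, years):
    """The classifier learned from the training data, expressed as the rule:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Unseen data (Jeff, Professor, 4): the rank condition fires, so the
# model predicts tenured = 'yes'.
prediction = tenured("Professor", 4)
```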

Classification
• A data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set in which all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier
• Quality is also judged with respect to computing time: lower is better.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data


Classification Techniques

[Figure: classification techniques — Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK]

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

 SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example

[Figure: scatter plot of Humidity vs. Temperature; ● = play tennis, ○ = do not play tennis]

Linear classifiers: Which Hyperplane?

• There are many possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution:
  – It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.

SVM – Support Vector Machines

[Figure: two separating hyperplanes — one with a small margin, one with a large margin; the support vectors are the points lying on the margin]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving an SVM is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: separating hyperplane with support vectors on the maximized margin; a narrower margin is also shown for contrast]
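To make the margin idea concrete, here is a minimal linear classifier trained by stochastic sub-gradient descent on the regularized hinge loss (Pegasos-style). This is only a sketch of margin maximization on toy 2-D data; the paper's SMO classifier in Weka instead solves the dual quadratic program:

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=300):
    """Learn w, b approximately minimizing lam/2*||w||^2 + average hinge loss.
    `labels` must be -1 or +1; `points` are 2-D tuples."""
    w, b, t = [0.0, 0.0], 0.0, 0
    data = list(zip(points, labels))
    random.seed(0)
    for _ in range(epochs):
        random.shuffle(data)
        for (x1, x2), y in data:
            t += 1
            eta = 1.0 / (lam * t)            # decreasing learning rate
            # shrink w (sub-gradient of the regularizer; b is not regularized)
            w[0] -= eta * lam * w[0]
            w[1] -= eta * lam * w[1]
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:   # point inside the margin
                w[0] += eta * y * x1
                w[1] += eta * y * x2
                b += eta * y
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# linearly separable toy data
points = [(-2, -1), (-1, -2), (-1, -1), (1, 2), (2, 1), (1, 1)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
```

Only points with margin below 1 contribute updates, which is the algorithmic counterpart of the decision function depending only on the support vectors.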

Non-Separable Case


The Lagrangian trick

SVM

 Relatively new concept.
 Nice generalization properties.
 Hard to learn: trained in batch mode using quadratic programming techniques.
 Using kernels, SVMs can learn very complex functions.

Classification Model: K-Nearest Neighbor (KNN) Classifier

 Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
 A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set.
 Select the K examples nearest to E in the training set.
 Assign E to the most common class among its K nearest neighbors.

[Figure: points labeled "Response" / "No response"; the new example's class is predicted as Response]

Distance Between Neighbors

 Each example is represented by a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

  D(X, Y) = sqrt( Σ_{i=1..n} (xi − yi)² )

 Example:
  John: Age = 35, Income = 95K, No. of credit cards = 3
  Rachel: Age = 41, Income = 215K, No. of credit cards = 2

  Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]

Instance-Based Learning
 No model is built: all training examples are stored.
 Any processing is delayed until a new instance must be classified.

[Figure: points labeled "Response" / "No response"; the new instance's class is predicted as Response]

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distance from David (income in K):
 John:   sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²] = 15.17
 Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15.00
 Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.24
 Tom:    sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122.00
 Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²] = 15.75

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is: Yes.
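The worked example can be reproduced with a few lines of Python (standard library only; `math.dist` computes the Euclidean distance defined earlier):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    examples, using Euclidean distance. `train` is a list of
    (feature_tuple, label) pairs."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# The slide's data: (age, income in K, number of credit cards)
train = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]
david = (37, 50, 2)
# 3 nearest: Rachel (15.00), John (15.17), Nellie (15.75) -> majority "Yes"
```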

Strengths and Weaknesses

Strengths:
 Simple to implement and use.
 Comprehensible: easy to explain a prediction.
 Robust to noisy data, by averaging the k nearest neighbors.

Weaknesses:
 Needs a lot of space to store all the examples.
 Takes more time to classify a new example than with a model (the distance from the new example to every stored example must be calculated and compared).

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?


Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1

Jenny Lind — Payoff Table

Decision                 | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company  | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network     | $900,000         | $900,000          | $900,000
Prior probabilities      | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000  (the best expected value)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
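The same expected-value computation, written out as a check in Python:

```python
# Expected value of each decision = sum over states of payout * probability.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

ev_movie = sum(movie[s] * p for s, p in probs.items())  # ~ $960,000
ev_tv = sum(tv[s] * p for s, p in probs.items())        # ~ $900,000
best = "sign with movie company" if ev_movie > ev_tv else "sign with TV network"
```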

Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares (□)
  – Chance nodes, represented by circles (○)
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Figure: a decision node (square) branching into Decision 1 and Decision 2; Decision 1 leads to a chance node (circle) with Events 1–3]

Jenny Lind Decision Tree

[Figure: a decision node with two branches — "Sign with Movie Co." leading to a chance node with outcomes Small ($200,000, p = 0.3), Medium ($1,000,000, p = 0.6), and Large ($3,000,000, p = 0.1); and "Sign with TV Network" leading to a chance node paying $900,000 in every outcome]

Jenny Lind Decision Tree — Solved

[Figure: the same tree with expected returns filled in — ER = $960,000 on the movie branch, ER = $900,000 on the TV branch; the movie branch is selected]

[Figure: methodology cycle — dataset → data preprocessing → feature selection → classification (selected data mining tool) → performance evaluation → results]

Evaluation Metrics

                      | Predicted as healthy | Predicted as unhealthy
Actually healthy      | tp                   | fn
Actually not healthy  | fp                   | tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all 10 possibilities and average the results
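The splitting logic behind 10-fold cross-validation can be sketched as follows (Weka additionally stratifies the folds by class, which this sketch omits):

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k consecutive folds of (nearly) equal size,
    yielding (train, test) index lists — the splitting step of k-fold
    cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(150, k=10))
# 10 folds; every index appears in exactly one test fold
```

A model is then trained and evaluated once per fold, and the ten accuracies are averaged.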


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software, and the comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
 Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors

 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue (denser breast tissue carries a higher risk)
 Certain benign (non-cancerous) breast problems
 Lobular carcinoma in situ
 Menstrual periods

Risk factors (cont.)

 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese

BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
 Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
 Dr. S. Vijayarani et al. analyzed the performance of different classification functions in data mining for predicting heart disease from a heart disease dataset, using clustering accuracy and error rate as performance factors. The results show that the logistic classification function is more efficient than the multilayer perceptron and sequential minimal optimization.
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes; the existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

 Obtained from the UC Irvine machine learning repository; the data come from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign: 458 (65.5%); Malignant: 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, so those percentages refer to the original data; for the cleaned dataset the distribution is Benign: 444 (65%) and Malignant: 239 (35%).

Attribute                    | Domain
Sample Code Number           | Id Number
Clump Thickness              | 1 – 10
Uniformity of Cell Size      | 1 – 10
Uniformity of Cell Shape     | 1 – 10
Marginal Adhesion            | 1 – 10
Single Epithelial Cell Size  | 1 – 10
Bare Nuclei                  | 1 – 10
Bland Chromatin              | 1 – 10
Normal Nucleoli              | 1 – 10
Mitoses                      | 1 – 10
Class                        | 2 for Benign, 4 for Malignant

EVALUATION METHODS
 We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open-source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Figures: charts of the experimental results]

Importance of the input variables

Frequency of each value (1–10) for the input variables:

Variable                     | 1    | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness              | 139  | 50  | 104 | 79  | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size      | 373  | 45  | 52  | 38  | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape     | 346  | 58  | 53  | 43  | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion            | 393  | 58  | 58  | 33  | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size  | 44   | 376 | 71  | 48  | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                  | 402  | 30  | 28  | 19  | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin              | 150  | 160 | 161 | 39  | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli              | 432  | 36  | 42  | 18  | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                      | 563  | 35  | 33  | 12  | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                          | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

(The second "Bare Nuclei" row in the original corresponds to Bland Chromatin, which was otherwise missing from the table.)

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (seconds)    | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
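Applying these definitions to SMO's confusion matrix (below), with benign taken as the positive class, recovers the reported figures — a small Python check:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy, exactly as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO's counts on the 683-instance dataset: TP = 431, FN = 13, FP = 13, TN = 226
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
# acc ~ 0.9619, matching the 96.19% reported for SMO
```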

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
IBK        | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows = actual class):

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree    | 431              | 13                  | Benign
BF Tree    | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
IBK        | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
SMO        | 13               | 226                 | Malignant

Importance of the input variables

Variable                     | Chi-squared | Info Gain | Gain Ratio | Average    | Importance Rank
Clump Thickness              | 378.08158   | 0.464     | 0.152      | 126.23252  | 8
Uniformity of Cell Size      | 539.79308   | 0.702     | 0.300      | 180.265026 | 1
Uniformity of Cell Shape     | 523.07097   | 0.677     | 0.272      | 174.67332  | 2
Marginal Adhesion            | 390.0595    | 0.464     | 0.210      | 130.2445   | 7
Single Epithelial Cell Size  | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                  | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin              | 453.20971   | 0.555     | 0.201      | 151.32190  | 4
Normal Nucleoli              | 416.63061   | 0.487     | 0.237      | 139.11820  | 6
Mitoses                      | 191.9682    | 0.212     | 0.212      | 64.122733  | 9

CONCLUSION
 The accuracy of the classification techniques was evaluated based on the selected classifier algorithms.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
 Use an updated version of Weka.
 Use another data mining tool.
 Use alternative algorithms and techniques.

Notes on the paper
 Spelling mistakes.
 No point of contact (e-mail).
 Wrong percentage calculation.
 Copying from old papers.
 Charts not clear.
 No contributions.

Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 – 0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905:861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


Page 11: A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Computer-Aided Diagnosis

• Mammography allows for efficient diagnosis of breast cancers at an earlier stage.

• Radiologists misdiagnose 10–30% of the malignant cases.

• Of the cases sent for surgical biopsy, only 10–20% are actually malignant.

Computational Intelligence

Computational Intelligence = Data + Knowledge

[Figure: fields contributing to computational intelligence — artificial intelligence, expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, neural networks]

What do these methods do?

• Provide non-parametric models of data.
• Allow classifying new data into pre-defined categories, supporting diagnosis & prognosis.
• Allow discovering new categories.
• Allow understanding the data by creating fuzzy or crisp logical rules.
• Help to visualize multi-dimensional relationships among data samples.

Pattern recognition system decomposition

[Figure: pipeline — dataset → data preprocessing → feature selection → selecting the data mining tool → classification algorithm (SMO, IBK, BF Tree) → results and evaluations]

data sets

[Figure: methodology cycle — dataset → data preprocessing → feature selection → classification (selected data mining tool) → performance evaluation → results]

results

Data preprocessing

Feature selectionclassification

Selection tool datamining

Performance evaluation Cycle

Dataset

AAST-Comp eng 18

Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit data to a model – descriptive or predictive.

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describes the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

Data mining tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Measures: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning – fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration – integration of multiple databases, data cubes, or files.
• Data transformation – normalization and aggregation.
• Data reduction – obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization – part of data reduction, of particular importance for numerical data.
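As a minimal sketch of the data-cleaning step, the snippet below handles missing values in two of the ways listed above (dropping the row, or imputing with the column mean). The tiny dataset and its attribute names are made up for illustration, not taken from the paper:

```python
# Hypothetical mini-dataset: one row has a missing "bare_nuclei" value (None).
rows = [
    {"clump": 5, "bare_nuclei": 1},
    {"clump": 3, "bare_nuclei": None},   # missing value
    {"clump": 8, "bare_nuclei": 10},
    {"clump": 1, "bare_nuclei": 1},
]

# Option 1: drop rows with missing values (the strategy used later in the paper).
dropped = [r for r in rows if r["bare_nuclei"] is not None]

# Option 2: impute the missing value with the mean of the observed values.
observed = [r["bare_nuclei"] for r in rows if r["bare_nuclei"] is not None]
mean_bn = sum(observed) / len(observed)
imputed = [dict(r, bare_nuclei=r["bare_nuclei"] if r["bare_nuclei"] is not None else mean_bn)
           for r in rows]

print(len(dropped))               # 3
print(imputed[1]["bare_nuclei"])  # 4.0 (mean of 1, 10, 1)
```

Dropping is simplest when few rows are affected; imputation preserves the sample size at the cost of injecting an estimate.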


Feature selection

Feature selection means finding a feature subset that carries the most discriminative information from the original feature space. The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.

Feature Selection

• Transforming a dataset by removing some of its columns:
  A1 A2 A3 A4 C  →  A2 A4 C
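The column-removal transformation on the slide can be sketched in a few lines; the column names A1–A4, C come from the slide, while the values and the chosen subset are illustrative assumptions:

```python
# Transform a dataset with columns A1..A4, C into one that keeps only the
# selected features (A2, A4) plus the class column C.
header = ["A1", "A2", "A3", "A4", "C"]
data = [
    [1, 7, 0, 3, "yes"],
    [2, 5, 9, 1, "no"],
]

selected = ["A2", "A4", "C"]                      # subset chosen by some criterion
keep = [header.index(name) for name in selected]  # column indices to retain
reduced = [[row[i] for i in keep] for row in data]

print(reduced[0])   # [7, 3, 'yes']
```

In practice the subset would be chosen by a ranking criterion such as the chi-squared or information-gain scores reported later in the paper.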


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on a model built from the training set of known categories.

Classification (Recognition) (Supervised Classification): Category "A" vs. Category "B".

Classification

• Every day, all the time, we classify things.
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.

Classification vs. Prediction

• Classification:
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values

Classification: A Two-Step Process

• Model construction: describing a set of predetermined classes.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
  – Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples correctly classified by the model.
  – The test set must be independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
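The learned rule above can be applied directly in code. This is a sketch of the slide's rule only (the helper name is ours), used to classify the unseen example and to score the rule on the testing data:

```python
# The rule learned in step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: classify the unseen example from the slide, (Jeff, Professor, 4).
print(predict_tenured("Professor", 4))   # yes

# Estimating accuracy on the testing data: the rule misclassifies Merlisa
# (Associate Prof, 7 years, not tenured), so accuracy is 3/4.
testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing)
print(correct / len(testing))   # 0.75
```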

Classification

• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal);
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.

Quality of a classifier

• Quality is also judged with respect to computing time: lower is better.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method for each class.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Techniques: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyperplane and the data.


Tennis example

(Figure: points plotted by Humidity and Temperature, labeled "play tennis" vs. "do not play tennis".)

Linear classifiers: Which hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
• One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

The line ax + by − c = 0 represents the decision boundary.

Selection of a Good Hyperplane

Objective: select a "good" hyperplane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyperplane far from the data.

SVM – Support Vector Machines: of a small margin and a large margin, SVM prefers the large one; the support vectors are the points that define it.

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving an SVM is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
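The geometric picture above can be made concrete without a solver: for a fixed decision boundary ax + by − c = 0 (the form used in the earlier slide), the margin is twice the distance from the closest point to the boundary. The boundary coefficients and the 2-D points below are made-up illustrative values, not data from the paper:

```python
import math

a, b, c = 1.0, 1.0, 4.0          # hypothetical boundary: x + y - 4 = 0

def distance(x, y):
    """Perpendicular distance from (x, y) to the line ax + by - c = 0."""
    return abs(a * x + b * y - c) / math.sqrt(a ** 2 + b ** 2)

positives = [(3, 3), (4, 3)]      # class +1 (above the line)
negatives = [(1, 1), (0, 2)]      # class -1 (below the line)

# The closest points on each side are the "support vectors"; the margin is
# twice the smallest distance to the boundary.
margin = 2 * min(distance(x, y) for x, y in positives + negatives)
print(round(margin, 4))   # 2.8284, i.e. 2*sqrt(2) for this toy configuration
```

An SVM solver searches over all (a, b, c) for the boundary that separates the classes with the largest such margin.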

Non-Separable Case

The Lagrangian trick: the constrained optimization is solved through its Lagrangian (dual) formulation.

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples in the training set nearest to E.
• Assign E to the most common class among its K nearest neighbors.

(Figure: training points labeled "Response" / "No response"; the new example is assigned class "Response".)

Distance Between Neighbors

• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1}^{n} (xi − yi)² )

Example:
John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
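The slide leaves the distance unevaluated; a short sketch computes it (income expressed in thousands, as in the example):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in K, number of credit cards
rachel = (41, 215, 2)

d = euclidean(john, rachel)
print(round(d, 2))   # 120.15
```

Note that the income term (120²) dominates the result because it is on a much larger scale than age or card count; this is exactly why the normalization step listed under preprocessing matters for KNN.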

Instance-Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Distances from David:
John:   sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel: sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah: sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom:    sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie: sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David is classified as "Yes".
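The worked example above can be reproduced end to end with a small pure-Python 3-NN sketch (incomes in thousands, as in the table):

```python
import math
from collections import Counter

train = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort training examples by distance to David and take the 3 nearest.
nearest = sorted(train, key=lambda ex: euclidean(ex[1], david))[:3]

# Majority vote among the 3 nearest neighbors decides David's class.
prediction = Counter(label for _, _, label in nearest).most_common(1)[0][0]

print([name for name, _, _ in nearest])  # ['Rachel', 'John', 'Nellie']
print(prediction)                        # Yes
```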

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1

Jenny Lind – Payoff Table

Decisions                 Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company   $200,000           $1,000,000          $3,000,000
Sign with TV Network      $900,000           $900,000            $900,000
Prior probabilities       0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (= EVUII, or EVBest)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
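The expected-return arithmetic above is easy to verify in code; the figures are exactly those from the payoff table:

```python
# Expected value of each decision, from the slide's payoff table.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv    = {"small": 900_000, "medium": 900_000, "large": 900_000}

ev_movie = sum(probs[s] * movie[s] for s in probs)
ev_tv    = sum(probs[s] * tv[s] for s in probs)

print(round(ev_movie))                          # 960000
print(round(ev_tv))                             # 900000
print("movie" if ev_movie > ev_tv else "tv")    # movie
```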

Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares (□)
  – Chance nodes, represented by circles (○)
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branches into Decision 1 and Decision 2; a chance node branches into Event 1, Event 2, and Event 3.)

Jenny Lind Decision Tree

Decision node with two branches, each leading to a chance node:
• Sign with Movie Co.:
  – Small box office (0.3): $200,000
  – Medium box office (0.6): $1,000,000
  – Large box office (0.1): $3,000,000
• Sign with TV Network:
  – Small box office (0.3): $900,000
  – Medium box office (0.6): $900,000
  – Large box office (0.1): $900,000

Jenny Lind Decision Tree – Solved

• ER(Movie Co.) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000
• ER(TV Network) = $900,000
• The best decision (ER = $960,000) is to sign with the movie company.


Evaluation Metrics

                   Predicted as healthy   Predicted as unhealthy
Actual healthy     tp                     fn
Actual unhealthy   fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces.
  – Train on 9 pieces and test on the remainder.
  – Do this for all possibilities and average.
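The three bullet points above can be sketched from scratch without an ML library. The 100-example parity dataset and the trivial "classifier" are stand-ins for illustration only:

```python
# 10-fold cross-validation: split into 10 equal pieces, train on 9,
# test on the remainder, and average the fold accuracies.
data = [(i, "even" if i % 2 == 0 else "odd") for i in range(100)]  # toy dataset

def train_classifier(training):
    # Trivial stand-in "model": the parity rule (a real run would fit
    # SMO / IBK / BF Tree here).
    return lambda x: "even" if x % 2 == 0 else "odd"

k = 10
fold_size = len(data) // k
accuracies = []
for fold in range(k):
    test = data[fold * fold_size:(fold + 1) * fold_size]
    training = data[:fold * fold_size] + data[(fold + 1) * fold_size:]
    model = train_classifier(training)
    correct = sum(model(x) == label for x, label in test)
    accuracies.append(correct / len(test))

print(sum(accuracies) / k)   # 1.0 for this noise-free toy data
```

Every example is used for testing exactly once, which makes the averaged accuracy a less optimistic estimate than testing on the training set.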

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed and usually don't grow back.

Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.

Risk factors

• Gender, age, genetic risk factors, family history, personal history of breast cancer.
• Race: white or black.
• Dense breast tissue: denser breast tissue carries a higher risk.
• Certain benign (non-cancerous) breast problems; lobular carcinoma in situ.
• Menstrual periods.

Risk factors (cont.)

• Breast radiation early in life.
• Treatment with the drug DES (diethylstilbestrol) during pregnancy.
• Not having children, or having them later in life.
• Certain kinds of birth control.
• Using hormone therapy after menopause.
• Not breastfeeding.
• Alcohol.
• Being overweight or obese.

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)

• Bellaachi et al. used naïve Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analyzing the efficiency of the algorithms were classification accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)

• C. Kaewchinporn presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine Machine Learning Repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• The 16 instances with missing values were removed from the dataset to construct a new dataset with 683 instances.
• Class distribution as stated: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
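The corrected class-distribution figures above are easy to double-check with a few lines of arithmetic on the counts given on the slide:

```python
# Corrected class counts for the 683-instance dataset (after removing
# 14 benign and 2 malignant instances with missing values).
benign, malignant = 458 - 14, 241 - 2
total = benign + malignant

print(benign, malignant, total)         # 444 239 683
print(round(100 * benign / total))      # 65
print(round(100 * malignant / total))   # 35
```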

Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1–10
Uniformity of Cell Size       1–10
Uniformity of Cell Shape      1–10
Marginal Adhesion             1–10
Single Epithelial Cell Size   1–10
Bare Nuclei                   1–10
Bland Chromatin               1–10
Normal Nucleoli               1–10
Mitoses                       1–10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables

Distribution of attribute values (counts per value 1–10):

Domain                        1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness             139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size     373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape    346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion           393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size  44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                 402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin             150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli             432   36   42   18   19   22   16   23   15   60   683
Mitoses                     563   35   33   12    6    3    9    8    0   14   683
Sum                        2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (sec)          0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS (cont.)

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
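Applying these definitions to the SMO confusion matrix from the experimental results (taking benign as the positive class: TP = 431, FN = 13, FP = 13, TN = 226) reproduces the paper's reported figures:

```python
# SMO confusion-matrix counts, benign treated as the positive class.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3))      # 0.971, matching SMO's benign TP rate
print(round(specificity, 3))      # 0.946, matching SMO's malignant TP rate
print(round(accuracy * 100, 2))   # 96.19, matching SMO's reported accuracy
```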

EXPERIMENTAL RESULTS (cont.)

Classifier   Class       TP Rate   FP Rate   Precision   Recall
BF Tree      Benign      0.971     0.075     0.960       0.971
             Malignant   0.925     0.029     0.944       0.925
IBK          Benign      0.980     0.079     0.958       0.980
             Malignant   0.921     0.020     0.961       0.921
SMO          Benign      0.971     0.054     0.971       0.971
             Malignant   0.946     0.029     0.946       0.946

EXPERIMENTAL RESULTS (cont.)

Confusion matrices (rows: actual class; columns: predicted class):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

Importance of the input variables (cont.)

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Rank
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.30         180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9

CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on the paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003:188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 12: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Computational Intelligence

[Diagram: Computational Intelligence = Data + Knowledge, drawing on Artificial Intelligence: expert systems, fuzzy logic, pattern recognition, machine learning, probabilistic methods, multivariate statistics, visualization, evolutionary algorithms, and neural networks]

04072023 AAST-Comp eng 12

What do these methods do?

• Provide non-parametric models of data
• Allow classifying new data into pre-defined categories, supporting diagnosis & prognosis
• Allow discovering new categories
• Allow understanding the data by creating fuzzy or crisp logical rules
• Help to visualize multi-dimensional relationships among data samples

04072023 AAST-Comp eng 13

14

Pattern recognition system decomposition:
Dataset → Data preprocessing → Feature selection → Selecting the data mining tool → Classification algorithm (SMO, IBK, BF Tree) → Results and evaluation

AAST-Comp eng 04072023

[Diagram: performance evaluation cycle - Dataset → Data preprocessing → Feature selection → Classification → Performance evaluation → Results, around the selection of the data mining tool]

data sets

AAST-Comp eng 16 04072023

[Diagram: the same performance evaluation cycle, repeated]

AAST-Comp eng 18

Data Mining

• Data Mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit the data to a model, either descriptive or predictive.

04072023

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples (the training set) whose class labels are known. Once built, the model can be used to predict the class of new, unclassified examples.

• Descriptive: describes the general or special features of a set of data in a concise manner.

AAST-Comp eng 1904072023

AAST-Comp eng 20

Data Mining Models and Tasks

04072023

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software

21AAST-Comp eng04072023

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

04072023 AAST-Comp eng 22


Data Preprocessing

• Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Quality measures: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

AAST-Comp eng 2404072023

Preprocessing techniques

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization: part of data reduction, of particular importance for numerical data

AAST-Comp eng 2504072023
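The cleaning and transformation steps above can be sketched in a few lines; the values here are illustrative, not from the paper's dataset: a missing attribute value is filled with the column mean, then the column is min-max normalized.

```python
# Minimal preprocessing sketch: mean-imputation of a missing value,
# followed by min-max normalization to [0, 1]. Values are made up.
column = [5.0, 1.0, None, 10.0]        # one attribute with a missing entry

present = [v for v in column if v is not None]
mean = sum(present) / len(present)      # mean of the observed values
filled = [mean if v is None else v for v in column]

lo, hi = min(filled), max(filled)
normalized = [(v - lo) / (hi - lo) for v in filled]
print(normalized)
```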


Feature selection

Finding a feature subset that carries the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

AAST-Comp eng 2704072023

Feature Selection

• Transforming a dataset by removing some of its columns, e.g. keeping only the selected attributes:

  (A1, A2, A3, A4, C) → (A2, A4, C)

04072023 AAST-Comp eng 28
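The column-removal view of feature selection translates directly; the attribute names A1..A4 follow the slide's generic example, and the rows are made up:

```python
# Feature selection as projection onto the selected columns A2, A4
# plus the class attribute C (toy rows, illustrative only).
dataset = [
    {"A1": 1, "A2": 7, "A3": 0, "A4": 3, "C": "benign"},
    {"A1": 4, "A2": 2, "A3": 9, "A4": 8, "C": "malignant"},
]
selected = ["A2", "A4", "C"]
reduced = [{k: row[k] for k in selected} for row in dataset]
print(reduced)
```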


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on a model built from the training set of known categories.

Classification (Recognition) (Supervised Classification): Category "A" vs. Category "B"

AAST-Comp eng 30 04072023

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not.

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction:
• models continuous-valued functions, i.e. predicts unknown or missing values

04072023 AAST-Comp eng 33

Classification - A Two-Step Process

Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects
• Estimate the accuracy of the model:
  - The known label of each test sample is compared with the model's prediction.
  - The accuracy rate is the percentage of test-set samples correctly classified by the model.
  - The test set is independent of the training set; otherwise over-fitting will occur.
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification algorithm → Classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

04072023 AAST-Comp eng 35
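The learned rule on this slide can be checked mechanically against the training set and then applied to an unseen example:

```python
# The slide's classifier: IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'.
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor",      2, "yes"),
    ("Jim",  "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# The rule reproduces every training label ...
assert all(tenured(rank, years) == label for _, rank, years, label in training)
# ... and predicts the unseen example (Jeff, Professor, 4).
print(tenured("Professor", 4))  # yes
```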

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Classification
• is a data mining (machine learning) technique used to predict group membership of data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

AAST-Comp eng 3604072023

Classification

• predicts categorical class labels (discrete or nominal)

• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

AAST-Comp eng 3704072023

Quality of a classifier
• Quality is also assessed with respect to computing time (lower is better).
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus, the diagonal elements represent correctly classified compounds,
• and the off-diagonal elements represent misclassified compounds.

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

Classification techniques: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK

40 04072023AAST-Comp eng

Classification model: Support Vector Machine (SVM) classifier (V. Vapnik)

04072023 AAST-Comp eng 41

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.

04072023AAST-Comp eng42


Tennis example

[Scatter plot: Temperature vs. Humidity; points marked "play tennis" vs. "do not play tennis"]

04072023 AAST-Comp eng 44

Linear classifiers: Which hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by - c = 0

45

[Ch. 15]

04072023 AAST-Comp eng
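The distance from a point to the boundary ax + by - c = 0 is |ax + by - c| / sqrt(a² + b²), and the margin is the smallest such distance over the training points. A small sketch with made-up coefficients and points (not from the slides):

```python
import math

# Geometric margin of a linear decision boundary a*x + b*y - c = 0
# (coefficients and points are illustrative, not from the slides).
a, b, c = 1.0, 1.0, 2.0                 # boundary: x + y - 2 = 0

def distance(x, y):
    """Perpendicular distance from (x, y) to the boundary."""
    return abs(a * x + b * y - c) / math.hypot(a, b)

points = [(0.0, 0.0), (3.0, 3.0), (2.0, 1.0)]
margin = min(distance(x, y) for x, y in points)
print(margin)
```

SVM training chooses a, b, c to make this margin as large as possible.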

Selection of a good hyper-plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

04072023 AAST-Comp eng 46

SVM - Support Vector Machines

[Figure: support vectors; small margin vs. large margin]

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving an SVM is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: support vectors; maximized margin vs. narrower margin; Sec. 15.1]

48

04072023 AAST-Comp eng

Non-separable case

[Figure: the non-separable case and the Lagrangian trick]

04072023 AAST-Comp eng 49

SVM
• Relatively new concept
• Nice generalization properties
• Hard to learn: learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

04072023 AAST-Comp eng 51

Classification model: K-Nearest Neighbor classifier

04072023 AAST-Comp eng 52

K-Nearest Neighbor classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: points labeled "response" / "no response"; E is assigned the class "response"]

04072023 AAST-Comp eng 54

Distance between neighbors

Each example is represented by a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )

Example:
John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

04072023 AAST-Comp eng 55
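The distance formula translates directly; income is expressed in thousands, as on the slide:

```python
import math

# Euclidean distance between John (35, 95K, 3) and Rachel (41, 215K, 2).
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john   = (35, 95, 3)
rachel = (41, 215, 2)
print(euclidean(john, rachel))
```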

Instance-based learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: points labeled "response" / "no response"; the new instance is assigned the class "response"]

04072023 AAST-Comp eng 56

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distances from David (income in K):
John:   sqrt[(35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2] = 15.16
Rachel: sqrt[(22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2] = 15
Hannah: sqrt[(63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2] = 152.23
Tom:    sqrt[(59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2] = 122
Nellie: sqrt[(25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David is predicted: Yes.

04072023 AAST-Comp eng 58
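The whole worked example fits in a few lines, reproducing the slide's prediction for David:

```python
import math

# 3-NN classification of David (37, 50K, 2 cards) from the slide's table.
customers = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

nearest = sorted(customers, key=lambda c: dist(c[1], david))[:3]
votes = [label for _, _, label in nearest]
prediction = max(set(votes), key=votes.count)   # majority vote
print(prediction)  # Yes (Rachel and Nellie outvote John)
```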

Strengths and weaknesses

Strengths:
• Simple to implement and use
• Comprehensible: easy to explain the prediction
• Robust to noisy data, by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

• The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she receives depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small box office) = 0.3
  - P(Medium box office) = 0.6
  - P(Large box office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decision                  Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior probabilities       0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
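The two expected values can be recomputed directly from the payoff table:

```python
# Expected returns for Jenny Lind's two options.
probs  = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoff = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}

ev_movie = sum(probs[s] * payoff[s] for s in probs)   # 960,000 (up to float rounding)
ev_tv    = 900_000                                    # flat rate
print(ev_movie, ev_tv)
```

Since 960,000 > 900,000, the movie contract is the better decision under the expected-return criterion.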

Decision trees
• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example decision tree

[Figure: a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2, and Event 3]

04072023 AAST-Comp eng 67

Jenny Lind decision tree

[Figure: the decision "Sign with Movie Co." leads to a chance node with branches Small ($200,000), Medium ($1,000,000), and Large ($3,000,000) box office; "Sign with TV Network" leads to a chance node paying $900,000 on every branch]

04072023 AAST-Comp eng 68

Jenny Lind decision tree (with probabilities)

[Figure: the same tree with branch probabilities 0.3, 0.6, and 0.1 on each chance node; the expected returns (ER) are still to be computed]

04072023 AAST-Comp eng 69

Jenny Lind decision tree - solved

[Figure: the movie chance node has ER = $960,000 and the TV chance node has ER = $900,000, so the decision node takes ER = $960,000: sign with the movie company]

04072023 AAST-Comp eng 70


Evaluation metrics

                   Predicted healthy   Predicted unhealthy
Actual healthy     tp                  fn
Actual unhealthy   fp                  tn

AAST-Comp eng 7204072023

Cross-validation

• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all 10 possibilities and average the results

04072023 AAST-Comp eng 73
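The fold bookkeeping behind 10-fold cross-validation is simple to sketch; there is no real classifier here, only the partitioning of the 150 instances behind the slide's numbers:

```python
# 10-fold cross-validation splits: each fold is the test set once,
# while the remaining nine folds form the training set.
indices = list(range(150))          # 150 instances, as on the slide
k = 10
folds = [indices[i::k] for i in range(k)]

for fold in folds:
    test_set = set(fold)
    train = [x for x in indices if x not in test_set]
    # a classifier would be trained on `train` and scored on `fold` here

print(len(folds), len(folds[0]))
```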

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract

The aim of this paper is to investigate the performance of different classification techniques.

The goal is to develop accurate prediction models for breast cancer using data mining techniques.

Three classification techniques are compared in the Weka software; the comparison shows that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

AAST-Comp eng

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

78

0407202379

BACKGROUND

Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.

Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND (cont.)

Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.

Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND (cont.)

Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND (cont.)

Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND (cont.)

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was tested on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.

B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset, to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages refer to the original data; for the cleaned dataset the distribution is Benign 444 (65%) and Malignant 239 (35%).

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant

0407202387

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software, issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Value counts per attribute (domain values 1-10):

Attribute                     1     2    3    4    5    6    7    8    9   10   Sum
Clump Thickness               139   50   104  79   128  33   23   44   14  69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28   6   67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27   7   58   683
Marginal Adhesion             393   58   58   33   23   21   13   25   4   55   683
Single Epithelial Cell Size   44    376  71   48   39   40   11   21   2   31   683
Bare Nuclei                   402   30   28   19   30   4    8    21   9   132  683
Bland Chromatin               150   160  161  39   34   9    71   28   11  20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15  60   683
Mitoses                       563   35   33   12   6    3    9    8    0   14   683
Sum                           2843  850  605  333  346  192  207  233  77  516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation criteria                BF Tree  IBK    SMO
Time to build model (sec)          0.97     0.02   0.33
Correctly classified instances     652      655    657
Incorrectly classified instances   31       28     26
Accuracy (%)                       95.46    95.90  96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

92 04072023 AAST-Comp eng
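These definitions can be checked against the SMO confusion matrix reported in the results (431 benign correctly classified, 13 benign misclassified, 13 malignant misclassified, 226 malignant correctly classified), taking benign as the positive class:

```python
# Sensitivity, specificity and accuracy for the SMO classifier,
# with benign as the positive class: TP=431, FN=13, FP=13, TN=226.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate
accuracy    = (tp + tn) / (tp + fp + tn + fn)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
```

The accuracy of 0.9619 matches the 96.19% reported for SMO.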

EXPERIMENTAL RESULTS

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.960      0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.980  0.079  0.958      0.980   Benign
            0.921  0.020  0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

93 04072023 AAST-Comp eng

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class):

Classifier  Predicted Benign  Predicted Malignant  Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant

94 04072023 AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

Variable                      Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance
Clump Thickness               378.08158    0.464      0.152       126.232526    8
Uniformity of Cell Size       539.79308    0.702      0.300       180.265026    1
Uniformity of Cell Shape      523.07097    0.677      0.272       174.673323    2
Marginal Adhesion             390.0595     0.464      0.210       130.2445      7
Single Epithelial Cell Size   447.86118    0.534      0.233       149.542726    5
Bare Nuclei                   489.00953    0.603      0.303       163.305176    3
Bland Chromatin               453.20971    0.555      0.201       151.321903    4
Normal Nucleoli               416.63061    0.487      0.237       139.118203    6
Mitoses                       191.9682     0.212      0.212       64.122733     9

04072023AAST-Comp eng96
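The information-gain scores in this ranking start from the entropy of the class distribution; a small sketch using the cleaned dataset's class counts (444 benign, 239 malignant):

```python
import math

# Entropy of the class distribution; the information gain of an attribute
# is this value minus the weighted entropy after splitting on it.
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

h_class = entropy([444, 239])   # benign vs. malignant
print(round(h_class, 3))
```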

0407202397

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

AAST-Comp eng

Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

04072023AAST-Comp eng99

Comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.

That paper introduced a more advanced idea, making a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023

04072023105

Thank you

AAST-Comp eng

Page 13: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

What do these methods do?

• Provide non-parametric models of data
• Classify new data into pre-defined categories, supporting diagnosis and prognosis
• Discover new categories
• Help to understand the data by creating fuzzy or crisp logical rules
• Help to visualize multi-dimensional relationships among data samples

Pattern recognition system decomposition

(Figure: the processing cycle used in this work)
Dataset -> Data preprocessing -> Feature selection -> Classification (data mining tool selection; SMO, IBK, BF Tree) -> Results and performance evaluation

Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit data to a model - descriptive or predictive.

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples (the training set) belonging to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

(Figure: taxonomy of data mining models and tasks)

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software


Weka

• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Quality measures: accuracy, completeness, consistency, timeliness, believability, value added and accessibility.

Preprocessing techniques

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization: part of data reduction, of particular importance for numerical data.
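Two of these steps can be sketched in a few lines. This is an illustrative sketch, not code from the paper: it fills missing values with the column mean (data cleaning) and applies min-max normalization (data transformation); the function name and data are invented for the example.

```python
# Illustrative sketch: mean-imputation of missing values, then
# min-max normalization of a single numeric column to [0, 1].
def clean_and_normalize(column):
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    filled = [mean if v is None else v for v in column]   # data cleaning
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]         # normalization

print(clean_and_normalize([1, None, 3, 5]))  # -> [0.0, 0.5, 0.5, 1.0]
```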


Feature selection

Finding a feature subset that carries the most discriminative information from the original feature space.

The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.

Feature Selection

• Transforming a dataset by removing some of its columns, e.g. (A1, A2, A3, A4, C) -> (A2, A4, C).
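Viewed this way, feature selection is just column removal. A minimal sketch of the slide's (A1, A2, A3, A4, C) -> (A2, A4, C) example, with invented row values:

```python
# Feature selection as column removal: keep columns A2, A4 and the
# class label C from rows shaped (A1, A2, A3, A4, C).
def select_features(rows, keep=(1, 3, 4)):
    return [tuple(row[i] for i in keep) for row in rows]

rows = [(5, 1, 2, 7, "benign"), (9, 8, 6, 4, "malignant")]
print(select_features(rows))  # -> [(1, 7, 'benign'), (8, 4, 'malignant')]
```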


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built from the training set of known categories (e.g. category "A" vs. category "B" - classification/recognition).

Classification

• Every day, all the time, we classify things.
• E.g. crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not.

Classification vs. Prediction

• Classification: predicts categorical class labels (discrete or nominal); constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
• Prediction: models continuous-valued functions, i.e. predicts unknown or missing values.

Classification - A Two-Step Process

• Model construction: describing a set of predetermined classes.
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees or mathematical formulae.
• Model usage: classifying future or unknown objects.
  - Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples correctly classified by the model.
  - The test set must be independent of the training set, otherwise over-fitting will occur.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

  NAME   RANK            YEARS   TENURED
  Mike   Assistant Prof  3       no
  Mary   Assistant Prof  7       yes
  Bill   Professor       2       yes
  Jim    Associate Prof  7       yes
  Dave   Assistant Prof  6       no
  Anne   Associate Prof  3       no

The classification algorithm produces the classifier (model):

  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

  NAME     RANK            YEARS   TENURED
  Tom      Assistant Prof  2       no
  Merlisa  Associate Prof  7       no
  George   Professor       5       yes
  Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
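The two-step process above can be sketched directly: encode the learned rule as a function, score it against the testing data, then apply it to the unseen tuple. A minimal sketch using the slide's own rule and data (only the function name is invented):

```python
# The model learned in step 1: IF rank = 'professor' OR years > 6
# THEN tenured = 'yes'.
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the independent testing data.
testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(classify(rank, yrs) == label for _, rank, yrs, label in testing)
print(f"test accuracy: {correct}/{len(testing)}")  # 3/4 (Merlisa misclassified)

# Step 2b: classify unseen data.
print(classify("Professor", 4))  # Jeff -> 'yes'
```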

Classification

• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• Predicts categorical class labels (discrete or nominal).
• Constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.

Quality of a classifier

• Quality is assessed with respect to the lowest computing time.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

(Figure: taxonomy of classification techniques)
Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example

(Figure: examples plotted by Humidity vs. Temperature, labeled "play tennis" / "do not play tennis")

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c in the decision boundary ax + by - c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
  - one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.

SVM - Support Vector Machines

(Figure: support vectors; a small-margin vs. a large-margin separating hyperplane)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving an SVM is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(Figure: support vectors and the maximized vs. narrower margin)

Non-Separable Case

(Figures: the non-separable case and the Lagrangian trick)

SVM summary:
• Relatively new concept.
• Nice generalization properties.
• Hard to learn - learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.
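The margin-maximization idea is easiest to see in one dimension: for separable classes on a line, the max-margin "hyperplane" is simply the midpoint between the innermost point of each class, and those two points are the support vectors. A toy sketch of that intuition (not the quadratic-programming solver a real SVM uses; the function name and data are invented):

```python
# Toy 1-D illustration of margin maximization: with linearly separable
# classes, the max-margin boundary is the midpoint between the innermost
# points of the two classes -- those two points are the "support vectors".
def max_margin_boundary(negatives, positives):
    sv_neg, sv_pos = max(negatives), min(positives)   # support vectors
    assert sv_neg < sv_pos, "classes must be linearly separable"
    boundary = (sv_neg + sv_pos) / 2
    margin = (sv_pos - sv_neg) / 2
    return boundary, margin

print(max_margin_boundary([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]))  # -> (5.0, 2.0)
```

Note that the answer depends only on the support vectors (3.0 and 7.0); moving the other points has no effect, which mirrors the statement above that the decision function is fully specified by a subset of the training samples.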

Classification Model: K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

(Figure: "response" / "no response" points around a query point classified as "response")

Distance Between Neighbors

• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is defined as:

  D(X, Y) = sqrt( sum_{i=1..n} (xi - yi)^2 )

• Example: John (Age = 35, Income = 95K, no. of credit cards = 3) and Rachel (Age = 41, Income = 215K, no. of credit cards = 2):

  Distance(John, Rachel) = sqrt( (35 - 41)^2 + (95 - 215)^2 + (3 - 2)^2 )

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: "response" / "no response" points around a query point classified as "response")

Example: 3-Nearest Neighbors

  Customer   Age   Income   No. credit cards   Response
  John       35    35K      3                  No
  Rachel     22    50K      2                  Yes
  Hannah     63    200K     1                  No
  Tom        59    170K     1                  No
  Nellie     25    40K      4                  Yes
  David      37    50K      2                  ?

Distances from David:

  John:    sqrt( (35-37)^2 + (35-50)^2  + (3-2)^2 ) = 15.16
  Rachel:  sqrt( (22-37)^2 + (50-50)^2  + (2-2)^2 ) = 15.00
  Hannah:  sqrt( (63-37)^2 + (200-50)^2 + (1-2)^2 ) = 152.23
  Tom:     sqrt( (59-37)^2 + (170-50)^2 + (1-2)^2 ) = 122.00
  Nellie:  sqrt( (25-37)^2 + (40-50)^2  + (4-2)^2 ) = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes, so David is classified as Yes.
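The worked example above can be reproduced in a few lines of Python. This is a sketch of the slide's 3-NN calculation (variable names are invented):

```python
import math

# The slide's 3-NN example: predict David's response from the three
# closest customers by Euclidean distance over (age, income, cards).
training = [("John",   (35, 35, 3),  "No"),
            ("Rachel", (22, 50, 2),  "Yes"),
            ("Hannah", (63, 200, 1), "No"),
            ("Tom",    (59, 170, 1), "No"),
            ("Nellie", (25, 40, 4),  "Yes")]
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

nearest = sorted(training, key=lambda t: dist(t[1], david))[:3]
labels = [label for _, _, label in nearest]
prediction = max(set(labels), key=labels.count)   # majority vote
print(prediction)  # -> Yes (neighbors: Rachel, John, Nellie)
```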

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible - easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to every stored example must be calculated and compared).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small box office) = 0.3
  - P(Medium box office) = 0.6
  - P(Large box office) = 0.1

Jenny Lind - Payoff Table

                              States of Nature
  Decision                    Small Box Office   Medium Box Office   Large Box Office
  Sign with Movie Company     $200,000           $1,000,000          $3,000,000
  Sign with TV Network        $900,000           $900,000            $900,000
  Prior probabilities         0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
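The expected-value computation above is a direct weighted sum over the states of nature. A minimal sketch (dictionary names are invented):

```python
# Expected value of each decision from the payoff table.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def ev(payoffs):
    return sum(p * payoffs[state] for state, p in probs.items())

print(round(ev(movie)), round(ev(tv)))  # -> 960000 900000
best = "movie" if ev(movie) > ev(tv) else "tv"
print(best)  # -> movie
```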

Decision Trees

• Three types of "nodes":
  - decision nodes, represented by squares;
  - chance nodes, represented by circles;
  - terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branching into Decision 1 and Decision 2, each leading to a chance node over Events 1-3)

Jenny Lind Decision Tree - Solved

(Figure: a decision node chooses between "Sign with Movie Co." and "Sign with TV Network". Each choice leads to a chance node over Small/Medium/Large box office with probabilities 0.3/0.6/0.1, and payoffs $200,000/$1,000,000/$3,000,000 for the movie versus a flat $900,000 for TV. Expected returns: movie ER = $960,000, TV ER = $900,000, so the movie branch is kept.)


Evaluation Metrics

                       Predicted as healthy   Predicted as unhealthy
  Actual healthy       tp                     fn
  Actual not healthy   fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  - split the data into 10 equal-sized pieces;
  - train on 9 pieces and test on the remainder;
  - do this for all 10 possibilities and average the results.
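The fold bookkeeping behind 10-fold cross-validation can be sketched as follows (a generic sketch of the splitting scheme, not Weka's internal implementation; names are invented):

```python
# 10-fold cross-validation index bookkeeping: split n instances into
# 10 folds; each fold serves once as the test set while the other 9
# folds form the training set. Scores are then averaged over folds.
def kfold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

sizes = [len(test) for _, test in kfold_indices(150)]
print(sizes)  # ten folds of 15 instances each
```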

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software; the results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed and usually don't grow back.

Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)

• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died within 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances. We removed the 16 instances with missing values (14 benign, 2 malignant) to construct a new dataset with 683 instances.
• Class distribution after removal: benign 444 (65%), malignant 239 (35%). (The often-quoted distribution of benign 458 (65.5%) and malignant 241 (34.5%) refers to the full 699 instances.)

  Attribute                     Domain
  Sample Code Number            id number
  Clump Thickness               1 - 10
  Uniformity of Cell Size       1 - 10
  Uniformity of Cell Shape      1 - 10
  Marginal Adhesion             1 - 10
  Single Epithelial Cell Size   1 - 10
  Bare Nuclei                   1 - 10
  Bland Chromatin               1 - 10
  Normal Nucleoli               1 - 10
  Mitoses                       1 - 10
  Class                         2 for benign, 4 for malignant

EVALUATION METHODS

• We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

(Figures: experimental setup and result screens)

Importance of the input variables: frequency of each domain value (1-10) per attribute

  Attribute                     1    2    3    4    5    6    7    8    9    10   Sum
  Clump Thickness               139  50   104  79   128  33   23   44   14   69   683
  Uniformity of Cell Size       373  45   52   38   30   25   19   28   6    67   683
  Uniformity of Cell Shape      346  58   53   43   32   29   30   27   7    58   683
  Marginal Adhesion             393  58   58   33   23   21   13   25   4    55   683
  Single Epithelial Cell Size   44   376  71   48   39   40   11   21   2    31   683
  Bare Nuclei                   402  30   28   19   30   4    8    21   9    132  683
  Bland Chromatin               150  160  161  39   34   9    71   28   11   20   683
  Normal Nucleoli               432  36   42   18   19   22   16   23   15   60   683
  Mitoses                       563  35   33   12   6    3    9    8    0    14   683
  Sum                           2843 850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

  Evaluation Criteria                 BF Tree   IBK     SMO
  Time to build model (sec)           0.97      0.02    0.33
  Correctly classified instances      652       655     657
  Incorrectly classified instances    31        28      26
  Accuracy (%)                        95.46     95.90   96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

  Classifier   Class       TP Rate   FP Rate   Precision   Recall
  BF Tree      Benign      0.971     0.075     0.960       0.971
               Malignant   0.925     0.029     0.944       0.925
  IBK          Benign      0.980     0.079     0.958       0.980
               Malignant   0.921     0.020     0.961       0.921
  SMO          Benign      0.971     0.054     0.971       0.971
               Malignant   0.946     0.029     0.946       0.946

Confusion matrices (rows: actual class; columns: predicted benign / malignant)

  BF Tree    Benign      431   13
             Malignant   18    221
  IBK        Benign      435   9
             Malignant   19    220
  SMO        Benign      431   13
             Malignant   13    226
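The headline numbers can be re-derived from SMO's confusion matrix using the metric definitions above. A quick check in Python (benign taken as the positive class; variable names are invented):

```python
# Recomputing SMO's metrics from its confusion matrix:
# 431 benign and 226 malignant instances classified correctly,
# 13 misclassified in each direction, 683 instances in total.
tp, fn = 431, 13   # benign correctly / wrongly classified
fp, tn = 13, 226   # malignant wrongly / correctly classified

sensitivity = tp / (tp + fn)                  # benign recall
specificity = tn / (tn + fp)                  # malignant recall
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"accuracy = {accuracy:.4f}")  # -> accuracy = 0.9619 (657/683)
```

The derived values agree with the tables above: sensitivity 0.971, specificity 0.946, accuracy 96.19%.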

Importance of the input variables

  Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
  Clump Thickness               378.08158     0.464       0.152        126.232526     8
  Uniformity of Cell Size       539.79308     0.702       0.300        180.265026     1
  Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
  Marginal Adhesion             390.05950     0.464       0.210        130.244500     7
  Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
  Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
  Bland Chromatin               453.20971     0.555       0.201        151.321903     4
  Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
  Mitoses                       191.96820     0.212       0.212        64.122733      9

CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 14: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Pattern recognition system decomposition

1. Dataset selection
2. Data preprocessing
3. Feature selection
4. Selection of the data mining tool
5. Classification algorithm (SMO, IBK, BF Tree)
6. Results and evaluation

[Diagram: performance evaluation cycle: dataset → data preprocessing → feature selection → classification → results]

Data sets

Data Mining

• Data Mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit the data to a model, which may be descriptive or predictive.

Predictive amp descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

[Diagram: data mining models and tasks]

Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing
• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data;
  – noisy: containing errors or outliers;
  – inconsistent: containing discrepancies in codes or names.
• Quality decisions must be based on quality data. Quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization: part of data reduction, of particular importance for numerical data.
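As a concrete illustration of the data-cleaning step, the sketch below (plain Python, outside Weka; the toy records are made up) drops rows that contain missing values, which is the same treatment later applied to breast-cancer-wisconsin, where 16 incomplete instances were removed (699 → 683):

```python
# Minimal data-cleaning sketch: drop any record that contains a
# missing-value marker (the records below are illustrative only).

def drop_incomplete(rows, missing="?"):
    """Return only the rows that contain no missing values."""
    return [r for r in rows if missing not in r]

records = [
    [5, 1, 1, "benign"],
    [8, "?", 4, "malignant"],   # has a missing attribute -> dropped
    [3, 2, 1, "benign"],
]

clean = drop_incomplete(records)
print(len(clean))  # 2 rows remain
```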


Feature selection
Finding a feature subset that has the most discriminative information from the original feature space.
The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.

Feature Selection
• Transforming a dataset by removing some of its columns, e.g. from attributes A1, A2, A3, A4 and class C, keeping only A2, A4, and C.
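The column-removal example above (A1, A2, A3, A4, C → A2, A4, C) can be sketched as follows; the index positions and sample values are assumptions for illustration:

```python
# Project a dataset onto a selected subset of its columns.

def select_columns(rows, keep):
    """Keep only the column indices listed in `keep`, in order."""
    return [[row[i] for i in keep] for row in rows]

header = ["A1", "A2", "A3", "A4", "C"]
data = [[1, 7, 0, 4, "yes"],
        [2, 5, 3, 9, "no"]]

keep = [1, 3, 4]                       # A2, A4, C
print(select_columns([header], keep))  # [['A2', 'A4', 'C']]
print(select_columns(data, keep))      # [[7, 4, 'yes'], [5, 9, 'no']]
```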


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on a model built from a training set of known categories.

[Diagram: samples labelled category "A" and category "B": classification (recognition), i.e. supervised classification]

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.

Classification vs. Prediction
• Classification: predicts categorical class labels (discrete or nominal); classifies data by constructing a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
• Prediction: models continuous-valued functions, i.e. predicts unknown or missing values.

Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
  – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model.
  – The test set is independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

The classification algorithm learns a classifier (model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal);
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.

Quality of a classifier
• Quality is also judged with respect to computing time: the lower, the better.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.
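A minimal sketch of building such a confusion matrix for the two-class case (the labels and predictions below are made-up illustrations, not results from the paper); rows are the predicted class and columns the actual class, as in the slide:

```python
# Build a row = predicted, column = actual confusion matrix.

def confusion_matrix(actual, predicted, labels):
    m = {p: {a: 0 for a in labels} for p in labels}
    for a, p in zip(actual, predicted):
        m[p][a] += 1
    return m

actual    = ["benign", "benign", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign"]

m = confusion_matrix(actual, predicted, ["benign", "malignant"])
print(m["benign"]["benign"])     # 2 benign correctly classified
print(m["malignant"]["benign"])  # 1 benign misclassified as malignant
```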

Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques
[Diagram: classification techniques: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK]

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik).

Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example
[Scatter plot: humidity vs. temperature; points marked "play tennis" and "do not play tennis"]

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
  – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
• The line ax + by - c = 0 represents the decision boundary.
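A linear decision boundary of the form ax + by - c = 0 classifies a 2-D point by the sign of a·x + b·y - c. A minimal sketch (the coefficient values are made up for illustration):

```python
# Classify a point by which side of the line a*x + b*y - c = 0 it lies on.

def classify(point, a, b, c):
    """Return +1 or -1 depending on the side of the boundary."""
    x, y = point
    return 1 if a * x + b * y - c >= 0 else -1

a, b, c = 1.0, 1.0, 10.0           # boundary: x + y = 10
print(classify((8, 7), a, b, c))   # 1  (above the line)
print(classify((2, 3), a, b, c))   # -1 (below the line)
```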

Selection of a Good Hyper-Plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

SVM – Support Vector Machines
[Diagram: support vectors; a small-margin vs. a large-margin separating hyperplane]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Diagram: support vectors; the maximized margin vs. a narrower margin]

Non-Separable Case
[Diagram: the Lagrangian trick; slack allows some points to violate the margin]

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor classifier.

K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• calculate the distance between E and all examples in the training set;
• select the K examples nearest to E in the training set;
• assign E to the most common class among its K nearest neighbors.
[Diagram: points labelled "response" / "no response"; the new point is assigned the class "response"]

Distance Between Neighbors
Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum over i = 1..n of (xi - yi)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Diagram: points labelled "response" / "no response"; the new point is assigned the class "response"]

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John | 35 | 35K | 3 | No
Rachel | 22 | 50K | 2 | Yes
Hannah | 63 | 200K | 1 | No
Tom | 59 | 170K | 1 | No
Nellie | 25 | 40K | 4 | Yes
David | 37 | 50K | 2 | ?

Distances from David:
John: sqrt[(35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2] = 15.16
Rachel: sqrt[(22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2] = 15
Hannah: sqrt[(63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2] = 152.23
Tom: sqrt[(59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2] = 122
Nellie: sqrt[(25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David is classified as Yes.
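The 3-NN worked example above can be sketched directly in code; this reproduces the classification of David from the five labelled customers using Euclidean distance over (age, income in K, number of credit cards):

```python
import math
from collections import Counter

# Training customers: (age, income in K, no. of credit cards) -> response
training = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(example, data, k=3):
    """Majority vote among the k training examples nearest to `example`."""
    nearest = sorted(data, key=lambda item: euclidean(example, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
print(knn_classify(david, training))  # Yes
```

The three nearest neighbors are Rachel (15), John (15.17), and Nellie (15.75), giving two "Yes" votes against one "No".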

Strengths and Weaknesses
Strengths:
• simple to implement and use;
• comprehensible: easy to explain the prediction;
• robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• needs a lot of space to store all examples;
• takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).

Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  – small box office: $200,000
  – medium box office: $1,000,000
  – large box office: $3,000,000
• TV network payout:
  – flat rate: $900,000
• Probabilities:
  – P(small box office) = 0.3
  – P(medium box office) = 0.6
  – P(large box office) = 0.1

Jenny Lind - Payoff Table

Decision | Small Box Office | Medium Box Office | Large Box Office
Sign with movie company | $200,000 | $1,000,000 | $3,000,000
Sign with TV network | $900,000 | $900,000 | $900,000
Prior probabilities | 0.3 | 0.6 | 0.1

Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
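The expected-value computation above can be sketched as a short script (the figures are taken straight from the payoff table):

```python
# Expected value of each contract option under the given probabilities.

def expected_value(payoffs, probabilities):
    return sum(v * p for v, p in zip(payoffs, probabilities))

probs = [0.3, 0.6, 0.1]  # small, medium, large box office
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)
ev_tv    = expected_value([900_000, 900_000, 900_000], probs)

print(round(ev_movie))  # 960000
print(round(ev_tv))     # 900000
print("movie" if ev_movie > ev_tv else "tv")  # movie
```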

Decision Trees
• Three types of "nodes":
  – decision nodes, represented by squares;
  – chance nodes, represented by circles;
  – terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree
[Diagram: a decision node (square) branching into Decision 1 and Decision 2; a chance node (circle) branching into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree
[Diagram: a decision node with two branches, "sign with movie co." and "sign with TV network"; each leads to a chance node with outcomes small / medium / large box office and payoffs $200,000 / $1,000,000 / $3,000,000 and $900,000 / $900,000 / $900,000]

Jenny Lind Decision Tree
[Same diagram, with branch probabilities 0.3 / 0.6 / 0.1 attached to the chance outcomes; the expected return (ER) of each chance node is to be computed]

Jenny Lind Decision Tree - Solved
[Same diagram, solved: ER = $960,000 for the movie contract and ER = $900,000 for the TV network, so the movie branch is selected (ER = $960,000)]


Evaluation Metrics

 | Predicted as healthy | Predicted as unhealthy
Actual healthy | tp | fn
Actual not healthy | fp | tn

Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – split the data into 10 equal-sized pieces;
  – train on 9 pieces and test on the remainder;
  – do this for all possibilities and average.
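The 10-fold splitting procedure described above can be sketched as follows (the dataset size of 150 is illustrative; fold shuffling and stratification, which Weka applies by default, are omitted):

```python
# Generate the train/test index splits for k-fold cross-validation.

def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

folds = list(k_fold_indices(150, 10))
print(len(folds))        # 10 train/test splits
print(len(folds[0][1]))  # 15 test instances per fold
print(len(folds[0][0]))  # 135 training instances per fold
```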

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software; the comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed, and usually don't grow back.
Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used RepTree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign: 458 (65.5%); malignant: 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so the percentages above are wrong; the correct distribution is benign: 444 (65%) and malignant: 239 (35%).

Attribute | Domain
Sample code number | id number
Clump thickness | 1-10
Uniformity of cell size | 1-10
Uniformity of cell shape | 1-10
Marginal adhesion | 1-10
Single epithelial cell size | 1-10
Bare nuclei | 1-10
Bland chromatin | 1-10
Normal nucleoli | 1-10
Mitoses | 1-10
Class | 2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables

Domain | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum
Clump Thickness | 139 | 50 | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69 | 683
Uniformity of Cell Size | 373 | 45 | 52 | 38 | 30 | 25 | 19 | 28 | 6 | 67 | 683
Uniformity of Cell Shape | 346 | 58 | 53 | 43 | 32 | 29 | 30 | 27 | 7 | 58 | 683
Marginal Adhesion | 393 | 58 | 58 | 33 | 23 | 21 | 13 | 25 | 4 | 55 | 683
Single Epithelial Cell Size | 44 | 376 | 71 | 48 | 39 | 40 | 11 | 21 | 2 | 31 | 683
Bare Nuclei | 402 | 30 | 28 | 19 | 30 | 4 | 8 | 21 | 9 | 132 | 683
Bland Chromatin | 150 | 160 | 161 | 39 | 34 | 9 | 71 | 28 | 11 | 20 | 683
Normal Nucleoli | 432 | 36 | 42 | 18 | 19 | 22 | 16 | 23 | 15 | 60 | 683
Mitoses | 563 | 35 | 33 | 12 | 6 | 3 | 9 | 8 | 0 | 14 | 683
Sum | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation criteria | BF Tree | IBK | SMO
Time to build model (in sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46% | 95.90% | 96.19%

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS

Classifier | TP | FP | Precision | Recall | Class
BF Tree | 0.971 | 0.075 | 0.96 | 0.971 | benign
BF Tree | 0.925 | 0.029 | 0.944 | 0.925 | malignant
IBK | 0.98 | 0.079 | 0.958 | 0.98 | benign
IBK | 0.921 | 0.02 | 0.961 | 0.921 | malignant
SMO | 0.971 | 0.054 | 0.971 | 0.971 | benign
SMO | 0.946 | 0.029 | 0.946 | 0.946 | malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class; columns: predicted benign / malignant):

Classifier | Benign | Malignant | Class
BF Tree | 431 | 13 | benign
BF Tree | 18 | 221 | malignant
IBK | 435 | 9 | benign
IBK | 19 | 220 | malignant
SMO | 431 | 13 | benign
SMO | 13 | 226 | malignant
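The metric definitions from the previous slide can be applied directly to the SMO confusion matrix above (431 benign correct, 13 benign misclassified, 13 malignant misclassified, 226 malignant correct; taking benign as the positive class), which reproduces the reported 96.19% accuracy:

```python
# Evaluation metrics computed from the SMO confusion matrix.
tp, fn = 431, 13   # actual benign instances
fp, tn = 13, 226   # actual malignant instances

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 4))  # 0.9707
print(round(specificity, 4))  # 0.9456
print(round(accuracy, 4))     # 0.9619
```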

importance of the input variables

Variable | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness | 378.08158 | 0.464 | 0.152 | 126.232526 | 8
Uniformity of Cell Size | 539.79308 | 0.702 | 0.3 | 180.265026 | 1
Uniformity of Cell Shape | 523.07097 | 0.677 | 0.272 | 174.673323 | 2
Marginal Adhesion | 390.0595 | 0.464 | 0.21 | 130.2445 | 7
Single Epithelial Cell Size | 447.86118 | 0.534 | 0.233 | 149.542726 | 5
Bare Nuclei | 489.00953 | 0.603 | 0.303 | 163.305176 | 3
Bland Chromatin | 453.20971 | 0.555 | 0.201 | 151.321903 | 4
Normal Nucleoli | 416.63061 | 0.487 | 0.237 | 139.118203 | 6
Mitoses | 191.9682 | 0.212 | 0.212 | 64.122733 | 9

CONCLUSION
• The accuracy of classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on the paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24): 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 15: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

[Pipeline diagram, repeated throughout the deck to mark the current stage: Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Performance evaluation → Results]

Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.
• Objective: fit the data to a model
  – Descriptive
  – Predictive

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples (the training set), each belonging to one of a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.
• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

[Figure: taxonomy of data mining models and tasks, split into predictive and descriptive]

Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
  – Part of data reduction, of particular importance for numerical data
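Two of the cleaning steps named above, missing-value filling and normalization, can be sketched in a few lines of pure Python. The attribute values here are made up, but use the dataset's 1-10 scale.

```python
# Sketch: fill missing values with the column mean, then min-max scale to [0, 1].
def fill_missing(col):
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in col]

def min_max(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

clump_thickness = [1, 10, None, 4]      # hypothetical values on the 1-10 scale
filled = fill_missing(clump_thickness)  # -> [1, 10, 5.0, 4]
scaled = min_max(filled)                # -> [0.0, 1.0, 0.444..., 0.333...]
```

The same idea underlies the removal of the 16 missing-value instances later in the paper: imputation and deletion are the two standard answers to incomplete data.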


Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection

• Transforming a dataset by removing some of its columns:

A1 A2 A3 A4 C  →  A2 A4 C
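The column-dropping illustration above can be sketched directly; the attribute names are the slide's, the data values are made up.

```python
# Sketch: keep only the selected attributes A2, A4 and the class column C.
header = ["A1", "A2", "A3", "A4", "C"]
rows = [[5, 1, 3, 2, "benign"],
        [9, 7, 8, 6, "malignant"]]

keep = ["A2", "A4", "C"]
idx = [header.index(name) for name in keep]
reduced = [[row[i] for i in idx] for row in rows]
# reduced -> [[1, 2, 'benign'], [7, 6, 'malignant']]
```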


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built from the training set of known categories.

[Figure: two point clusters, Category "A" and Category "B"; classification (recognition), i.e., supervised classification]

Classification

• Every day, all the time, we classify things.
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not

Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction:
• models continuous-valued functions, i.e., predicts unknown or missing values

Classification: A Two-Step Process

1. Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae

2. Model usage: classifying future or unknown objects
• Estimate the accuracy of the model:
  – The known label of each test sample is compared with the classified result from the model
  – The accuracy rate is the percentage of test-set samples that are correctly classified by the model
  – The test set is independent of the training set; otherwise over-fitting will occur
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

The classification algorithm learns the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
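The learned rule can be checked mechanically against the training table; a minimal sketch in pure Python, with the table transcribed from the slide.

```python
# The slide's learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

training = [("Mike", "Assistant Prof", 3, "no"),
            ("Mary", "Assistant Prof", 7, "yes"),
            ("Bill", "Professor", 2, "yes"),
            ("Jim", "Associate Prof", 7, "yes"),
            ("Dave", "Assistant Prof", 6, "no"),
            ("Anne", "Associate Prof", 3, "no")]

# The rule fits every training example.
assert all(tenured(rank, years) == label for _, rank, years, label in training)
print(tenured("Professor", 4))  # the unseen example (Jeff, Professor, 4) -> yes
```

On the testing data, the same rule misclassifies Merlisa (Associate Prof, 7 years, not tenured), which is exactly why accuracy is estimated on an independent test set.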

Classification

• is a data mining (machine learning) technique used to predict group membership for data instances
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier

• Quality is evaluated with respect to computing time (lower is better).
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification techniques include: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.

Tennis example

[Scatter plot: Humidity vs. Temperature, points marked "play tennis" or "do not play tennis"]

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c in the decision boundary ax + by - c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
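A linear decision boundary of the form above is just a sign test; a toy sketch, where the coefficients and the points are made up for illustration.

```python
# Sketch: classify a 2-D point by which side of the line a*x + b*y - c = 0 it falls on.
def classify(point, a, b, c):
    x, y = point
    return "play" if a * x + b * y - c > 0 else "do not play"

a, b, c = 1.0, 1.0, 10.0            # hypothetical hyperplane x + y = 10
print(classify((3, 4), a, b, c))    # 3 + 4 - 10 < 0 -> "do not play"
print(classify((8, 6), a, b, c))    # 8 + 6 - 10 > 0 -> "play"
```

Many (a, b, c) triples separate the same training data; the SVM picks the one with the largest margin.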

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.

SVM – Support Vector Machines

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the training points lying on the margin]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: the maximum-margin hyperplane with its support vectors, compared with a narrower margin]

Non-Separable Case

[Figure: classes that cannot be separated by a hyperplane; the optimization is handled via the Lagrangian trick]

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: points labeled "Response" / "No response"; the new example falls among "Response" neighbors, so its class is Response]

Distance Between Neighbors

• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

  D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) vs. Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: the same "Response" / "No response" scatter; the new example is classified as Response]

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Distances from David:
John:   sqrt[ (35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2 ] = 15.16
Rachel: sqrt[ (22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2 ] = 15
Hannah: sqrt[ (63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2 ] = 152.23
Tom:    sqrt[ (59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2 ] = 122.00
Nellie: sqrt[ (25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2 ] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is Yes.
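The worked example above can be reproduced directly; a minimal 3-NN sketch using the slide's five labeled customers (attributes: age, income in K, number of credit cards).

```python
import math

# Training examples from the slide's table.
train = {
    "John":   ((35, 35, 3),  "No"),
    "Rachel": ((22, 50, 2),  "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4),  "Yes"),
}
david = (37, 50, 2)

def dist(x, y):
    # Euclidean distance, as defined on the previous slide.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

nearest = sorted(train, key=lambda name: dist(train[name][0], david))[:3]
votes = [train[name][1] for name in nearest]
prediction = max(set(votes), key=votes.count)
print(nearest, prediction)  # ['Rachel', 'John', 'Nellie'] Yes
```

This also makes the "no model" point concrete: all the work (distance computation and voting) happens at classification time.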

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible – easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples)

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. A set of training examples is broken down into smaller and smaller subsets while an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1

Jenny Lind – Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with movie company | $200,000         | $1,000,000        | $3,000,000
Sign with TV network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
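The expected-return arithmetic above, as a runnable sketch:

```python
# Sketch: expected value of each decision as a probability-weighted sum of payouts.
def expected_value(payouts, probs):
    return sum(x * p for x, p in zip(payouts, probs))

probs = [0.3, 0.6, 0.1]  # small, medium, large box office
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)
ev_tv = expected_value([900_000, 900_000, 900_000], probs)
best = "movie" if ev_movie > ev_tv else "tv"
print(round(ev_movie), round(ev_tv), best)  # 960000 900000 movie
```

This is exactly the computation performed at each chance node when the decision tree below is solved from right to left.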

Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Diagram: a decision node branching to Decision 1 and Decision 2; Decision 1 leads to a chance node with branches Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Diagram: a decision node chooses between "Sign with Movie Co." and "Sign with TV Network"; each choice leads to a chance node over Small (0.3), Medium (0.6), and Large (0.1) box office, paying $200,000 / $1,000,000 / $3,000,000 on the movie branches and $900,000 on every TV branch; an ER value is to be computed at each chance node]

Jenny Lind Decision Tree – Solved

[Diagram: the same tree with expected returns computed: ER = $960,000 at the movie chance node and ER = $900,000 at the TV chance node, so the decision node takes the movie branch with ER = $960,000]


Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
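The fold-splitting described above can be sketched in a few lines (indices only; the 150-instance count mirrors the counts on this slide). The key property is that every instance is held out for testing exactly once.

```python
# Sketch: yield (train, test) index lists for k-fold cross-validation.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

n = 150  # e.g. 150 instances, matching 143 + 7 above
covered = sorted(j for _, test in k_fold_indices(n) for j in test)
print(covered == list(range(n)))  # True: each instance is tested exactly once
```

In practice the folds are usually shuffled (and often stratified by class) before splitting; the round-robin assignment here is just the simplest deterministic version.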

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software; the comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
• Bellaachi et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, due to the fact that they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography 1 and cardiotocography 2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• The breast-cancer-Wisconsin dataset has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages are wrong for the reduced dataset; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                   | Domain
Sample Code Number          | id number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

• We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Distribution of attribute values (number of instances with each value, 1-10):

Attribute                   | 1    | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness             | 139  | 50  | 104 | 79  | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size     | 373  | 45  | 52  | 38  | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape    | 346  | 58  | 53  | 43  | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion           | 393  | 58  | 58  | 33  | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size | 44   | 376 | 71  | 48  | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                 | 402  | 30  | 28  | 19  | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin             | 150  | 160 | 161 | 39  | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli             | 432  | 36  | 42  | 18  | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                     | 563  | 35  | 33  | 12  | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

(The original slide labeled the seventh row "Bare Nuclei" a second time; from the attribute list it is Bland Chromatin.)

EXPERIMENTAL RESULTS

Evaluation criteria                   | BF Tree | IBK   | SMO
Time to build model (seconds)         | 0.97    | 0.02  | 0.33
Correctly classified instances        | 652     | 655   | 657
Incorrectly classified instances      | 31      | 28    | 26
Accuracy (%)                          | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
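Plugging the paper's BF Tree confusion matrix (431 / 13 / 18 / 221, with benign treated as the positive class) into these formulas reproduces the reported 95.46% accuracy; a sketch:

```python
# Sketch: sensitivity, specificity, and accuracy from confusion-matrix counts.
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# BF Tree counts from the confusion-matrix table: TP=431, FN=13, FP=18, TN=221.
sens, spec, acc = metrics(431, 13, 18, 221)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.971 0.925 0.9546
```

The same three lines reproduce the IBK and SMO rows of the results table from their confusion matrices.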

EXPERIMENTAL RESULTS

Classifier | TP rate | FP rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
IBK        | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class; columns: predicted benign / malignant):

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
BF Tree    | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
IBK        | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
SMO        | 13     | 226       | Malignant

Importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Importance rank
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.210      | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9

(The "Average" column is the mean of the three scores; ranks sort the attributes by that average, with Uniformity of Cell Size most important.)
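To make the Info Gain column concrete, here is a sketch of how such a score is computed: the entropy of the class label minus the expected entropy after splitting on one attribute. The tiny dataset below is made up for illustration; it is not the paper's 683-instance dataset.

```python
import math

# Sketch: information gain of an attribute with respect to the class label.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

values = [1, 1, 10, 10]  # a perfectly informative attribute
labels = ["benign", "benign", "malignant", "malignant"]
print(info_gain(values, labels))  # 1.0: the split removes all class uncertainty
```

An attribute like Uniformity of Cell Size scores high because its values split the 683 instances into comparatively pure benign/malignant subsets.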

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka
• Use another data mining tool
• Use alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge-based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory. 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105

data sets

[Diagram: performance evaluation cycle - dataset → data preprocessing → feature selection / classification → results]

Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.

• Objective: fit the data to a model, which may be descriptive or predictive.

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of known classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.

• Descriptive: describes the general or distinguishing features of a set of data in a concise manner.

Data Mining Models and Tasks

Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

weka

• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
  - Part of data reduction, of particular importance for numerical data
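As an illustration of the cleaning and transformation steps above, here is a minimal sketch (not from the paper) of two common operations: mean imputation for missing values and min-max normalization. The attribute column values are hypothetical.

```python
# Illustrative sketch (not from the paper): mean imputation for missing
# values, followed by min-max normalization into [0, 1].

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

clump_thickness = [5, 3, None, 8, 1]   # hypothetical attribute column
filled = impute_mean(clump_thickness)
scaled = min_max_normalize(filled)
```

In a real pipeline these steps would be applied per attribute before classification, so that attributes on larger scales do not dominate distance-based methods such as KNN.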


Feature selection

Finding a feature subset that carries the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection

• Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C  →  A2 A4 C
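The column-removal view of feature selection shown above can be sketched in a few lines (an illustration, not the paper's code; the row values are hypothetical):

```python
# Illustrative sketch (not from the paper): feature selection as column
# removal - keeping only columns A2, A4, and the class column C.

rows = [
    # A1  A2  A3  A4  C      (hypothetical values)
    [1,  7,  2,  9, "benign"],
    [4,  3,  8,  2, "malignant"],
]
keep = [1, 3, 4]  # column indices of A2, A4, C

reduced = [[row[i] for i in keep] for row in rows]
```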


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on the model built from the training set of known categories.

[Diagram: two labeled categories, "A" and "B" - classification (recognition), i.e., supervised classification]

Classification

• Every day, all the time, we classify things.
• E.g., crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not

Classification vs Prediction

• Classification:
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values

Classification - A Two-Step Process

• Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model
  - The test set is independent of the training set, otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training Data:

NAME    RANK            YEARS  TENURED
Mike    Assistant Prof  3      no
Mary    Assistant Prof  7      yes
Bill    Professor       2      yes
Jim     Associate Prof  7      yes
Dave    Assistant Prof  6      no
Anne    Associate Prof  3      no

Classifier (Model) produced by the classification algorithm:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
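The learned rule above can be written as a tiny classifier and applied to the unseen tuple (an illustrative sketch, not the slide's actual tooling):

```python
# Illustrative sketch (not from the paper): the slide's learned rule as a
# classifier, applied to the unseen tuple (Jeff, Professor, 4).

def tenured(rank, years):
    """Rule learned from the training set: professors, or anyone with
    more than 6 years of service, are predicted tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

prediction = tenured("Professor", 4)   # Jeff
```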

Classification

• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier

• Quality is also judged by computing time: lower is better.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example

[Scatter plot: Humidity vs. Temperature; points marked "play tennis" vs. "do not play tennis"]

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

This line represents the decision boundary: ax + by - c = 0

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane 'far' from the data
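Classifying with the linear decision boundary ax + by - c = 0 from the slide can be sketched as follows (an illustration; the coefficients are hypothetical, not learned from any dataset):

```python
# Illustrative sketch (not from the paper): classifying 2-D points against
# a linear decision boundary ax + by - c = 0.

def classify(point, a, b, c):
    """Return +1 if the point lies on the positive side of the
    hyperplane ax + by - c = 0, else -1."""
    x, y = point
    return 1 if a * x + b * y - c > 0 else -1

# Hypothetical boundary x + y - 10 = 0: points above/right are class +1.
side = classify((8, 6), a=1, b=1, c=10)
```

SVM training then amounts to choosing a, b, c so that the margin to the nearest training points (the support vectors) is maximized.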

SVM - Support Vector Machines

[Diagram: support vectors; small margin vs. large margin]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Diagram: support vectors; maximized margin vs. narrower margin]

Non-Separable Case

The Lagrangian trick

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn - learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set
• Select the K examples nearest to E in the training set
• Assign E to the most common class among its K nearest neighbors

[Diagram: points labeled "response" / "no response"; the new point is assigned class "response"]

Distance Between Neighbors

Each example is represented with a set of numerical attributes.

"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1..n} (x_i - y_i)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 - 41)² + (95 - 215)² + (3 - 2)²]
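The distance above can be computed directly (an illustrative sketch, with income expressed in thousands as on the slide):

```python
# Illustrative sketch (not from the paper): the slide's Euclidean distance,
# applied to John (35, 95K, 3) and Rachel (41, 215K, 2).
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)
rachel = (41, 215, 2)
d = euclidean(john, rachel)   # sqrt(36 + 14400 + 1)
```

Note how the raw income difference dominates the sum; this is why normalization (see the preprocessing slides) matters for distance-based classifiers.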

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Diagram: points labeled "response" / "no response"; the new point is assigned class "respond"]

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)² + (35-50)² + (3-2)²] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)² + (50-50)² + (2-2)²] = 15
Hannah    63   200         1          No        sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.23
Tom       59   170         1          No        sqrt[(59-37)² + (170-50)² + (1-2)²] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)² + (40-50)² + (4-2)²] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes → David's predicted response: Yes
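The worked 3-NN example above can be reproduced in a few lines (an illustrative sketch of the algorithm, not the paper's code):

```python
# Illustrative sketch (not from the paper): 3-NN classification of David
# using the slide's toy customer data (income in thousands).
import math
from collections import Counter

training = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort the training examples by distance to David and take the 3 nearest.
nearest = sorted(training, key=lambda ex: dist(ex[1], david))[:3]
prediction = Counter(label for _, _, label in nearest).most_common(1)[0][0]
```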

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible - easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples)

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

                          States of Nature
Decisions                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior probabilities       0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
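The expected-value computation above can be sketched directly (an illustration of the criterion, not the slide's tooling):

```python
# Illustrative sketch (not from the paper): expected value of each contract
# under the slide's payoffs and prior probabilities.

probs = [0.3, 0.6, 0.1]                          # small, medium, large box office
movie_payoffs = [200_000, 1_000_000, 3_000_000]
tv_payoffs = [900_000, 900_000, 900_000]

def expected_value(probs, payoffs):
    """Probability-weighted average payoff."""
    return sum(p * v for p, v in zip(probs, payoffs))

ev_movie = expected_value(probs, movie_payoffs)
ev_tv = expected_value(probs, tv_payoffs)
best = "movie" if ev_movie > ev_tv else "tv"
```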

Decision Trees

• Three types of "nodes":
  - Decision nodes - represented by squares
  - Chance nodes - represented by circles
  - Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Diagram: a decision node branching to Decision 1 and Decision 2; a chance node branching to Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Diagram: decision node "Sign with Movie Co." leads to a chance node with branches Small ($200,000), Medium ($1,000,000), and Large ($3,000,000) box office; "Sign with TV Network" leads to a chance node paying $900,000 in all three cases]

Jenny Lind Decision Tree

[Diagram: the same tree with branch probabilities 0.3, 0.6, 0.1 and expected-return (ER) labels at the chance nodes]

Jenny Lind Decision Tree - Solved

[Diagram: solved tree; ER(movie) = $960,000, ER(TV) = $900,000, so the movie branch is kept]


Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn

Cross-validation

• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
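The 10-fold procedure above can be sketched in plain Python. This is an illustration only (the paper uses Weka): the "classifier" here is a stand-in majority-class predictor, and the label list is a toy stand-in for a real dataset.

```python
# Illustrative sketch (not from the paper): plain 10-fold cross-validation
# with a stand-in majority-class "classifier" on toy labels.
from collections import Counter

def k_fold_accuracy(labels, k=10):
    """Split labels into k folds; fit a majority-class predictor on the
    other k-1 folds, score it on the held-out fold, and average."""
    folds = [labels[i::k] for i in range(k)]
    accs = []
    for i in range(k):
        train = [y for j, fold in enumerate(folds) if j != i for y in fold]
        majority = Counter(train).most_common(1)[0][0]
        test = folds[i]
        accs.append(sum(y == majority for y in test) / len(test))
    return sum(accs) / k

labels = ["benign"] * 65 + ["malignant"] * 35   # hypothetical class labels
acc = k_fold_accuracy(labels, k=10)
```

With a real classifier, `majority` would be replaced by a model fit on the training folds; the fold/average structure is unchanged.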


A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, along with the comparison results.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  - Are usually not harmful
  - Rarely invade the tissues around them
  - Don't spread to other parts of the body
  - Can be removed, and usually don't grow back
• Malignant tumors:
  - May be a threat to life
  - Can invade nearby organs and tissues (such as the chest wall)
  - Can spread to other parts of the body
  - Often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.
• Bellaachia et al. used naïve Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiocography1 and cardiocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                    Domain
Sample Code Number           Id Number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for Benign, 4 for Malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Charts: comparative results of the three classifiers]

importance of the input variables

Value counts per attribute (domain values 1-10):

Attribute                    1    2    3    4    5    6    7    8    9    10   Sum
Clump Thickness              139  50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size      373  45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape     346  58   53   43   32   29   30   27   7    58   683
Marginal Adhesion            393  58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size  44   376  71   48   39   40   11   21   2    31   683
Bare Nuclei                  402  30   28   19   30   4    8    21   9    132  683
Bland Chromatin              150  160  161  39   34   9    71   28   11   20   683
Normal Nucleoli              432  36   42   18   19   22   16   23   15   60   683
Mitoses                      563  35   33   12   6    3    9    8    0    14   683
Sum                          2843 850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

Evaluation Criteria                Classifiers
                                   BF Tree  IBK    SMO
Time to build model (in sec)       0.97     0.02   0.33
Correctly classified instances     652      655    657
Incorrectly classified instances   31       28     26
Accuracy (%)                       95.46    95.90  96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
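The definitions above can be checked against the paper's SMO confusion matrix (431/13 benign, 13/226 malignant), treating benign as the positive class; the sketch below is an illustration, not the paper's code:

```python
# Illustrative sketch (not from the paper's code): accuracy, sensitivity,
# and specificity from SMO's confusion matrix, benign as positive class.

tp, fn = 431, 13   # benign correctly / wrongly predicted
fp, tn = 13, 226   # malignant wrongly / correctly predicted

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
```

The resulting accuracy of about 0.9619 matches the 96.19% reported for SMO, and the sensitivity/specificity match its per-class TP rates (0.971 benign, 0.946 malignant).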

EXPERIMENTAL RESULTS

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.96       0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.98   0.079  0.958      0.98    Benign
            0.921  0.02   0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class):

Classifier  Predicted Benign  Predicted Malignant  Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant

importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Rank (importance)
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.3         180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.0595     0.464      0.21        130.2445    7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.9682     0.212      0.212       64.122733   9

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka
• Use another data mining tool
• Use alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem," Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan, and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA. 185.

04072023105

Thank you

AAST-Comp eng


[Diagram: performance evaluation cycle: dataset → data preprocessing → feature selection / classification → results, using the selected data mining tool]

Data Mining

• Data Mining is a set of techniques used in various domains to give meaning to the available data.

• Objective: fit the data to a model, either descriptive or predictive.

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples, called the training set, which belong to a set of known classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.

• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


Data Preprocessing

• Data in the real world is:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Measures of data quality include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
– Part of data reduction, of particular importance for numerical data
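Two of the steps above, filling a missing value (cleaning) and normalization (transformation), can be sketched with stdlib Python. The column values below are hypothetical, not taken from the dataset:

```python
def fill_missing(values):
    """Replace None entries with the mean of the observed values (data cleaning)."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Scale values linearly into [0, 1] (data transformation)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

column = [1, 10, None, 4]            # hypothetical attribute column with a gap
filled = fill_missing(column)        # None becomes the mean of 1, 10, 4 = 5.0
scaled = min_max_normalize(filled)   # 1 maps to 0.0, 10 maps to 1.0
```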


Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection

• Transforming a dataset by removing some of its columns:

A1 A2 A3 A4 C → A2 A4 C


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set of known categories.

[Figure: two labelled clusters, Category "A" and Category "B": classification (recognition, supervised classification)]

Classification

• Every day, all the time, we classify things.
• E.g., crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: safe to walk or not?

Classification vs. Prediction

• Classification:
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or missing values

Classification: A Two-Step Process

• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model
– The known label of each test sample is compared with the classified result from the model
– Accuracy rate is the percentage of test set samples that are correctly classified by the model
– The test set is independent of the training set, otherwise over-fitting will occur
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm learns the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
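The learned rule above (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') can be applied to the unseen tuple directly; a minimal sketch:

```python
def predict_tenured(rank, years):
    """Classifier (model) expressing the rule learned from the training set."""
    return "yes" if rank == "Professor" or years > 6 else "no"

prediction = predict_tenured("Professor", 4)  # unseen data: (Jeff, Professor, 4)
```

The rule predicts "yes" for Jeff, since his rank matches even though his years do not.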

Classification

• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

Quality of a classifier

• Quality is judged with respect to the lowest computing time.
• The quality of a model can be described by its confusion matrix.
• A confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.
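The row/column convention described above can be made concrete with a short stdlib sketch (the label lists are hypothetical):

```python
def confusion_matrix(actual, predicted, labels):
    """Build a matrix where rows = predicted class, columns = actual class."""
    index = {lab: i for i, lab in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        m[index[p]][index[a]] += 1
    return m

actual    = ["benign", "benign", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign"]
cm = confusion_matrix(actual, predicted, ["benign", "malignant"])
# diagonal entries hold the correctly classified samples
```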

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

Classification techniques include: Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)

Support Vector Machine (SVM)

• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example

[Figure: examples plotted by humidity and temperature, labelled "play tennis" vs. "do not play tennis"]

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

SVM – Support Vector Machines

[Figure: two separating hyperplanes with their support vectors, one with a small margin and one with a large margin]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: support vectors on a maximized margin vs. a narrower margin]

Non-Separable Case

[Figure: a non-separable data set and the Lagrangian trick]

SVM

• A relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
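Whatever training procedure is used, the resulting linear classifier predicts with the sign of the decision function f(x) = w·x − c, i.e. which side of the hyperplane ax + by − c = 0 a point falls on. A toy sketch with hand-picked (not trained) weights:

```python
def linear_decision(w, c, x):
    """Classify x by the sign of w.x - c (which side of the hyperplane)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - c
    return 1 if score >= 0 else -1

w, c = (1.0, 1.0), 3.0                        # hand-picked hyperplane: x + y - 3 = 0
above = linear_decision(w, c, (2.5, 2.5))     # point above the line -> class +1
below = linear_decision(w, c, (0.5, 0.5))     # point below the line -> class -1
```

An actual SVM would choose w and c to maximize the margin; this sketch only shows how the learned hyperplane is used at prediction time.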

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: a query point among "response" and "no response" neighbors; class: response]

Distance Between Neighbors

Each example is represented by a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1}^{n} (xi − yi)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]

Instance Based Learning

• No model is built: all training examples are stored.
• Any processing is delayed until a new instance must be classified.

[Figure: a query point among "response" and "no response" neighbors; class: respond]

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distance from David:
John:   sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel: sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah: sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom:    sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie: sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes, so David's predicted response is Yes.
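The worked example above is easy to reproduce with stdlib Python: recompute the Euclidean distances from David and take a majority vote among the 3 nearest neighbours.

```python
import math
from collections import Counter

# (age, income in K, no. of credit cards) -> response, from the table above
training = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, examples, k=3):
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

david = (37, 50, 2)
prediction = knn_predict(david, training)  # Rachel, John, Nellie are nearest -> "Yes"
```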

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decisions                 States of Nature
                          Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior probabilities       0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
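The expected-return arithmetic above is just a probability-weighted sum; a minimal check in Python:

```python
def expected_value(payoffs, probs):
    """Probability-weighted average payoff of one decision branch."""
    return sum(p * v for p, v in zip(probs, payoffs))

probs = (0.3, 0.6, 0.1)                    # small, medium, large box office
ev_movie = expected_value((200_000, 1_000_000, 3_000_000), probs)
ev_tv = expected_value((900_000, 900_000, 900_000), probs)
best = "movie" if ev_movie > ev_tv else "tv"
```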

Decision Trees

• Three types of "nodes":
– Decision nodes, represented by squares (□)
– Chance nodes, represented by circles (○)
– Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Figure: a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Figure: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node over Small, Medium, and Large Box Office, with payoffs $200,000 / $1,000,000 / $3,000,000 on the movie branch and $900,000 at every outcome on the TV branch]

Jenny Lind Decision Tree

[Figure: the same tree annotated with branch probabilities 0.3, 0.6, and 0.1 and an expected return (ER) to be computed at each chance node]

Jenny Lind Decision Tree - Solved

[Figure: the solved tree; the movie chance node evaluates to ER = $960,000 and the TV chance node to ER = $900,000, so the decision node takes ER = $960,000, i.e. sign with the movie company]


Evaluation Metrics

                     Predicted as healthy  Predicted as unhealthy
Actual healthy       tp                    fn
Actual not healthy   fp                    tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all possibilities and average
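The 10-fold procedure described above can be sketched with a plain index split (no shuffling or stratification shown):

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k nearly equal, disjoint folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(683)                    # cleaned Wisconsin dataset size
test_fold = folds[0]                           # hold out one fold for testing
train_idx = [i for f in folds[1:] for i in f]  # train on the other nine
```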

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, along with the comparison results.
• Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed, and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND

• Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Obtained from the UC Irvine machine learning repository.
• Data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                    Domain
Sample Code Number           ID number
Clump Thickness              1-10
Uniformity of Cell Size      1-10
Uniformity of Cell Shape     1-10
Marginal Adhesion            1-10
Single Epithelial Cell Size  1-10
Bare Nuclei                  1-10
Bland Chromatin              1-10
Normal Nucleoli              1-10
Mitoses                      1-10
Class                        2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables: distribution of attribute values

Domain                       1    2    3    4   5   6   7   8   9   10   Sum
Clump Thickness              139  50   104  79  128 33  23  44  14  69   683
Uniformity of Cell Size      373  45   52   38  30  25  19  28  6   67   683
Uniformity of Cell Shape     346  58   53   43  32  29  30  27  7   58   683
Marginal Adhesion            393  58   58   33  23  21  13  25  4   55   683
Single Epithelial Cell Size  44   376  71   48  39  40  11  21  2   31   683
Bare Nuclei                  402  30   28   19  30  4   8   21  9   132  683
Bland Chromatin              150  160  161  39  34  9   71  28  11  20   683
Normal Nucleoli              432  36   42   18  19  22  16  23  15  60   683
Mitoses                      563  35   33   12  6   3   9   8   0   14   683
Sum                          2843 850  605  333 346 192 207 233 77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                Classifiers
                                   BF Tree  IBK    SMO
Time to build model (s)            0.97     0.02   0.33
Correctly classified instances     652      655    657
Incorrectly classified instances   31       28     26
Accuracy (%)                       95.46    95.90  96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
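Applying these definitions to the BF Tree confusion matrix reported in this paper (431/13 benign, 18/221 malignant, taking benign as the positive class) reproduces the accuracy figure of 95.46%; a minimal sketch:

```python
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# BF Tree confusion matrix, benign as the positive class
sens, spec, acc = metrics(tp=431, fn=13, fp=18, tn=221)
```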

EXPERIMENTAL RESULTS

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.96       0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.98   0.079  0.958      0.98    Benign
            0.921  0.02   0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

EXPERIMENTAL RESULTS: confusion matrices

Classifier  Benign  Malignant  Class
BF Tree     431     13         Benign
            18      221        Malignant
IBK         435     9          Benign
            19      220        Malignant
SMO         431     13         Benign
            13      226        Malignant

Importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Importance (rank)
Clump Thickness              3780.8158    0.464      0.152       1262.32526  8
Uniformity of Cell Size      5397.9308    0.702      0.3         1802.65026  1
Uniformity of Cell Shape     5230.7097    0.677      0.272       1746.73323  2
Marginal Adhesion            3900.595     0.464      0.21        1302.445    7
Single Epithelial Cell Size  4478.6118    0.534      0.233       1495.42726  5
Bare Nuclei                  4890.0953    0.603      0.303       1633.05176  3
Bland Chromatin              4532.0971    0.555      0.201       1513.21903  4
Normal Nucleoli              4166.3061    0.487      0.237       1391.18203  6
Mitoses                      1919.682     0.212      0.212       641.22733   9
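The Info Gain column above is based on Shannon entropy; the sketch below shows the underlying computation on toy class counts (not the dataset's actual counts):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, children):
    """Entropy reduction achieved by splitting parent counts into children."""
    total = sum(parent)
    remainder = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - remainder

# Toy split: 8 benign / 8 malignant separated into two pure children
gain = info_gain([8, 8], [[8, 0], [0, 8]])  # 1.0 bit, the maximum possible
```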


CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt, 30 March - 1 April 2005.
[4] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). "Knowledge based analysis of various statistical tools in detecting breast cancer."
[5] Angeline Christobel Y. and Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods." International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya and K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets." 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. "Transductive inference for text classification using support vector machines." Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey and Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates." Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H. and Mangasarian, O. L. "Nuclear feature extraction for breast tumor diagnosis." Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A. and Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree." Journal of Neurocomputing, 70(1-3), 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York, 1999.
[19] Vapnik, V. N. The Nature of Statistical Learning Theory. 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 18: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

AAST-Comp eng 18

Data Mining

• Data mining is a set of techniques used in various domains to give meaning to the available data.

• Objective: fit the data to a model, either descriptive or predictive.

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of examples (the training set) which belong to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.

• Descriptive: describing the general or special features of a set of data in a concise manner.

Data Mining Models and Tasks

Data mining Tools

Many advanced tools for data mining are available either as open-source or commercial software


Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

[Figure: the methodology cycle - dataset → data preprocessing → feature selection → classification (with a selected data mining tool) → performance evaluation → results]

Data Preprocessing

• Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data;
  - noisy: containing errors or outliers;
  - inconsistent: containing discrepancies in codes or names.
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration: integration of multiple databases, data cubes, or files.
• Data transformation: normalization and aggregation.
• Data reduction: obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization: part of data reduction, with particular importance for numerical data.
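The cleaning and transformation steps above can be sketched in a few lines of Python; mean imputation plus min-max normalization is one common concrete choice (the toy values below are illustrative, not from the breast cancer dataset):

```python
def clean_and_normalize(rows):
    """Fill missing values (None) with the column mean, then min-max normalize."""
    cols = list(zip(*rows))
    out_cols = []
    for col in cols:
        known = [v for v in col if v is not None]
        mean = sum(known) / len(known)
        filled = [v if v is not None else mean for v in col]  # data cleaning
        lo, hi = min(filled), max(filled)
        # data transformation: scale every value into [0, 1]
        out_cols.append([(v - lo) / (hi - lo) if hi > lo else 0.0 for v in filled])
    return [list(r) for r in zip(*out_cols)]

print(clean_and_normalize([[1.0, 10.0], [None, 20.0], [3.0, None]]))
# [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```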


Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.

Feature Selection

• Transforming a dataset by removing some of its columns:

(A1, A2, A3, A4, C) → (A2, A4, C)
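The column removal sketched above is a one-line projection in Python (the attribute names A1-A4 and class C follow the slide; the row values are made up):

```python
def select_features(rows, header, keep):
    """Project each row onto the kept columns, e.g. A2, A4 and the class C."""
    idx = [header.index(name) for name in keep]
    return [[row[i] for i in idx] for row in rows]

header = ["A1", "A2", "A3", "A4", "C"]
rows = [[1, 2, 3, 4, "yes"], [5, 6, 7, 8, "no"]]
print(select_features(rows, header, ["A2", "A4", "C"]))
# [[2, 4, 'yes'], [6, 8, 'no']]
```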


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built from the training set of known categories.

[Figure: points grouped into Category "A" and Category "B" - classification (recognition), i.e. supervised classification]

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not?


Classification vs. Prediction
• Classification:
  - predicts categorical class labels (discrete or nominal);
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it in classifying new data.
• Prediction:
  - models continuous-valued functions, i.e. predicts unknown or missing values.

Classification - A Two-Step Process
• Model construction: describing a set of predetermined classes.
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
  - Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
  - The test set is independent of the training set, otherwise over-fitting will occur.
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

A classification algorithm learns the classifier (model), e.g.:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → tenured?
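The learned rule can be applied directly to the testing data above; note that it misclassifies Merlisa, which is exactly why accuracy is estimated on an independent test set. A minimal sketch:

```python
def predict_tenured(rank, years):
    # the rule learned in step 1: IF rank = 'professor' OR years > 6 THEN 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]
correct = sum(predict_tenured(rank, yrs) == label for _, rank, yrs, label in test_set)
print(correct / len(test_set))          # 0.75 (Merlisa is misclassified)
print(predict_tenured("Professor", 4))  # unseen data (Jeff) -> 'yes'
```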

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it in classifying unseen data

Quality of a classifier
• Quality will be calculated with respect to the lowest computing time.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds,
• and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

The classification techniques considered: Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example

[Figure: training points plotted by humidity vs. temperature; one symbol marks 'play tennis', the other 'do not play tennis']

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
  - one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

[Figure: the line ax + by - c = 0 represents the decision boundary]

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

SVM - Support Vector Machines

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the points lying on the margin are the support vectors]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: the support vectors and the maximized margin, contrasted with a narrower margin]

Non-Separable Case


The Lagrangian trick

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• calculate the distance between E and all examples in the training set;
• select the K nearest examples to E in the training set;
• assign E to the most common class among its K nearest neighbors.

[Figure: points labeled 'response' / 'no response'; the new point is assigned the class 'response']

Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35-41)^2 + (95K-215K)^2 + (3-2)^2 ]
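The distance above can be checked with a few lines of Python; note how the income term dominates the sum, which is one reason the normalization step from preprocessing matters for KNN:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95, 3)    # age, income in K, number of credit cards
rachel = (41, 215, 2)
print(round(euclidean(john, rachel), 2))  # 120.15
```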

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: points labeled 'response' / 'no response'; the new point's class is 'response']

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Customer | Age | Income (K) | No. cards | Response        | Distance from David
John     | 35  | 35         | 3         | No              | sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel   | 22  | 50         | 2         | Yes             | sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15.00
Hannah   | 63  | 200        | 1         | No              | sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom      | 59  | 170        | 1         | No              | sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.00
Nellie   | 25  | 40         | 4         | Yes             | sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David    | 37  | 50         | 2         | Yes (predicted) | -
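The 3-NN prediction for David can be reproduced directly (a minimal sketch; the features are age, income in K, and number of cards, as in the table):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; return the majority label of the k nearest."""
    dist = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((35, 35, 3), "No"), ((22, 50, 2), "Yes"), ((63, 200, 1), "No"),
         ((59, 170, 1), "No"), ((25, 40, 4), "Yes")]
print(knn_predict(train, (37, 50, 2)))  # David -> 'Yes'
```

The three nearest neighbors are Rachel (15.00, Yes), John (15.16, No), and Nellie (15.74, Yes), so the majority vote is Yes.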

Strengths and Weaknesses

Strengths:
• simple to implement and use;
• comprehensible: easy to explain the prediction;
• robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• needs a lot of space to store all examples;
• takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  - small box office: $200,000;
  - medium box office: $1,000,000;
  - large box office: $3,000,000.
• TV network payout:
  - flat rate: $900,000.
• Probabilities:
  - P(small box office) = 0.3;
  - P(medium box office) = 0.6;
  - P(large box office) = 0.1.


Jenny Lind - Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with movie company | $200,000         | $1,000,000        | $3,000,000
Sign with TV network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), i.e. EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
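The two expected values can be verified with a short computation:

```python
def expected_value(payoffs, probs):
    return sum(p * v for p, v in zip(probs, payoffs))

probs = [0.3, 0.6, 0.1]  # small, medium, large box office
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)
ev_tv = expected_value([900_000] * 3, probs)
print(round(ev_movie), round(ev_tv))  # 960000 900000
```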

Decision Trees
• Three types of "nodes":
  - decision nodes, represented by squares;
  - chance nodes, represented by circles;
  - terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Figure: a decision node (square) branches into Decision 1 and Decision 2; a chance node (circle) branches into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Figure: a decision node with two branches - 'Sign with Movie Co.' leads to a chance node with small ($200,000), medium ($1,000,000), and large ($3,000,000) box office outcomes; 'Sign with TV Network' leads to a chance node paying $900,000 for every outcome]

Jenny Lind Decision Tree

[Figure: the same tree annotated with branch probabilities 0.3, 0.6, and 0.1, and with an expected return (ER) to be computed at each chance node]

Jenny Lind Decision Tree - Solved

[Figure: the solved tree - ER = $960,000 for the movie contract and ER = $900,000 for the TV network, so the best decision (ER = $960,000) is to sign with the movie company]


Evaluation Metrics

                    | Predicted as healthy | Predicted as unhealthy
Actual: healthy     | tp                   | fn
Actual: not healthy | fp                   | tn

Cross-validation
• Correctly classified instances: 143 (95.3%).
• Incorrectly classified instances: 7 (4.67%).
• Default: 10-fold cross-validation, i.e.
  - split the data into 10 equal-sized pieces;
  - train on 9 pieces and test on the remainder;
  - do this for all possibilities and average.
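A minimal sketch of the 10-fold splitting described above (indices only; with the 683 instances of the dataset used later, the folds end up with 68 or 69 test instances each):

```python
def kfold_indices(n, k=10):
    """Split range(n) into k folds; yield (train_indices, test_indices) per fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

sizes = [len(test) for _, test in kfold_indices(683, 10)]
print(sizes, sum(sizes))  # three folds of 69 and seven of 68, covering all 683
```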


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.


Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed and usually don't grow back.

Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution over the 699 instances: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; for the 683-instance dataset the distribution is benign 444 (65.0%) and malignant 239 (35.0%).
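The corrected class distribution is easy to verify:

```python
benign, malignant = 458 - 14, 241 - 2  # after removing the 16 incomplete instances
total = benign + malignant
print(benign, malignant, total)  # 444 239 683
print(round(100 * benign / total, 1), round(100 * malignant / total, 1))  # 65.0 35.0
```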

Attribute                   | Domain
Sample code number          | id number
Clump Thickness             | 1-10
Uniformity of Cell Size     | 1-10
Uniformity of Cell Shape    | 1-10
Marginal Adhesion           | 1-10
Single Epithelial Cell Size | 1-10
Bare Nuclei                 | 1-10
Bland Chromatin             | 1-10
Normal Nucleoli             | 1-10
Mitoses                     | 1-10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables

Value frequencies per attribute (domain values 1-10):

Attribute                   |    1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             |  139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     |  373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    |  346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           |  393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |   44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 |  402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             |  150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             |  432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     |  563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation criteria              | BF Tree | IBK   | SMO
Time to build model (in seconds) | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
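These definitions translate directly into code; applying them to the reported SMO confusion matrix, with benign taken as the positive class, reproduces the 96.19% accuracy:

```python
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO confusion matrix from the experiments, benign as the positive class
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.971 0.946 0.9619
```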

EXPERIMENTAL RESULTS

Classifier | TP rate | FP rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
IBK        | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows are the actual class, columns the predicted class):

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
BF Tree    | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
IBK        | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
SMO        | 13     | 226       | Malignant
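Each reported accuracy can be re-derived from its confusion matrix:

```python
matrices = {  # [[benign right, benign wrong], [malignant wrong, malignant right]]
    "BF Tree": [[431, 13], [18, 221]],
    "IBK":     [[435, 9],  [19, 220]],
    "SMO":     [[431, 13], [13, 226]],
}
for name, ((tp, fn), (fp, tn)) in matrices.items():
    acc = (tp + tn) / (tp + fn + fp + tn)
    print(name, round(100 * acc, 2))  # 95.46, 95.9, 96.19 respectively
```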

Importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Rank
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.05950   | 0.464     | 0.210      | 130.244500 | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.96820   | 0.212     | 0.212      | 64.122733  | 9

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y. and Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods." International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya and K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Set." 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. "Transductive inference for text classification using support vector machines." Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), pp. 2195-2207, 2003.
[13] Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, and Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates." Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., and Mangasarian, O. L. "Nuclear feature extraction for breast tumor diagnosis." Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., and Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree." Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V. N. "The Nature of Statistical Learning Theory." 1st ed., Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). "C4.5: Programs for Machine Learning." Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

AAST-Comp eng

Page 19: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Predictive & descriptive data mining

• Predictive: the process of automatically creating a classification model from a set of labelled examples, called the training set, belonging to a set of classes. Once a model is created, it can be used to automatically predict the class of other, unclassified examples.

• Descriptive: describing the general or distinguishing features of a set of data in a concise manner.


Data Mining Models and Tasks


Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.


(Methodology cycle: dataset → selection tool (data mining) → data preprocessing → feature selection → classification → performance evaluation → results.)

Data Preprocessing

• Data in the real world is:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data;
– noisy: containing errors or outliers;
– inconsistent: containing discrepancies in codes or names.
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques

• Data cleaning – fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration – integration of multiple databases, data cubes, or files.
• Data transformation – normalization and aggregation.
• Data reduction – obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization – part of data reduction, of particular importance for numerical data.
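As a minimal sketch of the cleaning step above, records with a missing attribute value can simply be dropped, the way the paper later removes its 16 incomplete records (the values, and the use of None as the missing-value marker, are illustrative; the raw Wisconsin file marks missing cells with '?'):

```python
# Drop any record that has a missing attribute value (None here).
def drop_incomplete(rows):
    return [r for r in rows if all(v is not None for v in r)]

data = [[5, 1, 1, 2],       # complete record
        [8, 10, None, 4],   # record with a missing cell
        [4, 2, 1, 2]]       # complete record
clean = drop_incomplete(data)
print(len(clean))  # 2
```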


Feature selection

Finding a feature subset that carries the most discriminative information from the original feature space.

The objectives of feature selection are:
• improving the prediction performance of the predictors;
• providing faster and more cost-effective predictors;
• providing a better understanding of the underlying process that generated the data.

Feature Selection

• Transforming a dataset by removing some of its columns, e.g. (A1, A2, A3, A4, C) → (A2, A4, C).
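The slide's transformation can be sketched as a simple column projection (the values and column names are illustrative):

```python
# Project (A1, A2, A3, A4, C) down to the selected columns (A2, A4)
# plus the class column C, as in the slide's example.
def select_features(rows, keep):
    return [[row[i] for i in keep] for row in rows]

header = ["A1", "A2", "A3", "A4", "C"]
rows = [[1, 7, 0, 3, "yes"],
        [2, 5, 1, 9, "no"]]
keep = [header.index(name) for name in ("A2", "A4", "C")]
reduced = select_features(rows, keep)
print(reduced)  # [[7, 3, 'yes'], [5, 9, 'no']]
```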


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on the model built from the training set of known categories, e.g. category "A" vs category "B" (supervised classification / recognition).

Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: safe to walk or not.

Classification vs Prediction

Classification:
• predicts categorical class labels (discrete or nominal);
• classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.

Prediction:
• models continuous-valued functions, i.e. predicts unknown or missing values.

Classification: A Two-Step Process

1. Model construction: describing a set of predetermined classes.
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or mathematical formulae.

2. Model usage: classifying future or unknown objects.
• Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples correctly classified.
• The test set must be independent of the training set, otherwise over-fitting will occur.
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

The classification algorithm learns the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data (fed to the classifier):

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
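The two-step process above can be sketched directly: the learned rule is applied to the test set to estimate accuracy, then to the unseen tuple (Jeff, Professor, 4). Rank strings are compared case-insensitively so the rule's 'professor' matches the data's "Professor":

```python
# The rule from the model-construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"),
            ("Joseph", "Assistant Prof", 7, "yes")]
hits = sum(predict_tenured(rank, yrs) == label
           for _, rank, yrs, label in test_set)
print(hits, "of", len(test_set))        # Merlisa is the one miss
print(predict_tenured("Professor", 4))  # the unseen tuple -> "yes"
```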

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal);
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.

Quality of a classifier
• Quality is assessed partly by computing time: lower is better.
• The quality of a model can be described by its confusion matrix.
• The confusion matrix shows the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified instances,
• and the off-diagonal elements represent misclassified instances.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data


Classification Techniques

(Diagram: classification techniques include Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.)

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik).

Support Vector Machine (SVM)
 SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example

(Figure: scatter plot of humidity vs temperature, with points labelled "play tennis" vs "do not play tennis".)

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary;
– one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

(Figure: the decision boundary is the line ax + by − c = 0.)

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.
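The "far from the data" intuition can be sketched numerically for the boundary ax + by − c = 0 from the earlier slide: the distance from a point to that line is |ax + by − c| / sqrt(a² + b²), and the hyperplane's margin is set by the closest point (the boundary and points below are toy values, not from the paper):

```python
import math

# Point-to-line distance for the boundary a*x + b*y - c = 0.
def distance(a, b, c, x, y):
    return abs(a * x + b * y - c) / math.hypot(a, b)

points = [(2, 2), (0, 0), (3, 0)]                      # toy data points
margin = min(distance(1, 1, 1, x, y) for x, y in points)
print(round(margin, 4))  # 0.7071: (0, 0) is the closest point
```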

SVM – Support Vector Machines

(Figure: support vectors define the margin; a large margin is preferred to a small one.)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

Non-Separable Case

(Handled via the Lagrangian trick.)

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set.
2. Select the K examples nearest to E in the training set.
3. Assign E to the most common class among its K nearest neighbors.

(Figure: points labelled "response" / "no response"; the new point falls in the "response" class.)

Distance Between Neighbors

Each example is represented by a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

D(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
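The slide's distance can be checked directly, with income in thousands; note how the income term dominates the other attributes, which is why KNN in practice usually normalizes features first:

```python
import math

# Euclidean distance between two attribute vectors.
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95, 3)     # age, income (K), credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)
print(round(d, 2))  # 120.15
```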

Instance-Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

Example 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel    22   50          2          Yes       sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah    63   200         1          No        sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom       59   170         1          No        sqrt[(59−37)² + (170−50)² + (1−2)²] = 122.0
Nellie    25   40          4          Yes       sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David is classified as Yes.
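The 3-NN step for David can be coded directly: rank the training customers by Euclidean distance and take the majority vote of the three nearest:

```python
import math
from collections import Counter

# Majority vote among the k training examples nearest to the query.
def knn_predict(query, examples, k=3):
    ranked = sorted(examples, key=lambda e: math.dist(query, e[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [((35, 35, 3), "No"),    # John
            ((22, 50, 2), "Yes"),   # Rachel
            ((63, 200, 1), "No"),   # Hannah
            ((59, 170, 1), "No"),   # Tom
            ((25, 40, 4), "Yes")]   # Nellie
print(knn_predict((37, 50, 2), training))  # Yes
```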

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain a prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all stored examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. A set of training examples is broken down into smaller and smaller subsets while an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions \ States of Nature  Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company       $200,000          $1,000,000         $3,000,000
Sign with TV Network          $900,000          $900,000           $900,000
Prior probabilities           0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
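The expected-return computation above, spelled out:

```python
# Expected value: probability-weighted sum of the payoffs.
def expected_value(payoffs, probs):
    return sum(p * v for p, v in zip(probs, payoffs))

probs = [0.3, 0.6, 0.1]  # small / medium / large box office
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)
ev_tv = expected_value([900_000] * 3, probs)
print(ev_movie > ev_tv)  # True: sign the movie contract (~960K vs 900K)
```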

Decision Trees
• Three types of "nodes":
– decision nodes, represented by squares;
– chance nodes, represented by circles;
– terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branches into Decision 1 and Decision 2; a chance node branches into Event 1, Event 2, and Event 3.)

Jenny Lind Decision Tree

(Figure: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node over small/medium/large box office, with payoffs $200,000 / $1,000,000 / $3,000,000 for the movie and $900,000 in every case for TV.)

(Figure: the same tree annotated with the prior probabilities 0.3, 0.6, and 0.1 on the chance branches, with the expected returns (ER) still to be computed.)

Jenny Lind Decision Tree - Solved

(Figure: the solved tree; the TV branch has ER = $900,000, the movie branch has ER = $960,000, so the root takes the movie branch with ER = $960,000.)


Evaluation Metrics

                    Predicted as healthy   Predicted as unhealthy
Actual healthy      tp                     fn
Actual not healthy  fp                     tn
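The four counts above turn into the standard metrics reported later (sensitivity, specificity, accuracy); the counts below are made-up illustration values:

```python
# Standard metrics computed from the confusion-matrix cells.
def metrics(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

m = metrics(tp=90, fn=10, fp=5, tn=95)
print(m["sensitivity"], m["specificity"], m["accuracy"])  # 0.9 0.95 0.925
```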

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
– split the data into 10 equal-sized pieces,
– train on 9 pieces and test on the remainder,
– do this for all possibilities and average.
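The 10-fold split described above can be sketched as follows, using the 150 instances implied by the counts (143 + 7): the indices are cut into 10 near-equal folds, and each fold serves once as the test set while the other 9 train the model:

```python
# Partition n example indices into k near-equal, disjoint folds.
def kfold_indices(n, k=10):
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(150, 10)
print([len(f) for f in folds])  # ten folds of 15 instances each
```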

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software; the results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• are usually not harmful;
• rarely invade the tissues around them;
• don't spread to other parts of the body;
• can be removed and usually don't grow back.

Malignant tumors:
• may be a threat to life;
• can invade nearby organs and tissues (such as the chest wall);
• can spread to other parts of the body;
• often can be removed, but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
 Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
 Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used to analyze the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes, contrasting the existing methods on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Obtained from the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; the correct distribution is benign 444 (65%) and malignant 239 (35%).
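The corrected class distribution is easy to verify from the counts on the slide (14 benign and 2 malignant records were among the 16 removed):

```python
# Recompute the class distribution after removing incomplete records.
benign = 458 - 14      # 444 benign instances remain
malignant = 241 - 2    # 239 malignant instances remain
total = benign + malignant
print(total, round(100 * benign / total), round(100 * malignant / total))
# 683 65 35
```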

Attribute                    Domain
Sample Code Number           ID number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant

EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
 Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
 Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
 It is also well suited for developing new machine learning schemes.
 Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables

Distribution of attribute values (count of instances per value, 1-10):

Attribute                    1    2    3    4   5   6   7   8   9   10   Sum
Clump Thickness              139  50   104  79  128 33  23  44  14  69   683
Uniformity of Cell Size      373  45   52   38  30  25  19  28  6   67   683
Uniformity of Cell Shape     346  58   53   43  32  29  30  27  7   58   683
Marginal Adhesion            393  58   58   33  23  21  13  25  4   55   683
Single Epithelial Cell Size  44   376  71   48  39  40  11  21  2   31   683
Bare Nuclei                  402  30   28   19  30  4   8   21  9   132  683
Bland Chromatin              150  160  161  39  34  9   71  28  11  20   683
Normal Nucleoli              432  36   42   18  19  22  16  23  15  60   683
Mitoses                      563  35   33   12  6   3   9   8   0   14   683
Sum                          2843 850  605  333 346 192 207 233 77  516

EXPERIMENTAL RESULTS

Evaluation criterion              BF Tree  IBK    SMO
Time to build model (sec)         0.97     0.02   0.33
Correctly classified instances    652      655    657
Incorrectly classified instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS (cont.)
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS (cont.)

Classifier  TP Rate  FP Rate  Precision  Recall  Class
BF Tree     0.971    0.075    0.960      0.971   Benign
            0.925    0.029    0.944      0.925   Malignant
IBK         0.980    0.079    0.958      0.980   Benign
            0.921    0.020    0.961      0.921   Malignant
SMO         0.971    0.054    0.971      0.971   Benign
            0.946    0.029    0.946      0.946   Malignant

EXPERIMENTAL RESULTS: confusion matrices (rows = actual class)

Classifier  Predicted Benign  Predicted Malignant  Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant
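The accuracy figures reported earlier follow directly from these confusion matrices: correctly classified instances divided by all 683:

```python
# Accuracy from a confusion matrix ((tp, fn), (fp, tn)).
def accuracy(conf):
    (tp, fn), (fp, tn) = conf
    return 100 * (tp + tn) / (tp + fn + fp + tn)

matrices = {"BF Tree": ((431, 13), (18, 221)),
            "IBK":     ((435, 9),  (19, 220)),
            "SMO":     ((431, 13), (13, 226))}
for name, conf in matrices.items():
    print(name, round(accuracy(conf), 2))
# BF Tree 95.46, IBK 95.9, SMO 96.19 -- matching the results table
```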

importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Importance rank
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.05950    0.464      0.210       130.244500  7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.96820    0.212      0.212       64.122733   9

CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
 SMO performs at a higher level than the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on the paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.

References


[1] US Cancer Statistics Working Group, "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).

[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.


[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.


[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.


[13] Street W.N., Wolberg W.H., Mangasarian O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 20: A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Data Mining Models and Tasks


Data mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.


Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

[Diagram: the experiment cycle — dataset → data preprocessing → feature selection → classification (data mining tool selection) → performance evaluation → results.]

Data Preprocessing
• Data in the real world is:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Quality measures: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration
– Integration of multiple databases, data cubes, or files.
• Data transformation
– Normalization and aggregation.
• Data reduction
– Obtains a reduced representation in volume that produces the same or similar analytical results.
• Data discretization
– Part of data reduction, of particular importance for numerical data.
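Two of the steps above, cleaning and transformation, can be illustrated with a minimal sketch; the attribute column and its values are invented for the example:

```python
def fill_missing(values):
    """Data cleaning: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

clump_thickness = [1, 5, None, 9]        # hypothetical attribute column
cleaned = fill_missing(clump_thickness)  # [1, 5, 5.0, 9]
normalized = min_max_normalize(cleaned)  # [0.0, 0.5, 0.5, 1.0]
```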


Feature selection
Finding a feature subset that has the most discriminative information from the original feature space.
The objectives of feature selection are:
• Improving the prediction performance of the predictors.
• Providing faster and more cost-effective predictors.
• Providing a better understanding of the underlying process that generated the data.

Feature Selection
• Transforming a dataset by removing some of its columns:
A1 A2 A3 A4 C  →  A2 A4 C
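A minimal sketch of this column-removal view of feature selection; the attribute names follow the slide, and the values are made up:

```python
# Keep only the selected attributes (A2, A4) plus the class column C,
# as in the slide's A1 A2 A3 A4 C -> A2 A4 C example.
rows = [
    {"A1": 7, "A2": 1, "A3": 4, "A4": 2, "C": "benign"},
    {"A1": 3, "A2": 9, "A3": 5, "A4": 8, "C": "malignant"},
]
selected = ["A2", "A4", "C"]
reduced = [{k: row[k] for k in selected} for row in rows]
```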


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built from the training set of known categories.
[Diagram: classification (recognition), i.e. supervised classification — samples grouped into category "A" and category "B".]

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: safe to walk or not?

Classification vs. Prediction
• Classification: predicts categorical class labels (discrete or nominal); classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data.
• Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.

Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes.
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
– The set of tuples used for model construction is the training set.
– The model is represented as classification rules, decision trees, or mathematical formulae.
• Model usage: classifying future or unknown objects.
– Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model; the test set is independent of the training set, otherwise over-fitting will occur.
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

The classification algorithm learns the model (classifier):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) — the classifier predicts tenured = 'yes'.
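The two steps above can be run end-to-end: the toy rule induced from the training data (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') is applied to the testing data and to the unseen sample (Jeff, Professor, 4):

```python
def predict_tenured(rank, years):
    """Toy classifier from the slides: professors, or anyone with
    more than 6 years of service, are predicted to be tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# (name, rank, years, actual label) -- the testing data from the slide.
testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]

predictions = {name: predict_tenured(rank, years)
               for name, rank, years, _ in testing}
correct = sum(pred == actual
              for (_, _, _, actual), pred in zip(testing, predictions.values()))
jeff = predict_tenured("Professor", 4)   # the unseen sample
```

Note that Merlisa is misclassified (the rule says "yes", her actual label is "no"): the learned model is only an approximation of the data.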

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

Quality of a classifier
• Quality will be calculated with respect to the lowest computing time.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class.
• Thus the diagonal elements represent correctly classified compounds,
• and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data


Classification Techniques
Techniques compared: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example
[Plot: humidity vs. temperature, with points marked "play tennis" and "do not play tennis".]

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
[Figure: a line representing the decision boundary ax + by − c = 0.]

Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.
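The geometric quantity behind "far from the data" can be computed directly: for a boundary ax + by − c = 0, the distance of a point to the hyper-plane is |ax + by − c| / sqrt(a² + b²), and the margin is the smallest such distance over the data. A minimal sketch; the candidate boundary and sample points are invented:

```python
import math

def distance_to_hyperplane(a, b, c, x, y):
    """Distance of point (x, y) to the line a*x + b*y - c = 0."""
    return abs(a * x + b * y - c) / math.hypot(a, b)

# Margin of a candidate boundary x + y - 1 = 0 over a toy data set:
points = [(1.0, 1.0), (2.0, 0.5), (-1.0, -0.5)]
margin = min(distance_to_hyperplane(1.0, 1.0, 1.0, x, y) for x, y in points)
```

An SVM searches over (a, b, c) for the boundary that separates the classes while making this margin as large as possible.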

SVM – Support Vector Machines
[Figure: support vectors; a small-margin vs. a large-margin separating hyper-plane.]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure: support vectors defining the maximized margin, contrasted with a narrower margin.]

Non-Separable Case


The Lagrangian trick

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples in the training set nearest to E.
• Assign E to the most common class among its K nearest neighbors.
[Figure: points labelled "response" / "no response"; the new point is assigned class "response".]

Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is defined as:

D(X, Y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95 − 215)² + (3 − 2)² ]
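The John–Rachel distance above, computed (income expressed in thousands, as on the slide):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in K, number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)   # sqrt(36 + 14400 + 1), about 120.15
```

Note that the unnormalized income term dominates the sum, which is one reason the normalization step from the preprocessing slides matters for k-NN.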

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Figure: the new point is assigned class "response" by its neighbors.]

Example: 3-Nearest Neighbors

Customer  Age  Income  No. of credit cards  Response
John      35   35K     3                    No
Rachel    22   50K     2                    Yes
Hannah    63   200K    1                    No
Tom       59   170K    1                    No
Nellie    25   40K     4                    Yes
David     37   50K     2                    ?

Distances from David:
John:   sqrt[ (35 − 37)² + (35 − 50)² + (3 − 2)² ] = 15.17
Rachel: sqrt[ (22 − 37)² + (50 − 50)² + (2 − 2)² ] = 15.00
Hannah: sqrt[ (63 − 37)² + (200 − 50)² + (1 − 2)² ] = 152.24
Tom:    sqrt[ (59 − 37)² + (170 − 50)² + (1 − 2)² ] = 122.00
Nellie: sqrt[ (25 − 37)² + (40 − 50)² + (4 − 2)² ] = 15.75

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is Yes.
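The 3-NN vote above, wrapped as a small classifier, reproduces the slide's prediction for David:

```python
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    neighbors = sorted(training, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# (age, income in K, number of credit cards) -> response, from the slide.
training = [((35, 35, 3), "No"), ((22, 50, 2), "Yes"), ((63, 200, 1), "No"),
            ((59, 170, 1), "No"), ((25, 40, 4), "Yes")]
david = (37, 50, 2)
prediction = knn_predict(training, david)
```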

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to every stored example must be calculated and compared).

Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do


Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small box office) = 0.3
– P(Medium box office) = 0.6
– P(Large box office) = 0.1

Jenny Lind - Payoff Table

                         States of Nature
Decision                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior Probabilities      0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (EVBest)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
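The expected-value arithmetic above, as a check:

```python
# Expected value of each contract, using the payoffs and probabilities
# from the Jenny Lind payoff table.
payoffs_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoff_tv = 900_000
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}

ev_movie = sum(probs[s] * payoffs_movie[s] for s in probs)   # ~960,000
ev_tv = sum(probs[s] * payoff_tv for s in probs)             # ~900,000
best = "movie" if ev_movie > ev_tv else "tv"
```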

Decision Trees
• Three types of "nodes":
– Decision nodes, represented by squares.
– Chance nodes, represented by circles.
– Terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree
[Diagram: a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2, and Event 3.]

Jenny Lind Decision Tree
[Diagram: a decision node with two branches. "Sign with Movie Co." leads to a chance node with payoffs $200,000 (small box office), $1,000,000 (medium), and $3,000,000 (large); "Sign with TV Network" leads to a chance node paying $900,000 under all three outcomes.]

Jenny Lind Decision Tree
[Diagram: the same tree with probabilities 0.3 (small), 0.6 (medium), and 0.1 (large) on the chance branches, and an expected return (ER) to be computed at each chance node.]

Jenny Lind Decision Tree - Solved
[Diagram: the solved tree. ER at the movie chance node = $960,000; ER at the TV network node = $900,000; the movie branch is chosen, so the overall ER = $960,000.]


Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn

Cross-validation
• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all possibilities and average
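A minimal sketch of the fold bookkeeping behind 10-fold cross-validation; real tools such as Weka typically also shuffle the data and stratify the folds by class before splitting:

```python
def ten_fold_indices(n, folds=10):
    """Partition indices 0..n-1 into `folds` (train, test) splits."""
    fold_size = n // folds
    splits = []
    for f in range(folds):
        test = set(range(f * fold_size, (f + 1) * fold_size))
        train = [i for i in range(n) if i not in test]
        splits.append((train, sorted(test)))
    return splits

splits = ten_fold_indices(150)   # e.g. the 150-instance run quoted above
```

Each instance appears in exactly one test fold, so the averaged accuracy uses every example for testing exactly once.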


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed and usually don't grow back.
Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed but sometimes grow back.

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, the percentages above are wrong; the right figures are benign 444 (65%) and malignant 239 (35%).
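The 16-row removal can be sketched as a filter on the raw UCI file, where missing values appear as '?'; the three comma-separated rows below are illustrative examples in the file's format:

```python
raw = [
    "1000025,5,1,1,1,2,1,3,1,1,2",     # complete record (class 2 = benign)
    "1057013,8,4,5,1,2,?,7,3,1,4",     # missing Bare Nuclei -> drop
    "1017122,8,10,10,8,7,10,9,7,1,4",  # complete record (class 4 = malignant)
]
records = [line.split(",") for line in raw]
complete = [r for r in records if "?" not in r]          # drop incomplete rows
benign = sum(1 for r in complete if r[-1] == "2")
malignant = sum(1 for r in complete if r[-1] == "4")
```

Run over the full 699-instance file, the same filter leaves the 683 instances used in the paper.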


Attribute                     Domain
Sample Code Number            id number
Clump Thickness               1–10
Uniformity of Cell Size       1–10
Uniformity of Cell Shape      1–10
Marginal Adhesion             1–10
Single Epithelial Cell Size   1–10
Bare Nuclei                   1–10
Bland Chromatin               1–10
Normal Nucleoli               1–10
Mitoses                       1–10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables

Domain                        1    2    3   4   5   6   7   8   9   10  Sum
Clump Thickness              139   50  104  79 128  33  23  44  14   69  683
Uniformity of Cell Size      373   45   52  38  30  25  19  28   6   67  683
Uniformity of Cell Shape     346   58   53  43  32  29  30  27   7   58  683
Marginal Adhesion            393   58   58  33  23  21  13  25   4   55  683
Single Epithelial Cell Size   44  376   71  48  39  40  11  21   2   31  683
Bare Nuclei                  402   30   28  19  30   4   8  21   9  132  683
Bland Chromatin              150  160  161  39  34   9  71  28  11   20  683
Normal Nucleoli              432   36   42  18  19  22  16  23  15   60  683
Mitoses                      563   35   33  12   6   3   9   8   0   14  683
Sum                         2843  850  605 333 346 192 207 233  77  516

EXPERIMENTAL RESULTS

Evaluation Criteria               BF Tree  IBK    SMO
Time to build model (in sec)      0.97     0.02   0.33
Correctly classified instances    652      655    657
Incorrectly classified instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
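Applying these definitions to the per-classifier confusion matrices reported on the following slide (taking benign as the positive class) reproduces the accuracy column of the results table:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# (tp, fn, fp, tn) with benign as the positive class,
# read off the reported confusion matrices (683 instances each).
confusion = {"BF Tree": (431, 13, 18, 221),
             "IBK": (435, 9, 19, 220),
             "SMO": (431, 13, 13, 226)}
accuracy = {name: round(metrics(*m)["accuracy"] * 100, 2)
            for name, m in confusion.items()}
```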

EXPERIMENTAL RESULTS

Classifier  TP Rate  FP Rate  Precision  Recall  Class
BF Tree     0.971    0.075    0.960      0.971   Benign
            0.925    0.029    0.944      0.925   Malignant
IBK         0.980    0.079    0.958      0.980   Benign
            0.921    0.020    0.961      0.921   Malignant
SMO         0.971    0.054    0.971      0.971   Benign
            0.946    0.029    0.946      0.946   Malignant

EXPERIMENTAL RESULTS

Classifier  Predicted Benign  Predicted Malignant  Actual Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant

importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance
Clump Thickness              378.08158    0.464      0.152       126.232526    8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026    1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323    2
Marginal Adhesion            390.0595     0.464      0.210       130.2445      7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
Bland Chromatin              453.20971    0.555      0.201       151.321903    4
Normal Nucleoli              416.63061    0.487      0.237       139.118203    6
Mitoses                      191.9682     0.212      0.212       64.122733     9
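Of the three rankings in the table, information gain is the easiest to reproduce by hand: it is the drop in class entropy obtained by splitting on an attribute. A small sketch on a fabricated toy example (not the study's data):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Entropy reduction obtained by splitting on attribute `attr`."""
    total = entropy(labels)
    remainder = 0.0
    for value, count in Counter(r[attr] for r in rows).items():
        subset = [labels[i] for i, r in enumerate(rows) if r[attr] == value]
        remainder += (count / len(rows)) * entropy(subset)
    return total - remainder

rows = [{"size": "low"}, {"size": "low"}, {"size": "high"}, {"size": "high"}]
labels = ["benign", "benign", "malignant", "malignant"]
gain = info_gain(rows, "size", labels)   # a perfect split of a 50/50 set
```

A perfectly predictive attribute on a balanced two-class set yields a gain of 1.0 bit; the table's 0.702 for Uniformity of Cell Size says that attribute removes most, but not all, of the class uncertainty.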


CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] US Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt, 30 March – 1 April 2005.
[4] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge-based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y. and Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods." International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya and K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets." 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. "Transductive inference for text classification using support vector machines." Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), pp. 2195–2207, 2003.
[13] Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, and Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates." Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., and Mangasarian, O. L. "Nuclear feature extraction for breast tumor diagnosis." Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–870.
[16] Chen, Y., Abraham, A., and Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree." Journal of Neurocomputing, 70(1–3): 305–313.
[17] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1999.
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, p. 185.

Thank you


  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 21: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Data Mining Tools

Many advanced tools for data mining are available, either as open-source or commercial software.

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

[Diagram: the data mining cycle — dataset → selection of data mining tool → data preprocessing → feature selection → classification → performance evaluation → results]

Data Preprocessing
• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques
• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization
  – Part of data reduction, but of particular importance for numerical data
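The cleaning and transformation steps above can be sketched in a few lines. A minimal illustration in Python, assuming a tiny hypothetical attribute column with None marking missing values (the attribute name and values here are illustrative, not from the paper):

```python
def impute_mean(column):
    """Data cleaning: fill missing entries (None) with the mean of observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_normalize(column):
    """Data transformation: rescale values to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

clump_thickness = [5, None, 3, 8]        # hypothetical attribute values on a 1-10 scale
filled = impute_mean(clump_thickness)    # the missing value becomes (5 + 3 + 8) / 3
scaled = min_max_normalize(filled)       # smallest value -> 0.0, largest -> 1.0
```

Real tools such as Weka apply the same ideas through configurable filters rather than hand-written code.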


Feature selection
Finding a feature subset that has the most discriminative information from the original feature space.
The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection
• Transforming a dataset by removing some of its columns:

  A1 A2 A3 A4 C  →  A2 A4 C
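The column-removal view of feature selection (A1 A2 A3 A4 C → A2 A4 C) is just a projection of the dataset onto the selected attributes. A sketch with hypothetical rows:

```python
rows = [
    {"A1": 1, "A2": 7, "A3": 0, "A4": 2, "C": "benign"},
    {"A1": 4, "A2": 1, "A3": 9, "A4": 5, "C": "malignant"},
]  # hypothetical dataset with attributes A1-A4 and class C

selected = ["A2", "A4", "C"]  # attributes kept by the selector, plus the class
reduced = [{k: r[k] for k in selected} for r in rows]
# reduced[0] == {"A2": 7, "A4": 2, "C": "benign"}
```

Which columns to keep is decided by a ranking criterion (e.g. chi-squared or information gain, as in the experiments later in this deck).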


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on a model built from the training set of known categories.

[Diagram: points labeled category "A" and category "B" — classification (recognition) as supervised classification]

Classification
• Every day, all the time, we classify things.
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not?


Classification vs. Prediction
• Classification:
  – predicts categorical class labels (discrete or nominal)
  – classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values

Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
  – Estimate the accuracy of the model
    • The known label of each test sample is compared with the classified result from the model
    • The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    • The test set is independent of the training set, otherwise over-fitting will occur
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

The classification algorithm learns the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured?

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier
• Quality is assessed with respect to, among other factors, the lowest computing time.
• The quality of a model can be described by its confusion matrix.
• A confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

Common classification techniques include:
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.


Tennis example

[Scatter plot: humidity vs. temperature; points marked "play tennis" and "do not play tennis"]

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

[Ch. 15]

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.

SVM – Support Vector Machines

[Figure: two separating hyperplanes with their support vectors — one with a small margin, one with a large margin]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: support vectors on the maximized margin vs. a narrower margin; Sec. 15.1]

Non-Separable Case

[Figure: the non-separable case, handled via the Lagrangian trick]

SVM
• Relatively new concept
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

Classification Model: K-Nearest Neighbor classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: points labeled "response" / "no response"; the new point is assigned class "response"]

Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

  D(X, Y) = sqrt( Σᵢ (xᵢ − yᵢ)² ),  i = 1 … n

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[(35 − 41)² + (95K − 215K)² + (3 − 2)²]
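The distance on the slide can be checked directly; a minimal sketch, with incomes expressed in $K to match the (95K − 215K)² term:

```python
import math

def euclidean(x, y):
    """D(X, Y) = sqrt(sum_i (x_i - y_i)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95, 3)     # age, income in $K, number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)  # sqrt(36 + 14400 + 1) = sqrt(14437) ≈ 120.15
```

Note that the income term dominates because the attributes are on very different scales, which is why normalization (see the preprocessing slides) matters for KNN.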

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: points labeled "response" / "no response"; the new point is assigned class "response"]

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Distances from David:
John:   sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²] = 15.16
Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15
Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.23
Tom:    sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122
Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes, so David's predicted response is Yes.
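A minimal 3-NN vote over the table above reproduces the prediction for David:

```python
import math
from collections import Counter

# Training examples from the slide: (age, income in $K, no. of credit cards) -> response
training = [
    ((35, 35, 3), "No"),   # John
    ((22, 50, 2), "Yes"),  # Rachel
    ((63, 200, 1), "No"),  # Hannah
    ((59, 170, 1), "No"),  # Tom
    ((25, 40, 4), "Yes"),  # Nellie
]

def knn_predict(query, k=3):
    """Assign the most common class among the k nearest training examples."""
    dist = lambda x: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
    nearest = sorted(training, key=lambda ex: dist(ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

david = (37, 50, 2)
print(knn_predict(david))  # the 3 nearest are Rachel, John, Nellie -> "Yes"
```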

Strengths and Weaknesses
Strengths:
• Simple to implement and use
• Comprehensible – easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared)

Decision Tree
– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do


Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decisions               | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
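The two expected values above can be verified with a one-line helper:

```python
def expected_value(payouts, probs):
    """Probability-weighted sum of payouts."""
    return sum(p * v for p, v in zip(probs, payouts))

probs = (0.3, 0.6, 0.1)  # small / medium / large box office
ev_movie = expected_value((200_000, 1_000_000, 3_000_000), probs)  # 960000.0
ev_tv = expected_value((900_000, 900_000, 900_000), probs)         # 900000.0
```

Since ev_movie > ev_tv, the expected-return criterion picks the movie contract, matching the slide.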

Decision Trees
• Three types of "nodes":
  – Decision nodes – represented by squares (□)
  – Chance nodes – represented by circles (○)
  – Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Figure: a decision node branching to Decision 1 and Decision 2; a chance node branching to Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Figure: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with outcomes Small / Medium / Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 for the movie and $900,000 for each TV outcome]

Jenny Lind Decision Tree

[Figure: the same tree annotated with probabilities 0.3, 0.6, and 0.1 on the chance branches and an expected return (ER) to be computed at each chance node]

Jenny Lind Decision Tree – Solved

[Figure: the solved tree; the movie chance node has ER = $960,000 and the TV chance node ER = $900,000, so the decision node takes ER = $960,000: sign with the movie company]


Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
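The fold-splitting step can be sketched as below; this is a plain round-robin partition for illustration (Weka's default additionally stratifies the folds by class):

```python
def k_fold_indices(n, k=10):
    """Partition indices 0..n-1 into k disjoint folds (round-robin assignment)."""
    return [list(range(i, n, k)) for i in range(k)]

folds = k_fold_indices(150, k=10)  # e.g. the 143 + 7 = 150 instances above
test_fold = folds[0]               # held out for testing in the first round
train_idx = [i for f in folds[1:] for i in f]  # train on the other 9 folds
# Repeat with each fold as the test set and average the accuracies.
```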


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachia et al. used naïve Bayes, decision tree, and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets, cardiotocography1 and cardiotocography2, and other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                   | Domain
Sample Code Number          | Id Number
Clump Thickness             | 1 – 10
Uniformity of Cell Size     | 1 – 10
Uniformity of Cell Shape    | 1 – 10
Marginal Adhesion           | 1 – 10
Single Epithelial Cell Size | 1 – 10
Bare Nuclei                 | 1 – 10
Bland Chromatin             | 1 – 10
Normal Nucleoli             | 1 – 10
Mitoses                     | 1 – 10
Class                       | 2 for Benign, 4 for Malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Figure-only slides]

importance of the input variables

Attribute                   |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             | 139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     | 373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    | 346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           | 393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |  44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 | 402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             | 150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             | 432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     | 563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843| 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in sec)     | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
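Plugging the SMO confusion matrix reported below into these definitions (treating benign as the positive class: TP = 431, FN = 13, FP = 13, TN = 226) reproduces the reported figures:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO confusion matrix, with benign as the positive class
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
# round(sens, 3) == 0.971, round(spec, 3) == 0.946, round(acc, 4) == 0.9619
```

The accuracy matches 657/683 = 96.19% from the comparison table, and sensitivity/specificity match SMO's per-class recall values.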

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
IBK        | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree    | 431              | 13                  | Benign
BF Tree    | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
IBK        | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
SMO        | 13               | 226                 | Malignant

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.210      | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9

CONCLUSION
• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 22: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Weka
• Waikato Environment for Knowledge Analysis.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
• Found only on the islands of New Zealand, the weka is a flightless bird with an inquisitive nature.

[Diagram: the data mining cycle — dataset → data preprocessing → feature selection → classification → performance evaluation → results, driven by the selected data mining tool.]

Data Preprocessing
• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Quality measures: accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

Preprocessing techniques
• Data cleaning – fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
• Data integration – integration of multiple databases, data cubes, or files.
• Data transformation – normalization and aggregation.
• Data reduction – obtains a representation reduced in volume that produces the same or similar analytical results.
• Data discretization – part of data reduction, of particular importance for numerical data.
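Two of the cleaning and transformation steps above can be sketched in a few lines. This is an illustrative sketch, not the paper's actual pipeline; the column name and toy values are made up.

```python
def fill_missing(column):
    """Replace None entries with the mean of the observed values (data cleaning)."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values linearly into [0, 1] (normalization)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

clump_thickness = [5, None, 3, 10, 2]    # toy column with one missing value
cleaned = fill_missing(clump_thickness)  # missing value becomes the mean, 5.0
scaled = min_max_scale(cleaned)          # all values now lie in [0, 1]
```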


Feature selection: finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• improving the prediction performance of the predictors
• providing faster and more cost-effective predictors
• providing a better understanding of the underlying process that generated the data

Feature Selection
• Transforming a dataset by removing some of its columns:
  A1 A2 A3 A4 C  →  A2 A4 C
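A minimal sketch of a filter-style feature selector: score each column by how far apart the per-class means are (a stand-in for the chi-squared and info-gain rankings used later in the paper), then keep the top-k columns. The scoring rule and toy data are assumptions for illustration.

```python
def score(feature, labels):
    """Absolute difference between the per-class means of one feature column."""
    a = [x for x, y in zip(feature, labels) if y == 0]
    b = [x for x, y in zip(feature, labels) if y == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def select_top_k(columns, labels, k):
    """Return the indices of the k highest-scoring columns."""
    ranked = sorted(range(len(columns)),
                    key=lambda i: score(columns[i], labels),
                    reverse=True)
    return sorted(ranked[:k])

# Toy data: column 1 separates the two classes well, column 0 does not.
cols = [[1, 2, 1, 2], [1, 1, 9, 9]]
labels = [0, 0, 1, 1]
kept = select_top_k(cols, labels, 1)
```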


Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
• New data is classified based on a model built from the training set of known categories (e.g. category "A" vs. category "B" — supervised classification / recognition).

Classification
• Every day, all the time, we classify things.
• E.g. crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not.


Classification vs. Prediction
• Classification: predicts categorical class labels (discrete or nominal); classifies data by constructing a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
• Prediction: models continuous-valued functions, i.e. predicts unknown or missing values.


Classification — A Two-Step Process
1. Model construction: describing a set of predetermined classes.
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
  – The set of tuples used for model construction is the training set.
  – The model is represented as classification rules, decision trees, or mathematical formulae.
2. Model usage: classifying future or unknown objects.
  – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples correctly classified by the model.
  – The test set must be independent of the training set, otherwise over-fitting will occur.
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.


Classification Process (1): Model Construction

Training data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

A classification algorithm learns the classifier (model), e.g.:
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? yes

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier
• Quality is calculated with respect to the lowest computing time.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.
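The confusion matrix described above can be built directly from paired actual/predicted labels. This is a minimal sketch; the class names and toy labels are illustrative.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are predicted classes, columns are actual classes,
    so the diagonal holds the correctly classified instances."""
    m = {p: {a: 0 for a in classes} for p in classes}
    for a, p in zip(actual, predicted):
        m[p][a] += 1
    return m

actual    = ["benign", "benign", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign"]
cm = confusion_matrix(actual, predicted, ["benign", "malignant"])
# diagonal: 2 benign and 2 malignant classified correctly;
# off-diagonal: one benign instance misclassified as malignant
```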

Classification Techniques
Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques
Techniques include: Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.

Classification Model: Support Vector Machine classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine that has been used extensively as a tool for data classification, function approximation, etc., owing to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example
[Figure: points plotted by humidity and temperature; ● = play tennis, ○ = do not play tennis.]

Linear classifiers: which hyperplane?
• There are many possible solutions for a, b, c in the decision boundary ax + by − c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• A Support Vector Machine (SVM) finds an optimal solution:
  – it maximizes the distance between the hyperplane and the "difficult" points close to the decision boundary;
  – one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane "far" from the data.

SVM – Support Vector Machines
[Figure: two separators over the same support vectors, one with a small margin and one with a large margin; SVM chooses the large-margin separator.]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving an SVM is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure: support vectors sitting on the maximized margin; a narrower margin gives a worse separator.]

Non-Separable Case
In the non-separable case, slack variables are introduced and the optimization is still solved with the Lagrangian trick.

SVM
• A relatively new concept with nice generalization properties.
• Hard to learn: trained in batch mode using quadratic programming techniques.
• Using kernels, SVMs can learn very complex functions.
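The margin-maximization idea can be illustrated with a toy linear SVM trained by sub-gradient descent on the regularized hinge loss. This is a hedged sketch only; the paper itself uses the SMO solver, and the four data points, learning rate, and regularization constant here are made up.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on the hinge loss max(0, 1 - y*(w.x + b)) + lam*|w|^2/2."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:                       # inside the margin: hinge gradient step
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:                                # outside the margin: only regularize w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

# Four linearly separable points, labels in {+1, -1}
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in points]
```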

Classification Model: K-Nearest Neighbor classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples in the training set nearest to E.
• Assign E to the most common class among its K nearest neighbors.
[Figure: a query point surrounded by "response" and "no response" neighbors; predicted class = response.]

Distance Between Neighbors
• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

  D(X, Y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
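The distance formula above, directly in code, applied to the John/Rachel example from the slide (income measured in thousands):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute tuples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income (K), number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)   # sqrt(6^2 + 120^2 + 1^2) ≈ 120.15
```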

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

Example: 3-Nearest Neighbors

  Customer  Age  Income  No. credit cards  Response
  John      35   35K     3                 No
  Rachel    22   50K     2                 Yes
  Hannah    63   200K    1                 No
  Tom       59   170K    1                 No
  Nellie    25   40K     4                 Yes
  David     37   50K     2                 ?

Distances from David:
  John:   sqrt[(35−37)² + (35−50)² + (3−2)²]  = 15.16
  Rachel: sqrt[(22−37)² + (50−50)² + (2−2)²]  = 15
  Hannah: sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
  Tom:    sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
  Nellie: sqrt[(25−37)² + (40−50)² + (4−2)²]  = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is Yes.

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain a prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples).
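The David example above can be reproduced with a few lines of 3-NN code; this sketch uses the slide's attribute values (income in thousands).

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_tuple, label); return majority label of k nearest."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((35, 35, 3), "No"), ((22, 50, 2), "Yes"), ((63, 200, 1), "No"),
         ((59, 170, 1), "No"), ((25, 40, 4), "Yes")]
david = (37, 50, 2)
prediction = knn_predict(train, david)   # nearest: Rachel, John, Nellie -> "Yes"
```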

Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1

Jenny Lind – Payoff Table

  Decision                 Small Box Office  Medium Box Office  Large Box Office
  Sign with movie company  $200,000          $1,000,000         $3,000,000
  Sign with TV network     $900,000          $900,000           $900,000
  Prior probability        0.3               0.6                0.1

Using Expected Return Criteria

  EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000  (the best expected value)
  EV(TV)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
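The expected-value computation above is a one-line weighted sum; a quick sketch:

```python
def expected_value(payoffs, probabilities):
    """Probability-weighted sum of payoffs."""
    return sum(p * v for p, v in zip(probabilities, payoffs))

probs = [0.3, 0.6, 0.1]                    # small / medium / large box office
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)  # 960,000
ev_tv = expected_value([900_000, 900_000, 900_000], probs)         # 900,000
best = "movie" if ev_movie > ev_tv else "tv"
```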

Decision Trees
• Three types of nodes:
  – decision nodes, represented by squares (□)
  – chance nodes, represented by circles (○)
  – terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree
[Figure: a decision node branching into Decision 1 and Decision 2; each decision leads to a chance node over Events 1–3.]

Jenny Lind Decision Tree
[Figure: a decision node with two branches, "Sign with movie co." and "Sign with TV network"; each leads to a chance node over small/medium/large box office, with payoffs $200,000 / $1,000,000 / $3,000,000 on the movie branch and $900,000 / $900,000 / $900,000 on the TV branch.]

Jenny Lind Decision Tree (with probabilities)
[Figure: the same tree with branch probabilities 0.3 / 0.6 / 0.1 attached to each chance node, and expected-return (ER) labels to be computed.]

Jenny Lind Decision Tree – Solved
[Figure: the chance nodes evaluated — ER(movie) = $960,000, ER(TV) = $900,000; the TV branch is pruned and the movie branch is kept, so ER = $960,000.]


Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
  Actual healthy    tp                    fn
  Actual unhealthy  fp                    tn

Cross-validation
• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – split the data into 10 equal-sized pieces
  – train on 9 pieces and test on the remainder
  – do this for all possibilities and average
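The 10-fold split described above can be sketched by hand: each instance lands in exactly one test fold, and the model is trained on the other nine. This is an illustrative sketch of the splitting step only, not Weka's implementation.

```python
def k_fold_splits(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        stop = start + fold_size if i < k - 1 else n   # last fold takes the remainder
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

splits = list(k_fold_splits(150, k=10))
# 10 folds; every instance is tested exactly once and trained on 9 times.
```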

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  – are usually not harmful
  – rarely invade the tissues around them
  – don't spread to other parts of the body
  – can be removed, and usually don't grow back
• Malignant tumors:
  – may be a threat to life
  – can invade nearby organs and tissues (such as the chest wall)
  – can spread to other parts of the body
  – often can be removed, but sometimes grow back

Risk factors
• Gender, age, genetic risk factors, family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography 1 and cardiotocography 2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution of the original 699 instances: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so for the 683-instance dataset the correct figures are benign 444 (65%) and malignant 239 (35%).
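The clean-up step above (dropping rows with missing values, then counting classes) can be sketched on the file's comma-separated format, where missing values are '?' and the last field is the class (2 = benign, 4 = malignant). The three sample rows below are illustrative stand-ins for the real file's contents.

```python
raw_rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",       # benign
    "1057013,8,4,5,1,2,?,7,3,1,4",       # malignant, missing Bare Nuclei value
    "1017122,8,10,10,8,7,10,9,7,1,4",    # malignant
]

# Drop rows containing a missing value, as done to get the 683-instance set.
clean = [row.split(",") for row in raw_rows if "?" not in row]
benign = sum(1 for row in clean if row[-1] == "2")
malignant = sum(1 for row in clean if row[-1] == "4")
```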

  Attribute                    Domain
  Sample code number           id number
  Clump thickness              1–10
  Uniformity of cell size      1–10
  Uniformity of cell shape     1–10
  Marginal adhesion            1–10
  Single epithelial cell size  1–10
  Bare nuclei                  1–10
  Bland chromatin              1–10
  Normal nucleoli              1–10
  Mitoses                      1–10
  Class                        2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS
[Slides 88–89: result charts for the three classifiers; the figures are summarized in the tables that follow.]

Importance of the input variables

Distribution of attribute values (counts per domain value 1–10; each row sums to 683):

  Attribute                      1    2    3    4    5    6    7    8    9   10  Sum
  Clump thickness              139   50  104   79  128   33   23   44   14   69  683
  Uniformity of cell size      373   45   52   38   30   25   19   28    6   67  683
  Uniformity of cell shape     346   58   53   43   32   29   30   27    7   58  683
  Marginal adhesion            393   58   58   33   23   21   13   25    4   55  683
  Single epithelial cell size   44  376   71   48   39   40   11   21    2   31  683
  Bare nuclei                  402   30   28   19   30    4    8   21    9  132  683
  Bland chromatin              150  160  161   39   34    9   71   28   11   20  683
  Normal nucleoli              432   36   42   18   19   22   16   23   15   60  683
  Mitoses                      563   35   33   12    6    3    9    8    0   14  683
  Sum                         2843  850  605  333  346  192  207  233   77  516

(Note: the source labeled the seventh row "Bare Nuclei" a second time; from the attribute list it is Bland chromatin.)

EXPERIMENTAL RESULTS

  Evaluation criteria               BF Tree  IBK    SMO
  Time to build model (s)           0.97     0.02   0.33
  Correctly classified instances    652      655    657
  Incorrectly classified instances  31       28     26
  Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
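The three formulas above, applied to SMO's confusion matrix from the paper, with benign taken as the positive class (TP = 431, FN = 13, FP = 13, TN = 226):

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR), and accuracy from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
# acc = 657/683 ≈ 0.9619, matching the 96.19% reported for SMO
```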

EXPERIMENTAL RESULTS

  Classifier  Class      TP rate  FP rate  Precision  Recall
  BF Tree     Benign     0.971    0.075    0.96       0.971
              Malignant  0.925    0.029    0.944      0.925
  IBK         Benign     0.98     0.079    0.958      0.98
              Malignant  0.921    0.02     0.961      0.921
  SMO         Benign     0.971    0.054    0.971      0.971
              Malignant  0.946    0.029    0.946      0.946

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class; columns: predicted class):

  Classifier  Actual class  Predicted benign  Predicted malignant
  BF Tree     Benign        431               13
              Malignant     18                221
  IBK         Benign        435               9
              Malignant     19                220
  SMO         Benign        431               13
              Malignant     13                226

Importance of the input variables

  Variable                     Chi-squared  Info gain  Gain ratio  Average rank  Importance
  Clump thickness              378.08158    0.464      0.152       126.23252     8
  Uniformity of cell size      539.79308    0.702      0.3         180.265026    1
  Uniformity of cell shape     523.07097    0.677      0.272       174.67332     2
  Marginal adhesion            390.0595     0.464      0.21        130.2445      7
  Single epithelial cell size  447.86118    0.534      0.233       149.542726    5
  Bare nuclei                  489.00953    0.603      0.303       163.305176    3
  Bland chromatin              453.20971    0.555      0.201       151.32190     4
  Normal nucleoli              416.63061    0.487      0.237       139.11820     6
  Mitoses                      191.9682     0.212      0.212       64.122733     9

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277–0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] US Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 23: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

[Diagram: performance evaluation cycle - Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Results]

Data Preprocessing
• Data in the real world is:
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data; data quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility

Preprocessing techniques
• Data cleaning - fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration - integration of multiple databases, data cubes, or files
• Data transformation - normalization and aggregation
• Data reduction - obtains a reduced representation in volume that produces the same or similar analytical results
• Data discretization - part of data reduction, of particular importance for numerical data
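The deck's experiments use Weka rather than hand-written code, but the data-cleaning step above (filling in missing values) can be sketched in plain Python; the tiny table here is illustrative:

```python
# Sketch: fill missing values (None) in one numeric column with the
# mean of the observed values in that column. Illustrative data only.
def fill_missing_with_mean(rows, col):
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [r[:col] + [mean] + r[col + 1:] if r[col] is None else r
            for r in rows]

rows = [[1.0, 4.0], [3.0, None], [5.0, 8.0]]
cleaned = fill_missing_with_mean(rows, 1)  # mean over {4.0, 8.0} is 6.0
```

In the paper's pipeline the 16 Wisconsin instances with missing values were simply dropped rather than imputed; imputation is the alternative cleaning strategy.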

[Diagram: performance evaluation cycle - Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Results]

Feature selection
Finding a feature subset that has the most discriminative information from the original feature space.
The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection
• Transforming a dataset by removing some of its columns, keeping only the selected attributes, e.g. (A1, A2, A3, A4, C) → (A2, A4, C)
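As a minimal illustration (not from the paper), feature selection as column removal can be sketched like this, mirroring the (A1, A2, A3, A4, C) → (A2, A4, C) example above; the column names and data are made up:

```python
# Sketch: feature selection as column removal - keep only the selected
# attributes and the class column. Names and values are illustrative.
HEADER = ["A1", "A2", "A3", "A4", "C"]

def select_columns(rows, header, keep):
    idx = [header.index(name) for name in keep]
    return [[row[i] for i in idx] for row in rows]

rows = [[1, 2, 3, 4, "benign"], [5, 6, 7, 8, "malignant"]]
reduced = select_columns(rows, HEADER, ["A2", "A4", "C"])
```

Which columns to keep is decided by a ranking criterion such as the chi-squared and information-gain scores reported later in the deck.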

[Diagram: performance evaluation cycle - Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Results]

Supervised Learning
• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the model built from the training set of known categories

[Figure: two labeled clusters, Category "A" and Category "B" - classification (recognition), i.e. supervised classification]

Classification
• Every day, all the time, we classify things
• E.g. crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: safe to walk or not


Classification vs. Prediction
• Classification:
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
– models continuous-valued functions, i.e. predicts unknown or missing values

Classification - A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future or unknown objects
– Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples correctly classified by the model
– The test set is independent of the training set, otherwise over-fitting will occur
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

The classification algorithm learns a classifier (model) from the training data, e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen data: (Jeff, Professor, 4) → Tenured?
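The learned rule on the slide can be sketched directly as a classifier and applied to the unseen tuple (Jeff, Professor, 4); this is only an illustration of the two-step process, not code from the paper:

```python
# Sketch of the rule learned in the model-construction step above.
def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

testing = [("Tom", "Assistant Prof", 2), ("Merlisa", "Associate Prof", 7),
           ("George", "Professor", 5), ("Joseph", "Assistant Prof", 7)]
predictions = {name: predict_tenured(rank, years)
               for name, rank, years in testing}
jeff = predict_tenured("Professor", 4)  # unseen data
```

Note that the model predicts "yes" for Merlisa while her known label is "no"; comparing such predictions against the known test labels is exactly the accuracy-estimation step described above.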

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances
• Classification analysis is the organization of data into given classes
• These approaches normally use a training set where all objects are already associated with known class labels
• The classification algorithm learns from the training set and builds a model
• The model is then used to classify new objects

Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a classifier
• Quality is also assessed with respect to computing time (lower is better)
• The quality of a model can be described by a confusion matrix
• The confusion matrix shows the predictive ability of the method on new entries
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds
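Reading accuracy off a confusion matrix, as described above, can be sketched in a few lines; the matrix values here are illustrative, not the paper's results:

```python
# Sketch: rows are predicted classes, columns are actual classes, as in
# the description above; the diagonal holds correct classifications.
matrix = [[50, 3],   # predicted class 0: 50 actually 0, 3 actually 1
          [2, 45]]   # predicted class 1: 2 actually 0, 45 actually 1

correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal sum
total = sum(sum(row) for row in matrix)
accuracy = correct / total  # fraction correctly classified
```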

Classification Techniques
• Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research
• The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

Classification Techniques

[Diagram: classification techniques covered - Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK]

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set

Support Vector Machine (SVM)

[Figure slide: SVM illustration; the accompanying text repeats the description above]

Tennis example

[Scatter plot: humidity vs. temperature, with points labeled "play tennis" and "do not play tennis"]

Linear classifiers: Which Hyperplane?
• The decision boundary is the line ax + by - c = 0; there are lots of possible solutions for a, b, c
• Some methods find a separating hyperplane, but not the optimal one
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions
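The margin intuition above rests on the point-to-hyperplane distance: a point (x0, y0) lies at distance |a·x0 + b·y0 - c| / sqrt(a² + b²) from the boundary ax + by - c = 0. A minimal sketch with illustrative coefficients:

```python
# Sketch: distance from a point to the decision boundary ax + by - c = 0.
# The coefficients and the point below are illustrative only.
import math

def distance_to_boundary(a, b, c, x0, y0):
    return abs(a * x0 + b * y0 - c) / math.hypot(a, b)

d = distance_to_boundary(1.0, 1.0, 2.0, 2.0, 2.0)  # |2 + 2 - 2| / sqrt(2)
```

An SVM chooses a, b, c so that the smallest such distance over the training points (the margin) is as large as possible.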

Selection of a Good Hyper-Plane
• Objective: select a 'good' hyper-plane using only the data
• Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane 'far' from the data

SVM - Support Vector Machines

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors lie on the margin]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane
• The decision function is fully specified by a subset of the training samples, the support vectors
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification method

[Figure: support vectors on the maximized margin, contrasted with a narrower margin]

Non-Separable Case
• The Lagrangian trick

[Figure: slack variables allow some points to violate the margin in the non-separable case]

SVM
• Relatively new concept
• Nice generalization properties
• Hard to learn - learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are"
• A new example is assigned to the most common class among the K examples that are most similar to it

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set
• Select the K nearest examples to E in the training set
• Assign E to the most common class among its K nearest neighbors

[Figure: "response" / "no response" points around the query example; assigned class: response]

Distance Between Neighbors
• Each example is represented with a set of numerical attributes
• "Closeness" is defined in terms of the Euclidean distance between two examples
• The Euclidean distance between X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is defined as:

D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

• Example: John (age = 35, income = 95K, no. of credit cards = 3) and Rachel (age = 41, income = 215K, no. of credit cards = 2):

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
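The John/Rachel distance above can be checked with a small sketch (income taken in thousands):

```python
# Sketch: Euclidean distance D(X, Y) = sqrt(sum_i (x_i - y_i)^2),
# applied to the John/Rachel example from the slide.
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income (K), number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)  # sqrt(36 + 14400 + 1)
```

Because the income term dominates the sum, attributes are typically normalized to a common scale before computing such distances.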

Instance Based Learning
• No model is built: store all training examples
• Any processing is delayed until a new instance must be classified

[Figure: "response" / "no response" neighbors around the query example; assigned class: respond]

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John | 35 | 35K | 3 | No
Rachel | 22 | 50K | 2 | Yes
Hannah | 63 | 200K | 1 | No
Tom | 59 | 170K | 1 | No
Nellie | 25 | 40K | 4 | Yes
David | 37 | 50K | 2 | ?

Distances from David (income in thousands):
• John: sqrt[(35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2] ≈ 15.16
• Rachel: sqrt[(22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2] = 15.00
• Hannah: sqrt[(63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2] ≈ 152.23
• Tom: sqrt[(59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2] ≈ 122.00
• Nellie: sqrt[(25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2] ≈ 15.75

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David is classified as Yes.
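The worked 3-NN example above can be reproduced end to end in a short sketch:

```python
# Sketch: 3-nearest-neighbor classification of David, using the
# customer table from the slide (income in thousands).
import math
from collections import Counter

train = [  # name, (age, income K, no. cards), response
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

dists = sorted((euclidean(feats, david), label) for _, feats, label in train)
k_nearest = [label for _, label in dists[:3]]         # Rachel, John, Nellie
prediction = Counter(k_nearest).most_common(1)[0][0]  # majority vote
```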

Strengths and Weaknesses
Strengths:
• Simple to implement and use
• Comprehensible - easy to explain the prediction
• Robust to noisy data by averaging k-nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small box office) = 0.3
– P(Medium box office) = 0.6
– P(Large box office) = 0.1

Jenny Lind - Payoff Table

Decision | Small Box Office | Medium Box Office | Large Box Office
Sign with movie company | $200,000 | $1,000,000 | $3,000,000
Sign with TV network | $900,000 | $900,000 | $900,000
Prior probabilities | 0.3 | 0.6 | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (= EV_UII, or EV_Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
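The expected-return arithmetic above can be checked with a short sketch:

```python
# Sketch: expected value of each of Jenny Lind's options, using the
# payoffs and prior probabilities from the payoff table above.
payoffs_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoff_tv = 900_000
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}

ev_movie = sum(probs[s] * payoffs_movie[s] for s in probs)  # expected $960,000
ev_tv = sum(probs[s] * payoff_tv for s in probs)            # expected $900,000
best = "movie" if ev_movie > ev_tv else "tv"
```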

Decision Trees
• Three types of "nodes":
– Decision nodes, represented by squares
– Chance nodes, represented by circles
– Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes
• Create the tree from left to right; solve the tree from right to left

Example Decision Tree

[Figure: a decision node branching into Decision 1 and Decision 2, and a chance node branching into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Figure: a decision node with two branches - "Sign with Movie Co." leading to a chance node with Small ($200,000), Medium ($1,000,000), and Large ($3,000,000) box-office outcomes, and "Sign with TV Network" leading to a chance node paying $900,000 in all outcomes]

Jenny Lind Decision Tree

[Figure: the same tree annotated with probabilities 0.3, 0.6, and 0.1 on the chance branches; the expected return (ER) at each chance node is still to be computed]

Jenny Lind Decision Tree - Solved

[Figure: the solved tree - the movie-contract chance node has ER = $960,000, the TV-network chance node has ER = $900,000, and the decision node selects the movie contract with ER = $960,000]

[Diagram: performance evaluation cycle - Dataset → Data preprocessing → Feature selection → Classification (data mining tool selection) → Results]

Evaluation Metrics

 | Predicted as healthy | Predicted as unhealthy
Actual healthy | tp | fn
Actual not healthy | fp | tn

Cross-validation
• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default is 10-fold cross-validation, i.e.:
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all 10 possibilities and average the results
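The index bookkeeping behind 10-fold cross-validation, as described above, can be sketched like this (the fold assignment scheme is one simple choice among several):

```python
# Sketch: split n instance indices into k folds; each fold is used once
# as the test set while the other k-1 folds form the training set.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = k_fold_indices(150, 10)  # e.g. 150 instances -> 10 folds of 15
```

Averaging the 10 per-fold accuracies gives the reported estimate; 143 of 150 correct corresponds to the 95.33% figure above.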

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques
• The goal is to develop accurate prediction models for breast cancer using data mining techniques
• Three classification techniques are compared in the Weka software; the comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods

Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children
• Benign tumors:
– Are usually not harmful
– Rarely invade the tissues around them
– Don't spread to other parts of the body
– Can be removed and usually don't grow back
• Malignant tumors:
– May be a threat to life
– Can invade nearby organs and tissues (such as the chest wall)
– Can spread to other parts of the body
– Often can be removed, but sometimes grow back

Risk factors
• Gender, age
• Genetic risk factors, family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are classification accuracy and error rate. The results show that the efficiency of the logistic classification function is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg
• 2 classes (malignant and benign) and 9 integer-valued attributes
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances
• Class distribution as reported: benign 458 (65.5%), malignant 241 (34.5%)
• Note: 2 malignant and 14 benign instances were excluded, so the reported percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%)
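The corrected class-distribution percentages noted above can be re-checked in two lines:

```python
# Sketch: class distribution of the 683-instance dataset after removing
# the 16 instances with missing values (benign 444, malignant 239).
benign, malignant = 444, 239
total = benign + malignant                      # 683 instances
benign_pct = round(100 * benign / total)        # ~65%
malignant_pct = round(100 * malignant / total)  # ~35%
```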

Attribute | Domain
Sample code number | ID number
Clump Thickness | 1-10
Uniformity of Cell Size | 1-10
Uniformity of Cell Shape | 1-10
Marginal Adhesion | 1-10
Single Epithelial Cell Size | 1-10
Bare Nuclei | 1-10
Bland Chromatin | 1-10
Normal Nucleoli | 1-10
Mitoses | 1-10
Class | 2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9
• WEKA is a collection of machine learning algorithms for data mining tasks
• The algorithms can either be applied directly to a dataset or called from your own Java code
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection
• It is also well suited for developing new machine learning schemes
• WEKA is open source software, issued under the GNU General Public License

EXPERIMENTAL RESULTS

[Chart slides: visualizations of the experimental results]

Importance of the input variables (frequency of each attribute value, 1-10):

Attribute | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum
Clump Thickness | 139 | 50 | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69 | 683
Uniformity of Cell Size | 373 | 45 | 52 | 38 | 30 | 25 | 19 | 28 | 6 | 67 | 683
Uniformity of Cell Shape | 346 | 58 | 53 | 43 | 32 | 29 | 30 | 27 | 7 | 58 | 683
Marginal Adhesion | 393 | 58 | 58 | 33 | 23 | 21 | 13 | 25 | 4 | 55 | 683
Single Epithelial Cell Size | 44 | 376 | 71 | 48 | 39 | 40 | 11 | 21 | 2 | 31 | 683
Bare Nuclei | 402 | 30 | 28 | 19 | 30 | 4 | 8 | 21 | 9 | 132 | 683
Bland Chromatin | 150 | 160 | 161 | 39 | 34 | 9 | 71 | 28 | 11 | 20 | 683
Normal Nucleoli | 432 | 36 | 42 | 18 | 19 | 22 | 16 | 23 | 15 | 60 | 683
Mitoses | 563 | 35 | 33 | 12 | 6 | 3 | 9 | 8 | 0 | 14 | 683
Sum | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

(The original table listed "Bare Nuclei" twice; the second such row is the Bland Chromatin distribution.)

EXPERIMENTAL RESULTS

Evaluation criteria | BF Tree | IBK | SMO
Time to build model (sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46 | 95.90 | 96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN)
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP)
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN)
• True positive (TP) = number of positive samples correctly predicted
• False negative (FN) = number of positive samples wrongly predicted
• False positive (FP) = number of negative samples wrongly predicted as positive
• True negative (TN) = number of negative samples correctly predicted
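Applying these definitions to the SMO confusion matrix reported in the results (431, 13, 13, 226, taking benign as the positive class) reproduces the 96.19% accuracy figure; a minimal sketch:

```python
# Sketch: sensitivity, specificity and accuracy from the SMO confusion
# matrix in the results, with benign as the positive class.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                # TPR, benign recall
specificity = tn / (tn + fp)                # TNR, malignant recall
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 657 / 683
```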

EXPERIMENTAL RESULTS

Classifier | Class | TP rate | FP rate | Precision | Recall
BF Tree | Benign | 0.971 | 0.075 | 0.960 | 0.971
BF Tree | Malignant | 0.925 | 0.029 | 0.944 | 0.925
IBK | Benign | 0.980 | 0.079 | 0.958 | 0.980
IBK | Malignant | 0.921 | 0.020 | 0.961 | 0.921
SMO | Benign | 0.971 | 0.054 | 0.971 | 0.971
SMO | Malignant | 0.946 | 0.029 | 0.946 | 0.946

EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Predicted benign | Predicted malignant | Actual class
BF Tree | 431 | 13 | Benign
BF Tree | 18 | 221 | Malignant
IBK | 435 | 9 | Benign
IBK | 19 | 220 | Malignant
SMO | 431 | 13 | Benign
SMO | 13 | 226 | Malignant

Importance of the input variables

Variable | Chi-squared | Info Gain | Gain Ratio | Average | Importance rank
Clump Thickness | 378.08158 | 0.464 | 0.152 | 126.2325 | 8
Uniformity of Cell Size | 539.79308 | 0.702 | 0.300 | 180.2650 | 1
Uniformity of Cell Shape | 523.07097 | 0.677 | 0.272 | 174.6733 | 2
Marginal Adhesion | 390.0595 | 0.464 | 0.210 | 130.2445 | 7
Single Epithelial Cell Size | 447.86118 | 0.534 | 0.233 | 149.5427 | 5
Bare Nuclei | 489.00953 | 0.603 | 0.303 | 163.3052 | 3
Bland Chromatin | 453.20971 | 0.555 | 0.201 | 151.3219 | 4
Normal Nucleoli | 416.63061 | 0.487 | 0.237 | 139.1182 | 6
Mitoses | 191.9682 | 0.212 | 0.212 | 64.1227 | 9

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree
• The performance of SMO is high compared with the other classifiers
• The most important attribute for breast cancer survival is Uniformity of Cell Size

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012
• That paper introduced a more advanced idea, making a fusion between classifiers

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y. and Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya and K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), pp. 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey and Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W. N., Wolberg, W. H. and Mangasarian, O. L. "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–870.
[14] Chen, Y., Abraham, A. and Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1–3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 24: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Data Preprocessing

• Data in the real world is:
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• Quality decisions must be based on quality data. Quality measures include accuracy, completeness, consistency, timeliness, believability, value added, and accessibility.

AAST-Comp eng, slide 24, 04/07/2023

Preprocessing techniques

• Data cleaning
  – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  – Integration of multiple databases, data cubes, or files
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains a reduced representation in volume, but produces the same or similar analytical results
• Data discretization
  – Part of data reduction, with particular importance for numerical data
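Two of the steps above can be sketched in a few lines: mean-imputation for data cleaning and min-max scaling for data transformation. The toy attribute values below are made up for illustration and are not from the paper's dataset.

```python
# Illustrative sketch only: mean-imputation (data cleaning) and
# min-max normalization (data transformation) on a made-up attribute.

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Linearly rescale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, None, 35, 40]        # one missing attribute value
cleaned = fill_missing(ages)     # None becomes the mean, 100/3
scaled = min_max_scale(cleaned)  # 25 maps to 0.0, 40 maps to 1.0
```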

[Diagram: the performance-evaluation cycle – dataset → data preprocessing → feature selection → classification (with data mining tool selection) → results]

Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection

• Transforming a dataset by removing some of its columns:

  A1 A2 A3 A4 C  →  A2 A4 C
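The column-removal view of feature selection pictured above can be sketched directly. The column names A1-A4 and C follow the slide; the row values are made up.

```python
# Sketch of feature selection as column removal: keep A2, A4 and the class C.

def select_features(rows, header, keep):
    """Project every row onto the columns named in `keep`."""
    idx = [header.index(name) for name in keep]
    return [[row[i] for i in idx] for row in rows]

header = ["A1", "A2", "A3", "A4", "C"]
rows = [[1, 2, 3, 4, "yes"],
        [5, 6, 7, 8, "no"]]
reduced = select_features(rows, header, ["A2", "A4", "C"])
# reduced is [[2, 4, "yes"], [6, 8, "no"]]
```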

[Diagram: the performance-evaluation cycle, as before]

Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the model built on the training set of known categories

[Scatter diagram: points separated into Category "A" and Category "B" – classification (recognition, supervised classification)]

Classification

• Every day, all the time, we classify things
• E.g., crossing the street:
  – Is there a car coming?
  – At what speed?
  – How far is it to the other side?
  – Classification: safe to walk or not

Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model in classifying new data

Prediction:
• models continuous-valued functions, i.e., predicts unknown or missing values

Classification: A Two-Step Process

1. Model construction: describing a set of predetermined classes
   – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
   – The set of tuples used for model construction is the training set
   – The model is represented as classification rules, decision trees, or mathematical formulae
2. Model usage: classifying future or unknown objects
   – Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model
   – The test set is independent of the training set, otherwise over-fitting will occur
   – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
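The two slides above can be sketched end to end: the induced rule (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') is applied to the test set to estimate accuracy, then to the unseen tuple. The rule and tuples are from the slides; the evaluation loop is a generic sketch.

```python
# Apply the induced rule to the test set, then to the unseen tuple (Jeff).

def model(rank, years):
    """The classification rule learned from the training data."""
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [("Tom",     "Assistant Prof", 2, "no"),
            ("Merlisa", "Associate Prof", 7, "no"),
            ("George",  "Professor",      5, "yes"),
            ("Joseph",  "Assistant Prof", 7, "yes")]

correct = sum(model(rank, years) == label
              for _, rank, years, label in test_set)
accuracy = correct / len(test_set)  # Merlisa is misclassified: 3/4 = 0.75
jeff = model("Professor", 4)        # the unseen tuple is classified "yes"
```

Note that Merlisa is misclassified (the rule predicts "yes", the known label is "no"), which is exactly the accuracy estimation step described above.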

Classification

• is a data mining (machine learning) technique used to predict group membership for data instances
• Classification analysis is the organization of data into given classes
• These approaches normally use a training set where all objects are already associated with known class labels
• The classification algorithm learns from the training set and builds a model
• Many classification models are used to classify new objects

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

Quality of a classifier

• Quality is calculated with respect to the lowest computing time
• The quality of a given model can be described by a confusion matrix
• The confusion matrix shows the predictive ability of the method for each new entry
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class
• Thus the diagonal elements represent correctly classified compounds
• The cross-diagonal elements represent misclassified compounds

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain, or to improve predictions compared to unclassified data.

Classification Techniques

Classification techniques include:
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine (SVM) classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example

[Scatter plot: humidity vs. temperature; points marked "play tennis" and "do not play tennis"]

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c
• Some methods find a separating hyperplane, but not the optimal one
• A Support Vector Machine (SVM) finds an optimal solution:
  – It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

[Diagram: the line ax + by − c = 0 represents the decision boundary]

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane 'far' from the data

SVM – Support Vector Machines

[Diagram: support vectors on two candidate hyperplanes – one with a small margin, one with a large margin]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane
• The decision function is fully specified by a subset of the training samples, the support vectors
• Solving an SVM is a quadratic programming problem
• Seen by many as the most successful current text classification method

[Diagram: support vectors defining the maximized margin, with a narrower margin shown for comparison]

Non-Separable Case

[Diagram: the non-separable case, handled via the Lagrangian trick]

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn: learned in batch mode using quadratic programming techniques
• Using kernels, it can learn very complex functions
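For a linear decision function f(x) = w·x + b, the quantities above are easy to make concrete: points are classified by the sign of f(x), and for a canonical separating hyperplane (|f| = 1 on the support vectors) the margin is 2 / ||w||. The weights below are picked by hand for illustration; a real SVM obtains them by solving the quadratic program.

```python
# Hand-picked linear decision function, for illustration only.
import math

w, b = (1.0, 1.0), -3.0  # the hyperplane x + y - 3 = 0

def decision(x):
    """f(x) = w.x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """The predicted class is the sign of the decision function."""
    return 1 if decision(x) >= 0 else -1

# Margin of a canonical hyperplane: 2 / ||w||.
margin = 2 / math.sqrt(sum(wi * wi for wi in w))
```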

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are"
• A new example is assigned to the most common class among the K examples that are most similar to it

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set
• Select the K examples nearest to E in the training set
• Assign E to the most common class among its K nearest neighbors

[Diagram: "response" and "no response" points around the query example; predicted class: Response]

Distance Between Neighbors

• Each example is represented with a set of numerical attributes
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

  D(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[ (35 − 41)² + (95 − 215)² + (3 − 2)² ]

Instance Based Learning

• No model is built: store all training examples
• Any processing is delayed until a new instance must be classified

[Diagram: "response" and "no response" neighbors; predicted class: Respond]

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distance from David (Age = 37, Income = 50K, No. of cards = 2):
John:   sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²]  = 15.16
Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²]  = 15
Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.23
Tom:    sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122
Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²]  = 15.74

The three nearest neighbors (Rachel: Yes, John: No, Nellie: Yes) give the majority class, so David's predicted response is Yes.
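The worked example above can be reproduced in a few lines; the attribute values are those on the slide (income in $K).

```python
# 3-NN classification of David against the five labelled customers.
import math
from collections import Counter

training = [  # (name, age, income in $K, no. of credit cards, response)
    ("John",   35,  35, 3, "No"),
    ("Rachel", 22,  50, 2, "Yes"),
    ("Hannah", 63, 200, 1, "No"),
    ("Tom",    59, 170, 1, "No"),
    ("Nellie", 25,  40, 4, "Yes"),
]
david = (37, 50, 2)

def dist(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

nearest = sorted(training, key=lambda row: dist(row[1:4], david))[:3]
votes = [label for *_, label in nearest]          # Rachel, John, Nellie
prediction = Counter(votes).most_common(1)[0][0]  # majority class: "Yes"
```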

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible: easy to explain the prediction
• Robust to noisy data, by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all the examples
• Takes more time to classify a new example than with a model (the distance from the new example to every stored example must be calculated and compared)

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
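The same calculation in code; the payoffs and prior probabilities are those from the slides.

```python
# Expected-value comparison of the two contracts.

probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

def expected_value(decision):
    """Sum of payoff times prior probability over the states of nature."""
    return sum(probs[s] * payoffs[decision][s] for s in probs)

ev = {d: expected_value(d) for d in payoffs}
best = max(ev, key=ev.get)  # the movie contract, EV = $960,000
```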

Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles (Ο)
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes
• Create the tree from left to right
• Solve the tree from right to left

Example Decision Tree

[Diagram: a decision node (square) branching to Decision 1 and Decision 2; a chance node (circle) branching to Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Diagram: a decision node with branches "Sign with Movie Co." and "Sign with TV Network". The movie branch leads to a chance node with payoffs $200,000 (small box office), $1,000,000 (medium), and $3,000,000 (large); the network branch leads to a chance node paying $900,000 in every case]

Jenny Lind Decision Tree

[Diagram: the same tree with probabilities attached – 0.3 (small), 0.6 (medium), and 0.1 (large) on each chance branch – and the expected return (ER) still to be computed at each chance node]

Jenny Lind Decision Tree - Solved

[Diagram: the solved tree. With probabilities 0.3, 0.6, and 0.1, the movie chance node has ER = $960,000 and the TV network node ER = $900,000, so the decision node takes ER = $960,000: sign with the movie company]

[Diagram: the performance-evaluation cycle, as before]

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default is 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
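The splitting scheme described above can be sketched as follows (fold bookkeeping only; the classifier trained on each split is out of scope here).

```python
# Index bookkeeping for k-fold cross-validation.

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for each of the k folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

folds = list(k_fold_indices(150, k=10))
# 10 folds: each test piece holds 15 instances, each training set 135
```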

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques in the Weka software are compared, along with the comparison results.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  – Are usually not harmful
  – Rarely invade the tissues around them
  – Don't spread to other parts of the body
  – Can be removed, and usually don't grow back
• Malignant tumors:
  – May be a threat to life
  – Can invade nearby organs and tissues (such as the chest wall)
  – Can spread to other parts of the body
  – Often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (continued)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (continued)

• Bellaachia et al. used naïve Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, due to the fact that they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (continued)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (continued)

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work, and the performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of the multilayer perceptron and sequential minimal optimization.

BACKGROUND (continued)

• C. Kaewchinporn presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository: data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg
• 2 classes (malignant and benign) and 9 integer-valued attributes
• breast-cancer-wisconsin has 699 instances
• We removed the 16 instances with missing values from the dataset, to construct a new dataset with 683 instances
• Class distribution as published: benign 458 (65.5%), malignant 241 (34.5%)
• Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%)

Attribute                    Domain
Sample Code Number           ID number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Screenshots of the Weka experiment setup and output]

importance of the input variables

Attribute value counts per domain value (1-10):

Attribute                   |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             | 139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     | 373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    | 346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           | 393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |  44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 | 402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             | 150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             | 432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     | 563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         |2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in seconds) | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN)
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP)
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN)
• True positive (TP) = number of positive samples correctly predicted
• False negative (FN) = number of positive samples wrongly predicted
• False positive (FP) = number of negative samples wrongly predicted as positive
• True negative (TN) = number of negative samples correctly predicted
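Applied to the SMO confusion matrix reported in these results (431 benign correctly predicted, 13 benign missed, 13 malignant wrongly predicted as benign, 226 malignant correctly predicted, with benign taken as the positive class), the definitions give:

```python
# Sensitivity, specificity and accuracy from the SMO confusion matrix.

tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                # TPR: 431/444, about 0.971
specificity = tn / (tn + fp)                # TNR: 226/239, about 0.946
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 657/683, about 0.9619
```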

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.96      | 0.971  | Benign
           | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.98    | 0.079   | 0.958     | 0.98   | Benign
           | 0.921   | 0.02    | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
           | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree    | 431              | 13                  | Benign
           | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
           | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
           | 13               | 226                 | Malignant

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on the paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012
• That paper introduced a more advanced idea and makes a fusion between classifiers

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 25: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Preprocessing techniques

bull Data cleaningndash Fill in missing values smooth noisy data identify or remove outliers and

resolve inconsistencies

bull Data integrationndash Integration of multiple databases data cubes or files

bull Data transformationndash Normalization and aggregation

bull Data reductionndash Obtains reduced representation in volume but produces the same or

similar analytical results

bull Data discretizationndash Part of data reduction but with particular importance especially for

numerical data

AAST-Comp eng 2504072023

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Feature selection

Finding a feature subset that has the most discriminative information from the original feature space.

The objectives of feature selection are:
• Improving the prediction performance of the predictors
• Providing faster and more cost-effective predictors
• Providing a better understanding of the underlying process that generated the data

Feature Selection

• Transforming a dataset by removing some of its columns:

A1 A2 A3 A4 C → A2 A4 C
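This column-removal view of feature selection can be sketched in a few lines of Python (the attribute names A1–A4 and class C, and the values, are illustrative):

```python
# Feature selection as column removal: keep only the selected attributes.
header = ["A1", "A2", "A3", "A4", "C"]
rows = [
    [5, 1, 3, 1, "benign"],
    [8, 10, 7, 9, "malignant"],
]

selected = ["A2", "A4", "C"]                      # columns chosen by a selection method
keep = [header.index(name) for name in selected]  # -> [1, 3, 4]

reduced = [[row[i] for i in keep] for row in rows]
print(reduced)  # [[1, 1, 'benign'], [10, 9, 'malignant']]
```

In practice the `selected` list would come from a ranking method such as chi-squared or information gain, as in the feature-importance tables later in this deck.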


Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the model built on the training set (known categories)

[Diagram: two labeled clusters, Category "A" and Category "B" — Classification (Recognition), i.e. supervised classification]

Classification

• Every day, all the time, we classify things
• E.g. crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: safe to walk or not?

Classification vs. Prediction

Classification:
– predicts categorical class labels (discrete or nominal)
– classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Prediction:
– models continuous-valued functions, i.e. predicts unknown or missing values

Classification — A Two-Step Process

Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
– The set of tuples used for model construction is the training set
– The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects
– Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model
– The test set is independent of the training set, otherwise over-fitting will occur
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classification algorithm → Classifier (Model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
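The two steps above can be sketched in Python (the learned rule applied to the testing data and to the unseen tuple; the code layout is illustrative):

```python
# Step 1 (model construction): the rule learned from the training set.
def tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2 (model usage): estimate accuracy on the independent test set.
test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]
correct = sum(tenured(r, y) == label for _, r, y, label in test_set)
print(correct / len(test_set))  # Merlisa is misclassified as 'yes', so accuracy is 0.75

# If the accuracy is acceptable, classify the unseen tuple (Jeff, Professor, 4).
print(tenured("Professor", 4))  # yes
```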

Classification

• is a data mining (machine learning) technique used to predict group membership for data instances
• Classification analysis is the organization of data into given classes
• These approaches normally use a training set where all objects are already associated with known class labels
• The classification algorithm learns from the training set and builds a model
• Many classification models are used to classify new objects

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

Quality of a classifier

• Quality will be calculated with respect to the lowest computing time
• The quality of a given model can be described by a confusion matrix
• The confusion matrix shows the predictive ability of the method
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class
• Thus the diagonal elements represent correctly classified compounds
• The cross-diagonal elements represent misclassified compounds
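Building such a matrix from predicted and actual labels can be sketched as follows (the five labels are made-up illustration data; rows are predicted classes, columns actual, as described above):

```python
# Build a 2x2 confusion matrix: rows = predicted class, columns = actual class.
classes = ["benign", "malignant"]
actual    = ["benign", "benign", "malignant", "malignant", "benign"]
predicted = ["benign", "malignant", "malignant", "malignant", "benign"]

matrix = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    matrix[classes.index(p)][classes.index(a)] += 1

print(matrix)                           # [[2, 0], [1, 2]]
correct = matrix[0][0] + matrix[1][1]   # diagonal = correctly classified
print(correct / len(actual))            # 0.8
```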

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

[Diagram: classification techniques — Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK]

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyperplane and the data set.

Tennis example

[Figure: points plotted by Humidity vs. Temperature, labeled "play tennis" / "do not play tennis"]

Linear classifiers: which hyperplane?

• Lots of possible solutions for a, b, c
• Some methods find a separating hyperplane, but not the optimal one
• Support Vector Machine (SVM) finds an optimal solution:
– it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
– one intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

[Figure: a separating line as the decision boundary, ax + by − c = 0]

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

SVM – Support Vector Machines

[Figure: support vectors shown for a small-margin and a large-margin separating hyperplane]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane
• The decision function is fully specified by a subset of the training samples, the support vectors
• Solving SVMs is a quadratic programming problem
• Seen by many as the most successful current text classification method

[Figure: support vectors on the margin; the maximum-margin hyperplane vs. a narrower margin]

Non-Separable Case

[Figure: data that cannot be separated by a hyperplane without error]

The Lagrangian trick

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions
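To make the margin idea concrete, here is a small sketch with a made-up hyperplane w·x + b = 0 and toy points (not the paper's data): the geometric margin of a labeled point is y(w·x + b)/||w||, and the margin of the classifier is the minimum over the training set — the quantity SVM training maximizes.

```python
import math

# Toy separating hyperplane w.x + b = 0 (illustrative values, not learned).
w = [1.0, 1.0]
b = -3.0

# Points with labels y in {+1, -1}; all satisfy y * (w.x + b) > 0 (separated).
points = [([1.0, 1.0], -1), ([1.0, 0.0], -1), ([3.0, 2.0], +1), ([2.0, 3.0], +1)]

norm_w = math.hypot(*w)
margins = [y * (w[0] * x[0] + w[1] * x[1] + b) / norm_w for x, y in points]
print(min(margins))  # the geometric margin; SVM training chooses w, b to maximize this
```

A solver would search over w and b; this sketch only evaluates the margin of one fixed hyperplane.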

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set
2. Select the K examples nearest to E in the training set
3. Assign E to the most common class among its K nearest neighbors

[Figure: a new point classified by the majority class of its K nearest neighbors — classes "Response" / "No response"]

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)² + (95K − 215K)² + (3 − 2)²]

Instance Based Learning

• No model is built: store all training examples
• Any processing is delayed until a new instance must be classified

[Figure: the stored examples ("Response" / "No response") and a new point to classify]

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel     22    50           2           Yes        sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah     63    200          1           No         sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom        59    170          1           No         sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie     25    40           4           Yes        sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74
David      37    50           2           Yes (3 nearest: Rachel, John, Nellie → majority Yes)
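The worked example above can be reproduced with a minimal KNN sketch in Python (Euclidean distance over the three attributes, majority vote over the k nearest):

```python
import math
from collections import Counter

def knn_classify(new_example, training_set, k=3):
    """Assign new_example to the most common class among its k nearest neighbors."""
    # Step 1: distance from the new example to every training example.
    distances = [(math.dist(new_example, features), label)
                 for features, label in training_set]
    # Step 2: the k nearest examples.
    nearest = sorted(distances)[:k]
    # Step 3: majority vote among their classes.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# (age, income in K, no. of cards) -> response, from the table above.
training = [((35, 35, 3), "No"), ((22, 50, 2), "Yes"), ((63, 200, 1), "No"),
            ((59, 170, 1), "No"), ((25, 40, 4), "Yes")]
print(knn_classify((37, 50, 2), training))  # David's 3 nearest are Rachel, John, Nellie -> Yes
```

Note that income dominates the distance here; in practice the attributes would usually be normalized first (see the preprocessing slide).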

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible – easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small box office) = 0.3
– P(Medium box office) = 0.6
– P(Large box office) = 0.1

Jenny Lind – Payoff Table (states of nature across the columns)

Decision                  Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company   $200,000           $1,000,000          $3,000,000
Sign with TV Network      $900,000           $900,000            $900,000
Prior probabilities       0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV_UII (the best expected value)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
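The same expected-value arithmetic, as a tiny sketch:

```python
# Expected value of each decision: sum of payoff x probability over the states.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

ev = {d: sum(probs[s] * p[s] for s in probs) for d, p in payoffs.items()}
print(ev)                   # {'movie': 960000.0, 'tv': 900000.0}
print(max(ev, key=ev.get))  # movie
```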

Decision Trees

• Three types of "nodes":
– Decision nodes – represented by squares
– Chance nodes – represented by circles
– Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes
• Create the tree from left to right
• Solve the tree from right to left

Example Decision Tree

[Diagram: a decision node (square) branching into Decision 1 and Decision 2, and a chance node (circle) branching into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Diagram: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with outcomes Small / Medium / Large Box Office and payoffs $200,000 / $1,000,000 / $3,000,000 (movie) and $900,000 / $900,000 / $900,000 (TV)]

[Diagram: the same tree annotated with probabilities 0.3 / 0.6 / 0.1 on each chance branch, with the expected return (ER) at each chance node still to be computed]

Jenny Lind Decision Tree – Solved

[Diagram: the solved tree — movie branch: ER = 0.3 × $200,000 + 0.6 × $1,000,000 + 0.1 × $3,000,000 = $960,000; TV branch: ER = $900,000; the decision node keeps the best branch, ER = $960,000 (sign with the movie company)]


Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all possibilities and average
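The fold rotation described above can be sketched in plain Python (no ML library; what to train and evaluate on each split is left out):

```python
def k_fold_splits(data, k=10):
    """Yield (train, test) pairs: each fold serves as the test set exactly once."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal-sized pieces
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
sizes = [(len(tr), len(te)) for tr, te in k_fold_splits(data, k=10)]
print(sizes[0])  # (18, 2): train on 9 pieces, test on the remaining one
```

Averaging the per-fold accuracies then gives the cross-validated estimate Weka reports.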


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract

The aim of this paper is to investigate the performance of different classification techniques and to develop accurate prediction models for breast cancer using data mining techniques.

Three classification techniques are compared in the Weka software; the comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and to lifestyle changes such as women having fewer children.

Benign tumors:
• are usually not harmful
• rarely invade the tissues around them
• don't spread to other parts of the body
• can be removed, and usually don't grow back

Malignant tumors:
• may be a threat to life
• can invade nearby organs and tissues (such as the chest wall)
• can spread to other parts of the body
• often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.

Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.

Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

Bellaachi et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years, and the other for those patients who died before 5 years.

Vikas Chaurasia et al. used Naive Bayes and a J48 Decision Tree to predict the survivability of heart disease patients.

Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.

B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg
• 2 classes (malignant and benign) and 9 integer-valued attributes
• breast-cancer-Wisconsin has 699 instances
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances
• Class distribution: benign: 458 (65.5%), malignant: 241 (34.5%)
• Note: since 2 malignant and 14 benign instances were excluded, the percentages above are wrong; the correct distribution is benign: 444 (65%), malignant: 239 (35%)

Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1–10
Uniformity of Cell Size       1–10
Uniformity of Cell Shape      1–10
Marginal Adhesion             1–10
Single Epithelial Cell Size   1–10
Bare Nuclei                   1–10
Bland Chromatin               1–10
Normal Nucleoli               1–10
Mitoses                       1–10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9
• WEKA is a collection of machine learning algorithms for data mining tasks
• The algorithms can either be applied directly to a dataset or called from your own Java code
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection
• It is also well suited for developing new machine learning schemes
• WEKA is open source software, issued under the GNU General Public License

EXPERIMENTAL RESULTS

[Figures: result screenshots (slides 88–89)]

Importance of the input variables: distribution of attribute values (1–10) over the 683 instances

Attribute                      1     2     3     4     5     6     7     8     9    10   Sum
Clump Thickness               139    50   104    79   128    33    23    44    14    69   683
Uniformity of Cell Size       373    45    52    38    30    25    19    28     6    67   683
Uniformity of Cell Shape      346    58    53    43    32    29    30    27     7    58   683
Marginal Adhesion             393    58    58    33    23    21    13    25     4    55   683
Single Epithelial Cell Size    44   376    71    48    39    40    11    21     2    31   683
Bare Nuclei                   402    30    28    19    30     4     8    21     9   132   683
Bland Chromatin*              150   160   161    39    34     9    71    28    11    20   683
Normal Nucleoli               432    36    42    18    19    22    16    23    15    60   683
Mitoses                       563    35    33    12     6     3     9     8     0    14   683
Sum                          2843   850   605   333   346   192   207   233    77   516

(* printed as a second "Bare Nuclei" row in the original, which is likely a typo)

EXPERIMENTAL RESULTS

Evaluation criteria                 BF Tree   IBK     SMO
Time to build model (s)             0.97      0.02    0.33
Correctly classified instances      652       655     657
Incorrectly classified instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN)
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP)
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN)
• True positive (TP) = number of positive samples correctly predicted
• False negative (FN) = number of positive samples wrongly predicted
• False positive (FP) = number of negative samples wrongly predicted as positive
• True negative (TN) = number of negative samples correctly predicted
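Applying these definitions to the SMO confusion matrix reported below (431 benign and 226 malignant correctly classified, 13 misclassified each way, with benign taken as the positive class) reproduces the 96.19% accuracy figure:

```python
# SMO confusion matrix from the paper, with benign as the positive class.
tp, fn = 431, 13   # benign correctly / wrongly predicted
fp, tn = 13, 226   # malignant wrongly predicted as benign / correctly predicted

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3))  # 0.971
print(round(specificity, 3))  # 0.946
print(round(accuracy, 4))     # 0.9619
```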

EXPERIMENTAL RESULTS

Classifier   TP rate   FP rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9

(The Average column is the mean of the three scores; attributes are ranked by it.)

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277–0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] US Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Set. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24): 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street WN, Wolberg WH, Mangasarian OL. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1–3): 305–313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 26: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Results

Data preprocessing

Feature selection

Classification

Selection tool datamining

Performance evaluation Cycle

Dataset

Finding a feature subset that has the most discriminative information from the original feature space

The objective of feature selection is bull Improving the prediction performance of the

predictorsbull Providing a faster and more cost-effective

predictorsbull Providing a better understanding of the underlying

process that generated the data

Feature selection

AAST-Comp eng 2704072023

Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1): Model Construction

Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classification algorithms produce the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
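The two-step process above can be sketched in a few lines. This is a hypothetical, hand-coded version of the induced rule from the slides, not actual Weka output:

```python
# Sketch of the two-step classification process:
# (1) a model induced from the training set, (2) applying it to unseen data.

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def tenured(rank, years):
    """The induced rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 1: estimate accuracy by comparing known labels with the model's output
# (here the rule fits the training set perfectly, 6/6).
accuracy = sum(tenured(r, y) == label for _, r, y, label in training) / len(training)

# Step 2: use the model on unseen data.
prediction = tenured("Professor", 4)   # Jeff -> "yes"
```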

Classification

• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

Quality of a Classifier

• Quality is calculated with respect to lowest computing time.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds,
• and the off-diagonal elements represent misclassified compounds.
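A confusion matrix with this row/column convention can be built directly from paired predictions and actual labels. The labels below are made up for illustration:

```python
# Build a confusion matrix: rows = predicted class, columns = actual class.
def confusion_matrix(predicted, actual, classes):
    m = {p: {a: 0 for a in classes} for p in classes}
    for p, a in zip(predicted, actual):
        m[p][a] += 1
    return m

predicted = ["benign", "benign", "malignant", "benign", "malignant"]
actual    = ["benign", "malignant", "malignant", "benign", "malignant"]
cm = confusion_matrix(predicted, actual, ["benign", "malignant"])

# Diagonal = correctly classified; off-diagonal = misclassified.
correct = cm["benign"]["benign"] + cm["malignant"]["malignant"]   # 4 of 5
```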

Classification Techniques

 Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
 The ultimate reason for doing classification is to increase understanding of the domain, or to improve predictions compared to unclassified data.

Classification Techniques (Cont.)

Common classification techniques include:
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

 SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
 Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.

Support Vector Machine (SVM)


Tennis Example

(Scatter plot of Humidity vs. Temperature; filled points = play tennis, open points = do not play tennis)

Linear Classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
 – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
 – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by − c = 0

(Ch. 15)

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane "far" from the data.
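A minimal sketch of a linear decision rule of the form ax + by − c = 0, with illustrative (not learned) coefficients; the geometric margin of a point is its distance to the boundary:

```python
import math

# A linear decision boundary ax + by - c = 0 (coefficients are illustrative only).
a, b, c = 1.0, 2.0, 4.0

def classify(x, y):
    """The sign of ax + by - c decides which side of the hyperplane a point is on."""
    return +1 if a * x + b * y - c >= 0 else -1

def distance_to_boundary(x, y):
    """Geometric margin of a point: |ax + by - c| / ||(a, b)||."""
    return abs(a * x + b * y - c) / math.hypot(a, b)

side = classify(3.0, 2.0)                  # 3 + 4 - 4 = 3 > 0 -> +1
margin = distance_to_boundary(3.0, 2.0)    # 3 / sqrt(5)
```

An SVM chooses a, b, c so that the smallest such margin over the training points is maximized.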

SVM – Support Vector Machines

(Figure: support vectors; small margin vs. large margin)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(Figure: support vectors; maximized vs. narrower margin. Sec. 15.1)

Non-Separable Case


The Lagrangian trick

SVM

 Relatively new concept.
 Nice generalization properties.
 Hard to learn: learned in batch mode using quadratic programming techniques.
 Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

 Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
 A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
 1. Calculate the distance between E and all examples in the training set.
 2. Select the K examples in the training set nearest to E.
 3. Assign E to the most common class among its K nearest neighbors.

(Figure: neighbors labeled Response / No response; class of E = Response)

Distance Between Neighbors

 Each example is represented by a set of numerical attributes.
 "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

Example:
 John: Age = 35, Income = 95K, No. of credit cards = 3
 Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]

Instance-Based Learning

 No model is built: store all training examples.
 Any processing is delayed until a new instance must be classified.

(Figure: neighbors labeled Response / No response; class of E = Response)

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distance from David:
 John:   sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²] = 15.16
 Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15
 Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.23
 Tom:    sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122
 Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes → David: Yes
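The 3-NN vote for David can be reproduced in a short script; feature tuples are (age, income in K, number of cards), taken from the example table:

```python
import math
from collections import Counter

# Training examples from the 3-NN slide: (name, (age, income_K, cards), response).
training = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(query, k=3):
    # Sort by distance to the query and vote among the k nearest neighbors.
    nearest = sorted(training, key=lambda ex: euclidean(ex[1], query))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
prediction = knn_predict(david)   # Rachel (15), John (15.17), Nellie (15.75) -> "Yes"
```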

Strengths and Weaknesses

Strengths:
 Simple to implement and use.
 Comprehensible: easy to explain the prediction.
 Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
 Needs a lot of space to store all examples.
 Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples).

Decision Tree

 – Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
 – The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
 – Small box office: $200,000
 – Medium box office: $1,000,000
 – Large box office: $3,000,000
• TV network payout:
 – Flat rate: $900,000
• Probabilities:
 – P(Small Box Office) = 0.3
 – P(Medium Box Office) = 0.6
 – P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decisions (rows) vs. States of Nature (columns):

                         Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior Probabilities      0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
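The same expected-value arithmetic, as code:

```python
# Expected-value computation for the Jenny Lind decision (from the payoff table).
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie_payoff = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv_payoff = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payoff):
    # Weight each state-of-nature payoff by its prior probability.
    return sum(probs[s] * payoff[s] for s in probs)

ev_movie = expected_value(movie_payoff)   # $960,000
ev_tv = expected_value(tv_payoff)         # $900,000
best = "movie" if ev_movie > ev_tv else "tv"
```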

Decision Trees

• Three types of "nodes":
 – Decision nodes: represented by squares (□)
 – Chance nodes: represented by circles (○)
 – Terminal nodes: represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

(Figure: a decision node (□) branching into Decision 1 and Decision 2; a chance node (○) branching into Event 1, Event 2, and Event 3)

Jenny Lind Decision Tree

(Figure: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with outcomes Small / Medium / Large box office, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 in every case for the TV contract)

Jenny Lind Decision Tree (with probabilities)

(Figure: the same tree with branch probabilities 0.3 / 0.6 / 0.1 on each chance branch, and expected-return (ER) labels to be computed at the chance nodes)

Jenny Lind Decision Tree – Solved

(Figure: with probabilities 0.3 / 0.6 / 0.1, the movie chance node has ER = $960,000 and the TV branch has ER = $900,000; the TV branch is pruned, so the best decision is the movie contract with ER = $960,000)

Results

(Data mining performance-evaluation cycle: dataset → data preprocessing → feature selection → classification → performance evaluation, using the selected data mining tool)

Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn

Cross-Validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
 – Split the data into 10 equal-sized pieces
 – Train on 9 pieces and test on the remainder
 – Do this for all possibilities and average
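The fold-splitting step can be sketched as follows (a plain index-based split, not Weka's implementation):

```python
# Sketch of k-fold cross-validation splitting (pure Python).
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs the remainder when n_samples % k != 0.
        end = n_samples if fold == k - 1 else start + fold_size
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

folds = list(k_fold_splits(150, k=10))   # e.g. the 150 instances above
# Every instance appears in exactly one test fold.
total_test = sum(len(test) for _, test in folds)   # 150
```

Per-fold accuracies are then averaged to give the cross-validated estimate.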

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The aim is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared using the Weka software, and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors
 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: denser breast tissue carries a higher risk
 Certain benign (non-cancerous) breast problems
 Lobular carcinoma in situ
 Menstrual periods

Risk factors (Cont.)
 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese

BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (Cont.)
 Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND (Cont.)
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (Cont.)
 Dr. S. Vijayarani et al. analyzed the performance of different classification-function techniques in data mining for predicting heart disease from the heart disease dataset. The classification-function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results illustrate that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

BACKGROUND (Cont.)
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text-representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
 Obtained from the UC Irvine machine learning repository.
 Data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-Wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, hence the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1–10
Uniformity of Cell Size       1–10
Uniformity of Cell Shape      1–10
Marginal Adhesion             1–10
Single Epithelial Cell Size   1–10
Bare Nuclei                   1–10
Bland Chromatin               1–10
Normal Nucleoli               1–10
Mitoses                       1–10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the Input Variables

Attribute-value frequencies (domain values 1–10):

Attribute                    1     2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50   104  79   128  33   23   44   14  69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28   6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25   4   55   683
Single Epithelial Cell Size  44    376  71   48   39   40   11   21   2   31   683
Bare Nuclei                  402   30   28   19   30   4    8    21   9   132  683
Bland Chromatin              150   160  161  39   34   9    71   28   11  20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15  60   683
Mitoses                      563   35   33   12   6    3    9    8    0   14   683
Sum                          2843  850  605  333  346  192  207  233  77  516

EXPERIMENTAL RESULTS

Evaluation Criteria               BF Tree  IBK    SMO
Time to build model (sec)         0.97     0.02   0.33
Correctly classified instances    652      655    657
Incorrectly classified instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS (Cont.)
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS (Cont.)

Classifier  TP     FP     Precision  Recall  Class
BF Tree     0.971  0.075  0.96       0.971   Benign
            0.925  0.029  0.944      0.925   Malignant
IBK         0.98   0.079  0.958      0.98    Benign
            0.921  0.02   0.961      0.921   Malignant
SMO         0.971  0.054  0.971      0.971   Benign
            0.946  0.029  0.946      0.946   Malignant

EXPERIMENTAL RESULTS (Cont.)

Confusion matrices (rows = actual class):

Classifier  Benign  Malignant  Class
BF Tree     431     13         Benign
            18      221        Malignant
IBK         435     9          Benign
            19      220        Malignant
SMO         431     13         Benign
            13      226        Malignant
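Plugging the SMO confusion matrix into the metric definitions (treating benign as the positive class) reproduces the reported numbers:

```python
# Metrics from the SMO confusion matrix above, with benign as the positive
# class: TP = 431, FN = 13, FP = 13, TN = 226.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                 # TPR, 431/444
specificity = tn / (tn + fp)                 # TNR, 226/239
accuracy = (tp + tn) / (tp + fp + tn + fn)   # (431 + 226) / 683

print(f"accuracy = {accuracy:.4f}")   # matches the 96.19% reported for SMO
```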

Importance of the Input Variables (Cont.)

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Importance (rank)
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.0595     0.464      0.210       130.2445    7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.9682     0.212      0.212       64.122733   9
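As a sketch, the Info Gain column is an entropy difference: the class entropy minus the expected entropy after splitting on the attribute. The toy dataset below is illustrative, not the actual Wisconsin data:

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr_index, label_index=-1):
    """Class entropy minus the weighted entropy after splitting on an attribute."""
    labels = [r[label_index] for r in rows]
    base = entropy(labels)
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# A perfectly separating toy attribute: value 1 -> benign, value 10 -> malignant.
toy = [(1, "benign"), (1, "benign"), (10, "malignant"), (10, "malignant")]
gain = info_gain(toy, 0)   # 1.0 bit, the maximum for a balanced binary class
```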

CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
 Using an updated version of Weka.
 Using another data mining tool.
 Using alternative algorithms and techniques.

Notes on the paper
 Spelling mistakes.
 No point of contact (e-mail).
 Wrong percentage calculation.
 Copying from old papers.
 Charts not clear.
 No contributions.

Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduces a more advanced idea and makes a fusion between classifiers.

References

[1] US Cancer Statistics Working Group, "United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, pp. 188–193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), pp. 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [httparchiveicsucieduml], Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian, "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W. N., Wolberg W. H., Mangasarian O. L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1–3), pp. 305–313.
[17] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V. N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.

Support Vector Machine (SVM)


Tennis example

[Scatter plot: Temperature vs. Humidity; one marker class = play tennis, the other = do not play tennis]

Linear classifiers: which hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by − c = 0

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

SVM – Support Vector Machines

[Figure: support vectors for a small-margin vs. a large-margin separating hyperplane]

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

[Figure: support vectors on the maximum-margin hyperplane, with a narrower-margin alternative]

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.
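As a toy sketch of the margin idea on invented 2-D points: a linear classifier trained on the hinge loss by subgradient descent. Real SVMs solve the quadratic program described above; this only approximates the same objective, and all data here are made up for the example.

```python
# Toy margin-based linear classifier (hinge loss, subgradient descent).
# A sketch of the SVM idea only -- not the quadratic-programming solver.
def train_linear_svm(points, labels, lr=0.01, lam=0.01, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # point inside the margin: push the plane away
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # otherwise only shrink w (margin maximization)
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

pts = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
ys  = [1, 1, -1, -1]
w, b = train_linear_svm(pts, ys)
```

On this separable toy set the learned plane separates both classes.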

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: a new point surrounded by 'Response' and 'No response' neighbors; the majority vote gives class 'Response']

Distance Between Neighbors

• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ i=1..n (xi − yi)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]

04072023 AAST-Comp eng 55
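The distance formula above, applied to John and Rachel, can be checked directly (a sketch: income is taken in thousands, which is an assumption about the slide's units):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john   = (35, 95, 3)   # age, income in K, number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)  # sqrt(36 + 14400 + 1)
```

Note how the income term dominates the sum; in practice the attributes would be normalized to a common scale first.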

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: a new point classified as class 'Respond' by its nearest neighbors]

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel     22    50           2           Yes        sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah     63    200          1           No         sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom        59    170          1           No         sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie     25    40           4           Yes        sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74
David      37    50           2           Yes (3-NN prediction)

04072023 AAST-Comp eng 58
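The 3-NN vote on David can be reproduced directly from the table (a sketch; income is again assumed to be in thousands):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# (age, income in K, number of credit cards) -> response
training = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

# rank training examples by distance from David, take the 3 nearest
ranked = sorted(training.values(), key=lambda item: euclidean(item[0], david))
nearest3 = [label for _, label in ranked[:3]]      # Rachel, John, Nellie
prediction = max(set(nearest3), key=nearest3.count)  # majority vote
```

Rachel (15), John (15.16) and Nellie (15.74) are the three nearest, so the vote is Yes/No/Yes and David is predicted to respond.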

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging over the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (it must calculate and compare the distance from the new example to all other examples).

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and Probabilities

• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small box office) = 0.3
– P(Medium box office) = 0.6
– P(Large box office) = 0.1

Jenny Lind – Payoff Table

Decisions                  Small Box Office   Medium Box Office   Large Box Office
Sign with movie company    $200,000           $1,000,000          $3,000,000
Sign with TV network       $900,000           $900,000            $900,000
Prior probabilities        0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
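The expected-value arithmetic above can be checked in a few lines (probabilities and payouts are exactly those on the slide):

```python
# Expected return for each decision, from the slide's payoff table
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie_payout = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}

ev_movie = sum(probs[s] * movie_payout[s] for s in probs)
ev_tv = 900_000  # flat rate, independent of box office
best = "movie" if ev_movie > ev_tv else "tv"
```

The movie contract wins, $960,000 vs. $900,000.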

Decision Trees

• Three types of "nodes":
– Decision nodes – represented by squares (□)
– Chance nodes – represented by circles (○)
– Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

[Figure: a decision node (square) branching into Decision 1 and Decision 2; Decision 1 leads to a chance node (circle) branching into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree

[Figure: 'Sign with Movie Co.' leads to a chance node with Small / Medium / Large box-office outcomes paying $200,000 / $1,000,000 / $3,000,000; 'Sign with TV Network' pays $900,000 in every case]

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

[Figure: the same tree with probabilities 0.3 / 0.6 / 0.1 attached to the box-office branches and an expected-return (ER) placeholder at each node]

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree – Solved

[Figure: the solved tree – the movie branch has ER = $960,000 and the TV branch ER = $900,000, so the movie branch is kept and the tree's value is ER = $960,000]

04072023 AAST-Comp eng 70

Results

[Figure: performance evaluation cycle – dataset → data preprocessing → feature selection → classification (data mining tool selection) → performance evaluation]

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
– Split the data into 10 equal-sized pieces.
– Train on 9 pieces and test on the remainder.
– Do this for all possibilities and average.

04072023 AAST-Comp eng 73
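The fold bookkeeping behind those three steps can be sketched as follows. This is a simplified version that assumes the instance count divides evenly by k; Weka's actual implementation additionally stratifies the folds by class.

```python
def k_fold_splits(n_items, k=10):
    """Split indices 0..n_items-1 into k disjoint test folds (train = the rest)."""
    indices = list(range(n_items))
    fold_size = n_items // k
    splits = []
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        splits.append((train, test))
    return splits

splits = k_fold_splits(150, k=10)  # e.g. 150 instances -> 10 folds of 15
```

Every instance appears in exactly one test fold, so averaging over the 10 runs uses each example for testing once.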

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
– Are usually not harmful.
– Rarely invade the tissues around them.
– Don't spread to other parts of the body.
– Can be removed, and usually don't grow back.
• Malignant tumors:
– May be a threat to life.
– Can invade nearby organs and tissues (such as the chest wall).
– Can spread to other parts of the body.
– Often can be removed, but sometimes grow back.

Risk factors

• Gender, age, genetic risk factors, family history.
• Personal history of breast cancer.
• Race (white or black).
• Dense breast tissue: denser breast tissue carries a higher risk.
• Certain benign (not cancer) breast problems.
• Lobular carcinoma in situ.
• Menstrual periods.

Risk factors (cont.)

• Breast radiation early in life.
• Treatment with the drug DES (diethylstilbestrol) during pregnancy.
• Not having children, or having them later in life.
• Certain kinds of birth control.
• Using hormone therapy after menopause.
• Not breastfeeding.
• Alcohol.
• Being overweight or obese.

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachi et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, due to the fact that they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the efficiency of the logistic classification function is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; the data come from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages no longer hold; the correct distribution is benign 444 (65%) and malignant 239 (35%).

04072023AAST-Comp eng85
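The corrected class percentages can be checked directly from the counts on the slide:

```python
# Instance counts after removing the 16 records with missing values
benign, malignant = 444, 239
total = benign + malignant                       # 683
benign_pct = round(100 * benign / total)         # 65
malignant_pct = round(100 * malignant / total)   # 35
```

So the corrected figures of 65% benign and 35% malignant are consistent with 683 total instances.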

04072023 AAST-Comp eng 86

Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1–10
Uniformity of Cell Size       1–10
Uniformity of Cell Shape      1–10
Marginal Adhesion             1–10
Single Epithelial Cell Size   1–10
Bare Nuclei                   1–10
Bland Chromatin               1–10
Normal Nucleoli               1–10
Mitoses                       1–10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Charts: comparison of the classifiers' results]

Importance of the input variables

Value                         1     2    3    4   5    6    7   8   9   10   Sum
Clump Thickness               139   50   104  79  128  33   23  44  14  69   683
Uniformity of Cell Size       373   45   52   38  30   25   19  28  6   67   683
Uniformity of Cell Shape      346   58   53   43  32   29   30  27  7   58   683
Marginal Adhesion             393   58   58   33  23   21   13  25  4   55   683
Single Epithelial Cell Size   44    376  71   48  39   40   11  21  2   31   683
Bare Nuclei                   402   30   28   19  30   4    8   21  9   132  683
Bland Chromatin               150   160  161  39  34   9    71  28  11  20   683
Normal Nucleoli               432   36   42   18  19   22   16  23  15  60   683
Mitoses                       563   35   33   12  6    3    9   8   0   14   683
Sum                           2843  850  605  333 346  192  207 233 77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (sec)          0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
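Applying these definitions to the SMO confusion matrix from the results tables (431 / 13 / 13 / 226, taking benign as the positive class) reproduces the reported 96.19% accuracy:

```python
def rates(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# SMO counts: 431 benign correct, 13 benign misclassified,
# 13 malignant misclassified, 226 malignant correct
sens, spec, acc = rates(tp=431, fn=13, fp=13, tn=226)
```

The per-class TP rates of 0.971 (benign) and 0.946 (malignant) in the table fall out of the same counts.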

EXPERIMENTAL RESULTS

Classifier   TP rate   FP rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Predicted Benign   Predicted Malignant   Actual class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance Rank
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.67332    2
Marginal Adhesion             390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9

04072023AAST-Comp eng96
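As a sketch of how the Info Gain column is produced: information gain is the drop in class entropy after splitting on an attribute. The toy rows below are invented for illustration, not drawn from the real dataset.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_gain(rows, labels, attr_index):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - remainder

# toy data: attribute 0 separates the classes perfectly, attribute 1 does not
rows = [(1, 1), (1, 2), (2, 1), (2, 2)]
labels = ["benign", "benign", "malignant", "malignant"]
```

A perfectly separating attribute gains the full 1 bit of class entropy; an uninformative one gains nothing, which is exactly the ordering the ranking table exploits.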

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on the paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and performs a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you

AAST-Comp eng


Feature Selection

bull Transforming a dataset by removing some of its columns

A1 A2 A3 A4 C A2 A4 C

04072023 AAST-Comp eng 28

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

[Figure: points labeled "Response" / "No response"; the new example is assigned to class "Response".]

Distance Between Neighbors

Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)^2 + (95K − 215K)^2 + (3 − 2)^2 ]

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

[Figure: points labeled "Response" / "No response"; the new instance is assigned to class "Response".]

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] = 15.16
Rachel     22    50           2           Yes        sqrt[(22−37)^2 + (50−50)^2 + (2−2)^2] = 15
Hannah     63    200          1           No         sqrt[(63−37)^2 + (200−50)^2 + (1−2)^2] = 152.23
Tom        59    170          1           No         sqrt[(59−37)^2 + (170−50)^2 + (1−2)^2] = 122.0
Nellie     25    40           4           Yes        sqrt[(25−37)^2 + (40−50)^2 + (4−2)^2] = 15.74
David      37    50           2           Yes (majority class of the 3 nearest: Rachel, John, Nellie)
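The 3-NN prediction for David can be reproduced in a few lines of plain Python. The data is the toy table from the slides; `predict_knn` is an illustrative helper, not code from the paper:

```python
import math
from collections import Counter

# Toy training set from the slides: (age, income in K, no. of cards) -> response
training = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]

def euclidean(x, y):
    # Euclidean distance between two attribute vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def predict_knn(example, data, k=3):
    # Sort the training set by distance to the new example,
    # then take the majority class among the k nearest.
    nearest = sorted(data, key=lambda row: euclidean(row[1], example))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
print(predict_knn(david, training))  # -> Yes
```

The three nearest neighbors are Rachel (15.0), John (≈15.17), and Nellie (≈15.75), so the majority vote is "Yes", matching the table.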

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data, by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

Jenny Lind – Payoff Table

Decisions                   States of Nature:
                            Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company     $200,000           $1,000,000          $3,000,000
Sign with TV Network        $900,000           $900,000            $900,000
Prior probabilities         0.3                0.6                 0.1

Using Expected Return Criteria

EV_movie = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (= EV_UII, or EV_Best)
EV_tv = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
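The expected-return arithmetic above is easy to check programmatically. A quick sketch, using the payoff and probability numbers from the slide:

```python
# Payoffs under each state of nature and their prior probabilities (from the slide)
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payoffs, probabilities):
    # EV = sum over states of P(state) * payoff(state)
    return sum(probabilities[s] * payoffs[s] for s in probabilities)

ev_movie = expected_value(movie, probs)  # $960,000 (up to float rounding)
ev_tv = expected_value(tv, probs)        # $900,000 (up to float rounding)
best = "movie" if ev_movie > ev_tv else "tv"
```

Since `ev_movie > ev_tv`, the movie contract wins under the expected-return criterion, exactly as the slide concludes.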

Decision Trees

• Three types of "nodes":
  – Decision nodes – represented by squares.
  – Chance nodes – represented by circles (Ο).
  – Terminal nodes – represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

[Figure: a decision node with branches "Decision 1" and "Decision 2"; one branch leads to a chance node with branches "Event 1", "Event 2", and "Event 3".]

Jenny Lind Decision Tree

[Figure: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with branches Small, Medium, and Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 on every branch for the TV contract.]

Jenny Lind Decision Tree

[Figure: the same tree annotated with the branch probabilities 0.3, 0.6, and 0.1 and an expected-return (ER) label at each chance node.]

Jenny Lind Decision Tree – Solved

[Figure: the same tree with the chance nodes evaluated — ER = $960,000 for the movie contract and ER = $900,000 for the TV contract — so the movie branch is kept and the TV branch pruned.]

Results

[Figure: performance-evaluation cycle — dataset → data preprocessing → feature selection → data mining tool selection → classification → performance evaluation.]

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces.
  – Train on 9 pieces and test on the remainder.
  – Do this for all possibilities and average.
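The 10-fold split described above can be sketched in a few lines. This is illustrative only: Weka performs the split internally, and real folds are usually shuffled and stratified by class:

```python
def k_fold_indices(n, k=10):
    # Split indices 0..n-1 into k contiguous, nearly equal-sized folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# For each fold: train on the other 9, test on this one, then average accuracy.
folds = k_fold_indices(150, 10)  # 10 folds of 15 instances each
```

Every instance lands in exactly one test fold, so each one is predicted exactly once over the 10 rounds.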

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, with comparison results.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

0407202376

Introduction

• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  – Are usually not harmful.
  – Rarely invade the tissues around them.
  – Don't spread to other parts of the body.
  – Can be removed and usually don't grow back.
• Malignant tumors:
  – May be a threat to life.
  – Can invade nearby organs and tissues (such as the chest wall).
  – Can spread to other parts of the body.
  – Often can be removed, but sometimes grow back.

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

04072023AAST-Comp eng

Risk factors

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

0407202379

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND

• Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND

• Dr. S. Vijayarani et al. analyzed the performance of different classification techniques in data mining for predicting heart disease from the heart disease dataset. The classification algorithms were applied and tested in this work; the performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the right ones are Benign: 444 (65%) and Malignant: 239 (35%).

04072023AAST-Comp eng85
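The corrected class percentages in the note can be verified directly (the counts are from the slide):

```python
benign, malignant = 444, 239             # class counts after removing missing-value rows
total = benign + malignant               # 683 instances
pct_benign = 100 * benign / total        # ~65.0%
pct_malignant = 100 * malignant / total  # ~35.0%
# The originally reported 458/241 split (65.5% / 34.5%) still counted the
# 16 removed instances, which is why the slide flags those figures as wrong.
```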

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 – 10
Uniformity of Cell Size       1 – 10
Uniformity of Cell Shape      1 – 10
Marginal Adhesion             1 – 10
Single Epithelial Cell Size   1 – 10
Bare Nuclei                   1 – 10
Bland Chromatin               1 – 10
Normal Nucleoli               1 – 10
Mitoses                       1 – 10
Class                         2 for Benign, 4 for Malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Figure: result charts over two slides; the underlying numbers appear in the tables that follow.]

importance of the input variables

Domain                        1     2    3    4    5    6    7    8    9   10   Sum
Clump Thickness               139   50   104  79   128  33   23   44   14  69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28   6   67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27   7   58   683
Marginal Adhesion             393   58   58   33   23   21   13   25   4   55   683
Single Epithelial Cell Size   44    376  71   48   39   40   11   21   2   31   683
Bare Nuclei                   402   30   28   19   30   4    8    21   9   132  683
Bland Chromatin               150   160  161  39   34   9    71   28   11  20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15  60   683
Mitoses                       563   35   33   12   6    3    9    8    0   14   683
Sum                           2843  850  605  333  346  192  207  233  77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (s)            0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
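Applying these definitions to, e.g., the SMO confusion matrix reported in the results (431 / 13 / 13 / 226, with benign taken as the positive class) reproduces the 96.19% accuracy figure. A small sketch:

```python
# SMO confusion matrix from the results (benign treated as positive)
tp, fn = 431, 13   # benign correctly / wrongly predicted
fp, tn = 13, 226   # malignant wrongly / correctly predicted

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall fraction correct

print(f"{accuracy:.2%}")  # -> 96.19%
```

The same three formulas applied to the BF Tree and IBK matrices give the other accuracy figures in the comparison table.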

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.96        0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.98    0.079   0.958       0.98     Benign
             0.921   0.02    0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant

importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Rank (importance)
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9

04072023AAST-Comp eng96

CONCLUSION

• The accuracy of the classification techniques was evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

0407202398

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

AAST-Comp eng

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

04072023AAST-Comp eng99

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277–0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [httparchiveicsucieduml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305–313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023105

Thank you

AAST-Comp eng

Page 29: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Supervised Learningbull Supervision The training data (observations measurements etc) are

accompanied by labels indicating the class of the observationsbull New data is classified based on the model built on training setknown categories

AAST-Comp eng

Category ldquoArdquo

Category ldquoBrdquoClassification (Recognition) (Supervised Classification)

3004072023

Classificationbull Everyday all the time we classify

thingsbull Eg crossing the street

ndash Is there a car comingndash At what speedndash How far is it to the other sidendash Classification Safe to walk or not

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification predicts categorical class labels (discrete or

nominal) classifies data (constructs a model) based on

the training set and the values (class labels) in a classifying attribute and uses it in classifying new data

Prediction models continuous-valued functions ie

predicts unknown or missing values

Classification vs Prediction

04072023 AAST-Comp eng 33

ClassificationmdashA Two-Step Process

Model construction describing a set of predetermined classes

Each tuplesample is assumed to belong to a predefined class as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules decision trees or mathematical formulae

Model usage for classifying future or unknown objects Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set otherwise over-fitting will occur

If the accuracy is acceptable use the model to classify data tuples whose class labels are not known

04072023 AAST-Comp eng 34

Classification Process (1) Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = lsquoprofessorrsquoOR years gt 6THEN tenured = lsquoyesrsquo

Classifier(Model)

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions                   States of Nature
                            Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company     $200,000           $1,000,000          $3,000,000
Sign with TV Network        $900,000           $900,000            $900,000
Prior probabilities         0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
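The expected-value calculation above is a one-liner in code; this small sketch just re-states the slide's payouts and probabilities:

```python
# Expected monetary value of each decision, using the payouts and
# probabilities from the slides.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payout):
    # Sum of probability-weighted payouts over all states of nature
    return sum(probs[s] * payout[s] for s in probs)

ev_movie = expected_value(movie)   # $960,000
ev_tv = expected_value(tv)         # $900,000
best = "movie" if ev_movie > ev_tv else "tv"
```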

Decision Trees

• Three types of "nodes":
  - Decision nodes: represented by squares
  - Chance nodes: represented by circles
  - Terminal nodes: represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Slide figure: a decision node (square) branches into Decision 1 and Decision 2; Decision 1 leads to a chance node (circle) with branches Event 1, Event 2 and Event 3.)

Jenny Lind Decision Tree

(Slide figure: a decision node with two branches. "Sign with Movie Co." leads to a chance node with outcomes Small ($200,000), Medium ($1,000,000) and Large ($3,000,000) box office; "Sign with TV Network" leads to a chance node paying $900,000 for every outcome.)

Jenny Lind Decision Tree (with probabilities)

(Slide figure: the same tree annotated with branch probabilities 0.3, 0.6 and 0.1 at each chance node, and an expected return (ER) to be computed at each node.)

Jenny Lind Decision Tree - Solved

(Slide figure: with probabilities 0.3/0.6/0.1, the ER at the movie chance node is $960,000 and at the TV chance node $900,000; the best decision therefore carries ER = $960,000.)

Results

Performance evaluation cycle: dataset → data preprocessing → feature selection → classification (with the selected data mining tool) → performance evaluation.

Evaluation Metrics

                       Predicted as healthy   Predicted as unhealthy
Actual healthy         tp                     fn
Actual not healthy     fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
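The three steps above can be sketched from scratch in Python (a minimal illustration; `train_fn` and the majority-class toy classifier are hypothetical stand-ins, not the paper's classifiers):

```python
# Minimal k-fold cross-validation: split the data into k pieces, train on
# k-1 of them, test on the held-out piece, and average the accuracies.
def k_fold_indices(n, k=10):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, labels, train_fn, k=10):
    # train_fn takes (examples, labels) and returns a predict(example) function
    accuracies = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(data) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(data[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)

# Toy example: a "classifier" that always predicts the majority class
def majority_trainer(xs, ys):
    winner = max(set(ys), key=ys.count)
    return lambda x: winner

toy_x = list(range(20))
toy_y = ["a"] * 15 + ["b"] * 5
acc = cross_validate(toy_x, toy_y, majority_trainer)
```

In practice Weka performs this split internally; the sketch only makes the train-on-9/test-on-1 loop explicit.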

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is developing accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed, and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors

• Gender, age
• Genetic risk factors, family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue: women with denser breast tissue have a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (continued)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict survivability for breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (continued)

• Bellaachia et al. used naïve Bayes, a decision tree and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, due to the fact that they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict survivability for heart disease patients.

BACKGROUND (continued)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (continued)

• S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in that work; the performance factors used for analyzing the efficiency of the algorithms were classification accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (continued)

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                      Domain
Sample Code Number             Id number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

(Slides 88-89: screenshots of the Weka runs; only the tabulated numbers that follow survive in the text.)

importance of the input variables

Value:                         1     2     3     4     5     6     7     8     9    10    Sum
Clump Thickness              139    50   104    79   128    33    23    44    14    69    683
Uniformity of Cell Size      373    45    52    38    30    25    19    28     6    67    683
Uniformity of Cell Shape     346    58    53    43    32    29    30    27     7    58    683
Marginal Adhesion            393    58    58    33    23    21    13    25     4    55    683
Single Epithelial Cell Size   44   376    71    48    39    40    11    21     2    31    683
Bare Nuclei                  402    30    28    19    30     4     8    21     9   132    683
Bare Nuclei [sic]            150   160   161    39    34     9    71    28    11    20    683
Normal Nucleoli              432    36    42    18    19    22    16    23    15    60    683
Mitoses                      563    35    33    12     6     3     9     8     0    14    683
Sum                         2843   850   605   333   346   192   207   233    77   516

(The second "Bare Nuclei" row is repeated in the source; it likely corresponds to Bland Chromatin, the only attribute otherwise missing from the table.)

EXPERIMENTAL RESULTS

Evaluation Criteria                   BF Tree   IBK     SMO
Time to build model (seconds)         0.97      0.02    0.33
Correctly classified instances        652       655     657
Incorrectly classified instances      31        28      26
Accuracy (%)                          95.46     95.90   96.19

EXPERIMENTAL RESULTS (continued)

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
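The definitions above can be checked directly against the paper's SMO confusion matrix (431, 13 / 13, 226, with benign taken as the positive class); this sketch reproduces the reported 96.19% accuracy:

```python
# Sensitivity, specificity and accuracy computed exactly as defined above.
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO confusion matrix from the paper, benign = positive class
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
# acc * 100 ≈ 96.19, matching the accuracy reported for SMO
```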

EXPERIMENTAL RESULTS (continued)

Classifier   TP rate   FP rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS (continued) - confusion matrices

Classifier   Classified Benign   Classified Malignant   Actual class
BF Tree      431                 13                     Benign
             18                  221                    Malignant
IBK          435                 9                      Benign
             19                  220                    Malignant
SMO          431                 13                     Benign
             13                  226                    Malignant

importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Rank
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.05950     0.464       0.210        130.244500   7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.96820     0.212       0.212        64.122733    9
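The ranking in the table is just the average of the three selection criteria per attribute; this sketch reproduces it (values transcribed from the table, with decimal points restored from the garbled source). Note that the unnormalized chi-squared scores dominate such an average, which is a weakness of this simple combination:

```python
# Average the three feature-selection criteria (chi-squared, info gain,
# gain ratio) for each attribute, then rank attributes by that average.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}
averages = {name: sum(vals) / 3 for name, vals in scores.items()}
ranking = sorted(averages, key=averages.get, reverse=True)
# ranking[0] is "Uniformity of Cell Size", the attribute the paper
# identifies as most important
```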

CONCLUSION

• The accuracy of the classification techniques was evaluated based on the selected classifier algorithms.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on the paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report". Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. "World Cancer Report". International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods". International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets". 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. "Transductive inference for text classification using support vector machines". Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 24(14), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates". Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. "Nuclear feature extraction for breast tumor diagnosis". Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree". Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V.N. "The Nature of Statistical Learning Theory", 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). "C4.5: Programs for Machine Learning". Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Supervised Learning

• Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• New data is classified based on the model built on the training set (known categories).

(Slide figure: items grouped into Category "A" and Category "B": classification (recognition), i.e. supervised classification.)

Classification

• Every day, all the time, we classify things.
• E.g. crossing the street:
  - Is there a car coming?
  - At what speed?
  - How far is it to the other side?
  - Classification: safe to walk or not.

Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Prediction:
• models continuous-valued functions, i.e. predicts unknown or missing values

Classification - A Two-Step Process

1. Model construction: describing a set of predetermined classes
   • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
   • The set of tuples used for model construction is the training set.
   • The model is represented as classification rules, decision trees, or mathematical formulae.

2. Model usage: classifying future or unknown objects
   • Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
   • The test set is independent of the training set, otherwise over-fitting will occur.
   • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

Name   Rank             Years   Tenured
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

A classification algorithm produces the classifier (model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

Name      Rank             Years   Tenured
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured? yes
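The learned model above is a single rule, which can be written out directly as code (a toy sketch of the slide's example, not a real learner):

```python
# The classifier from the slide: IF rank = 'professor' OR years > 6
# THEN tenured = 'yes'.
def predict_tenured(rank, years):
    return rank.lower() == "professor" or years > 6

# Unseen data from the slide: (Jeff, Professor, 4) -> tenured
jeff_tenured = predict_tenured("Professor", 4)
```

Note that the rule misclassifies Merlisa (Associate Prof, 7 years, not tenured) in the testing data, which is exactly why the accuracy-estimation step on an independent test set matters.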

Classification

• Classification is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification

• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

Quality of a Classifier

• Quality is assessed partly by computing time: lower is better.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. The ultimate reason for doing classification is to increase understanding of the domain, or to improve predictions compared to unclassified data.

Classification Techniques

• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example

(Slide figure: points plotted by Humidity vs. Temperature, labeled "play tennis" vs. "do not play tennis".)

Linear Classifiers: Which Hyperplane?

The decision boundary is the line ax + by - c = 0.

• There are lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
• One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane far from the data.

SVM - Support Vector Machines

(Slide figure: two separating hyperplanes with their support vectors, one with a small margin and one with a large margin.)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(Slide figure: support vectors lying on the margin; the chosen hyperplane maximizes the margin, with a narrower margin shown for comparison.)

Non-Separable Case

In the non-separable case the optimization is still handled with the Lagrangian trick.

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, SVMs can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set.
2. Select the K examples nearest to E in the training set.
3. Assign E to the most common class among its K nearest neighbors.

(Slide figure: "Response" and "No response" points, with the new point classified as Response.)

Distance Between Neighbors

• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is defined as:

  D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[(35 - 41)^2 + (95 - 215)^2 + (3 - 2)^2] ≈ 120.15
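The distance formula above can be checked with a few lines of Python (income expressed in thousands so that the units match the slide):

```python
import math

# Euclidean distance between two examples with numerical attributes
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# John and Rachel from the slide: (age, income in K, no. of credit cards)
john = (35, 95, 3)
rachel = (41, 215, 2)
d = euclidean(john, rachel)   # sqrt(36 + 14400 + 1) ≈ 120.15
```

The income term dominates the distance here, which is why attribute scaling usually matters for KNN in practice.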

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the performance of different classification techniques.

The goal is to develop accurate prediction models for breast cancer using data mining techniques.

Three classification techniques are compared in the Weka software; the comparison shows that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: denser breast tissue carries a higher risk
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese

78

0407202379

BACKGROUND Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

Vikas Chaurasia et al. used RepTree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.

Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al. used naïve Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.

Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.

B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

 Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: benign: 458 (65.5%), malignant: 241 (34.5%).
 Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer hold; the correct distribution is benign 444 (65%) and malignant 239 (35%).

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute Domain
Sample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 for benign, 4 for malignant

0407202387

EVALUATION METHODS We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.

WEKA is a collection of machine learning algorithms for data mining tasks.

The algorithms can either be applied directly to a dataset or called from your own Java code.

WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.

It is also well suited for developing new machine learning schemes.

WEKA is open source software issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria                BF Tree   IBK     SMO
Time to Build Model (in sec)       0.97      0.02    0.33
Correctly Classified Instances     652       655     657
Incorrectly Classified Instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS The sensitivity, or the true positive rate (TPR), is defined by TP / (TP + FN);

the specificity, or the true negative rate (TNR), is defined by TN / (TN + FP);

the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).

True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.

92 04072023 AAST-Comp eng
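These definitions translate directly into code. As a check, the SMO confusion-matrix counts reported in the results (431/13 for benign, 226/13 for malignant) reproduce the 96.19% accuracy figure; this is a sketch treating benign as the positive class:

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

# SMO counts, with benign taken as the positive class
tp, fn, fp, tn = 431, 13, 13, 226
round(accuracy(tp, tn, fp, fn), 4)  # 0.9619, i.e. the 96.19% reported
```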

EXPERIMENTAL RESULTS

Classifier   TP     FP     Precision   Recall   Class
BF Tree      0.971  0.075  0.96        0.971    Benign
             0.925  0.029  0.944       0.925    Malignant
IBK          0.98   0.079  0.958       0.98     Benign
             0.921  0.02   0.961       0.921    Malignant
SMO          0.971  0.054  0.971       0.971    Benign
             0.946  0.029  0.946       0.946    Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

Variable                     Chi-squared   Info Gain   Gain Ratio   Average      Rank (importance)
Clump Thickness              378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size      539.79308     0.702       0.3          180.265026   1
Uniformity of Cell Shape     523.07097     0.677       0.272        174.67332    2
Marginal Adhesion            390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size  447.86118     0.534       0.233        149.542726   5
Bare Nuclei                  489.00953     0.603       0.303        163.305176   3
Bland Chromatin              453.20971     0.555       0.201        151.32190    4
Normal Nucleoli              416.63061     0.487       0.237        139.11820    6
Mitoses                      191.9682      0.212       0.212        64.122733    9

04072023AAST-Comp eng96
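The "Average" column in the table is just the mean of the three scores per attribute. A sketch using three of the rows (values as read from the table, with decimal points restored, which is an assumption about the original layout) shows how the ranking falls out:

```python
# (chi-squared, info gain, gain ratio) per attribute, three rows from the table
scores = {
    "Clump Thickness":         (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size": (539.79308, 0.702, 0.300),
    "Mitoses":                 (191.96820, 0.212, 0.212),
}
# average the three criteria, then rank attributes by descending average
average = {name: sum(v) / len(v) for name, v in scores.items()}
ranking = sorted(average, key=average.get, reverse=True)
# ranking[0] is "Uniformity of Cell Size", matching its importance rank of 1
```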

0407202397

CONCLUSION The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.

We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.

The performance of SMO is high compared with the other classifiers.

The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work  Using an updated version of Weka  Using another data mining tool  Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper  Spelling mistakes  No point of contact (e-mail)  Wrong percentage calculation  Copying from old papers  Charts not clear  No contributions

04072023AAST-Comp eng99

Comparison "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.

That paper introduced a more advanced idea, making a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, International Agency for Research on Cancer. World Cancer Report. IARC Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).

[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

AAST-Comp eng 102

[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.

04072023

[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023

04072023105

Thank you

AAST-Comp eng

Page 31: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classification
• Everyday, all the time, we classify things.
• E.g. crossing the street:
– Is there a car coming?
– At what speed?
– How far is it to the other side?
– Classification: safe to walk or not?

04072023 AAST-Comp eng 31

04072023 AAST-Comp eng 32

Classification: predicts categorical class labels (discrete or nominal); classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data.

Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.

Classification vs. Prediction

04072023 AAST-Comp eng 33

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes.
 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
 The set of tuples used for model construction is the training set.
 The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: for classifying future or unknown objects.
 Estimate the accuracy of the model:
 The known label of each test sample is compared with the classified result from the model.
 The accuracy rate is the percentage of test set samples that are correctly classified by the model.
 The test set is independent of the training set, otherwise over-fitting will occur.
 If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

04072023 AAST-Comp eng 34

Classification Process (1): Model Construction

Training Data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classification Algorithms →

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classifier (Model)
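The learned rule on this slide can be written as a two-line function and checked against the training data (an illustrative sketch, not Weka output):

```python
# Training set from the slide: (name, rank, years, tenured)
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def predict_tenured(rank, years):
    """IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# the rule reproduces every label in the training set
all(predict_tenured(rank, years) == label for _, rank, years, label in training)
```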

04072023 AAST-Comp eng 35

Classification Process (2): Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured?

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

AAST-Comp eng 3604072023

Classification

• predicts categorical class labels (discrete or nominal)

• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data

AAST-Comp eng 3704072023

Quality of a classifier
• Quality will be calculated with respect to the lowest computing time.
• The quality of a certain model can be described by a confusion matrix.
• The confusion matrix shows, for each new entry, the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds,
• and the cross-diagonal elements represent misclassified compounds.

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

(Diagram: classification techniques include Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.)

40 04072023AAST-Comp eng

Classification Model: Support Vector Machine Classifier

V. Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43


Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
– Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary
– One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions

45

This line represents the decision boundary: ax + by − c = 0

Ch. 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective: Select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane "far" from the data.

04072023 AAST-Comp eng 46

SVM – Support Vector Machines

(Figure: support vectors shown for a small margin vs. a large margin.)

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(Figure: support vectors on the maximized margin vs. a narrower margin.)

04072023 AAST-Comp eng
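For a linear SVM, the decision function and the margin width follow directly from the weight vector; a minimal sketch with a made-up w and b (hypothetical values, not a trained model):

```python
import math

def svm_decision(w, b, x):
    """Linear SVM decision function: sign(w . x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def margin_width(w):
    """Distance between the two supporting hyperplanes: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

w, b = (2.0, 0.0), -1.0          # hypothetical weights, not learned from data
svm_decision(w, b, (1.0, 3.0))   # +1 side of the boundary
margin_width(w)                  # maximizing the margin means shrinking ||w||
```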

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM

 Relatively new concept
 Nice generalization properties
 Hard to learn – learned in batch mode using quadratic programming techniques
 Using kernels, can learn very complex functions

04072023 AAST-Comp eng 51

Classification Model: K-Nearest Neighbor Classifier

04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set.
 Select the K nearest examples to E in the training set.
 Assign E to the most common class among its K nearest neighbors.

(Figure: points labelled "Response" and "No response"; the class of the new point is "Response".)

04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes.

"Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)^2 + (95K − 215K)^2 + (3 − 2)^2]

Distance Between Neighbors

04072023 AAST-Comp eng 55
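The distance formula above, applied to John and Rachel (income measured in $K, so 95 and 215):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95, 3)     # age, income in $K, number of credit cards
rachel = (41, 215, 2)
euclidean(john, rachel)  # sqrt(36 + 14400 + 1), about 120.15
```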

Instance Based Learning
 No model is built.
 Store all training examples.
 Any processing is delayed until a new instance must be classified.

(Figure: points labelled "Response" and "No response"; the class of the new point is "Response".)

04072023 AAST-Comp eng 56

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Distance from David:
John:   sqrt[(35 − 37)^2 + (35 − 50)^2 + (3 − 2)^2] = 15.16
Rachel: sqrt[(22 − 37)^2 + (50 − 50)^2 + (2 − 2)^2] = 15
Hannah: sqrt[(63 − 37)^2 + (200 − 50)^2 + (1 − 2)^2] = 152.23
Tom:    sqrt[(59 − 37)^2 + (170 − 50)^2 + (1 − 2)^2] = 122
Nellie: sqrt[(25 − 37)^2 + (40 − 50)^2 + (4 − 2)^2] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes → David's predicted response: Yes

04072023 AAST-Comp eng 58
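Putting the table and the distance column together, a 3-NN majority vote for David can be sketched as:

```python
import math
from collections import Counter

# (age, income in $K, no. of credit cards) -> response, from the table above
examples = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]

def knn_predict(query, examples, k=3):
    """Return the majority class among the k nearest training examples."""
    dist = lambda x: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
    nearest = sorted(examples, key=lambda ex: dist(ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

knn_predict((37, 50, 2), examples)  # "Yes": Rachel, John, Nellie vote Yes/No/Yes
```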

Strengths and Weaknesses

Strengths:
 Simple to implement and use.
 Comprehensible – easy to explain the prediction.
 Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
 Need a lot of space to store all examples.
 Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

                           States of Nature
Decisions                  Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    $200,000           $1,000,000          $3,000,000
Sign with TV Network       $900,000           $900,000            $900,000
Prior Probabilities        0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EVUII or EVBest

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
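The two expected-return calculations above can be reproduced directly:

```python
# Branch probabilities and payoffs from the Jenny Lind payoff table
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_return(payoffs):
    """Probability-weighted sum of the payoffs over all states of nature."""
    return sum(probs[s] * payoffs[s] for s in probs)

ev_movie = expected_return(movie)  # $960,000: the better decision
ev_tv = expected_return(tv)        # $900,000
```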

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.

Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.

Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND
Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The performance factors used for analysing the efficiency of the algorithms are accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND
Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.

B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution as published: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the right distribution is benign 444 (65%) and malignant 239 (35%).
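The corrected distribution is easy to verify; a minimal Python sanity check (not part of the paper):

```python
# Corrected class counts after removing the 16 instances with missing values.
benign, malignant = 444, 239
total = benign + malignant
assert total == 683  # 699 instances minus the 16 removed

benign_pct = round(100 * benign / total, 1)
malignant_pct = round(100 * malignant / total, 1)
print(benign_pct, malignant_pct)  # 65.0 35.0
```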


Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Weka classifier output screenshots, slides 88-89]

importance of the input variables

Domain                         1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                  BF Tree   IBK     SMO
Time to build model (in sec)         0.97      0.02    0.33
Correctly classified instances       652       655     657
Incorrectly classified instances     31        28      26
Accuracy (%)                         95.46     95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
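These definitions translate directly into code; a minimal Python sketch, checked against the SMO confusion matrix reported in these slides (431/13 benign, 13/226 malignant, taking benign as the positive class):

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO counts from the confusion matrix in these slides.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(acc * 100, 2))  # 96.19, matching the reported SMO accuracy
```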

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.05950     0.464       0.210        130.244500     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321904     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.96820     0.212       0.212        64.122733      9

CONCLUSION
• The accuracy of the classification techniques was evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003; 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory. 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 32: A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Classification vs. Prediction

Classification:
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Prediction:
• models continuous-valued functions, i.e., predicts unknown or missing values

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects
• Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model.
• The test set is independent of the training set, otherwise over-fitting will occur.
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training Data:

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Classification algorithms produce the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing Data:

NAME      RANK            YEARS   TENURED
Tom       Assistant Prof  2       no
Merlisa   Associate Prof  7       no
George    Professor       5       yes
Joseph    Assistant Prof  7       yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
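The learned rule can be applied to the testing data directly; a minimal Python sketch of this model-usage step (function name hypothetical):

```python
def predict_tenured(rank: str, years: int) -> str:
    # The classifier (model) learned from the training set:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data: (name, rank, years, actual label)
testing = [("Tom", "Assistant Prof", 2, "no"),
           ("Merlisa", "Associate Prof", 7, "no"),
           ("George", "Professor", 5, "yes"),
           ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(predict_tenured(r, y) == label for _, r, y, label in testing)
print(correct, "/", len(testing))       # 3 / 4: the rule misclassifies Merlisa
print(predict_tenured("Professor", 4))  # yes, for the unseen tuple (Jeff, Professor, 4)
```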

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• Many classification models are used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying unseen data

Quality of a classifier
• Quality is calculated with respect to the lowest computing time.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.

Tennis example (humidity vs. temperature scatter plot): points labelled "play tennis" and "do not play tennis".

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by - c = 0
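Classifying a point against a boundary ax + by - c = 0 just means checking the sign of ax + by - c; a minimal sketch (the coefficients below are hypothetical, not from the slides):

```python
def side(a: float, b: float, c: float, x: float, y: float) -> int:
    """Return +1 or -1 depending on which side of ax + by - c = 0 the point lies."""
    return 1 if a * x + b * y - c > 0 else -1

# Example boundary x + y - 10 = 0: points with x + y > 10 fall on the positive side.
print(side(1, 1, 10, x=9, y=5))  # 1
print(side(1, 1, 10, x=2, y=3))  # -1
```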

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.

SVM - Support Vector Machines: the support vectors are the points nearest the boundary; a large margin is preferred over a small margin.

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

Non-Separable Case: handled via the Lagrangian trick.

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are." A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

Distance Between Neighbors

Each example is represented by a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( (x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35-41)^2 + (95K-215K)^2 + (3-2)^2 ]
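The distance formula, applied to John and Rachel (with income expressed in thousands), in a few lines of Python:

```python
from math import sqrt

def euclidean(x, y):
    # D(X, Y) = sqrt(sum_i (x_i - y_i)^2)
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in K, number of credit cards
rachel = (41, 215, 2)
print(round(euclidean(john, rachel), 2))  # 120.15
```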

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Distance from David:
John:   sqrt[ (35-37)^2 + (35-50)^2  + (3-2)^2 ] = 15.16
Rachel: sqrt[ (22-37)^2 + (50-50)^2  + (2-2)^2 ] = 15
Hannah: sqrt[ (63-37)^2 + (200-50)^2 + (1-2)^2 ] = 152.23
Tom:    sqrt[ (59-37)^2 + (170-50)^2 + (1-2)^2 ] = 122
Nellie: sqrt[ (25-37)^2 + (40-50)^2  + (4-2)^2 ] = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is: Yes
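The whole 3-NN prediction for David can be sketched in a few lines of Python (a from-scratch sketch, not a library implementation):

```python
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(training, query, k=3):
    """Assign the query to the most common class among its k nearest neighbours."""
    nearest = sorted(training, key=lambda row: euclidean(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# (age, income in K, credit cards) -> response, from the table above
training = [((35, 35, 3), "No"), ((22, 50, 2), "Yes"), ((63, 200, 1), "No"),
            ((59, 170, 1), "No"), ((25, 40, 4), "Yes")]
print(knn_predict(training, (37, 50, 2)))  # Yes (neighbours: Rachel, John, Nellie)
```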


Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree

Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions                  Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    $200,000           $1,000,000          $3,000,000
Sign with TV Network       $900,000           $900,000            $900,000
Prior Probabilities        0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
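The expected-return calculation above, reproduced in Python:

```python
def expected_value(payoffs, probs):
    # Expected return: sum of probability-weighted payoffs.
    return sum(p * v for p, v in zip(probs, payoffs))

probs = (0.3, 0.6, 0.1)  # P(small), P(medium), P(large)
ev_movie = expected_value((200_000, 1_000_000, 3_000_000), probs)
ev_tv = expected_value((900_000, 900_000, 900_000), probs)
print(round(ev_movie), round(ev_tv))  # 960000 900000 -> sign with the movie company
```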

Decision Trees
• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

A decision node branches to Decision 1 and Decision 2; a chance node branches to Event 1, Event 2, and Event 3.

Jenny Lind Decision Tree - Solved

Sign with Movie Co. -> chance node:
  Small Box Office (0.3):  $200,000
  Medium Box Office (0.6): $1,000,000
  Large Box Office (0.1):  $3,000,000
  ER = $960,000

Sign with TV Network -> chance node:
  Small Box Office (0.3):  $900,000
  Medium Box Office (0.6): $900,000
  Large Box Office (0.1):  $900,000
  ER = $900,000

The expected return of signing with the movie company ($960,000) exceeds that of the TV network ($900,000), so the movie branch is kept and the TV branch is pruned.

Results

Performance evaluation cycle: Dataset -> Data preprocessing -> Feature selection -> Classification (with the selected data mining tool) -> Performance evaluation.

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross validation, i.e.:
  - Split the data into 10 equal-sized pieces.
  - Train on 9 pieces and test on the remainder.
  - Do this for all possibilities and average.
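The fold logic above can be sketched in a few lines of Python (a simple index-based sketch, not Weka's implementation):

```python
def ten_fold_indices(n, folds=10):
    """Split range(n) into `folds` roughly equal pieces; each piece serves once as the test set."""
    pieces = [list(range(i, n, folds)) for i in range(folds)]
    for k in range(folds):
        test = pieces[k]
        train = [i for j, piece in enumerate(pieces) if j != k for i in piece]
        yield train, test

# With 150 instances, every index is tested exactly once across the 10 folds.
tested = sorted(i for _, test in ten_fold_indices(150) for i in test)
print(tested == list(range(150)))  # True
```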

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software.
• The comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors
• Gender, age.
• Genetic risk factors, family history.
• Personal history of breast cancer.
• Race: white or black.
• Dense breast tissue (denser breast tissue carries a higher risk).
• Certain benign (not cancer) breast problems.
• Lobular carcinoma in situ.
• Menstrual periods.

Risk factors
• Breast radiation early in life.
• Treatment with the drug DES (diethylstilbestrol) during pregnancy.
• Not having children, or having them later in life.
• Certain kinds of birth control.
• Using hormone therapy after menopause.
• Not breastfeeding.
• Alcohol.
• Being overweight or obese.

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

Classifier   TP     FP     Precision   Recall   Class
BF Tree      0.971  0.075  0.960       0.971    Benign
             0.925  0.029  0.944       0.925    Malignant
IBK          0.980  0.079  0.958       0.980    Benign
             0.921  0.020  0.961       0.921    Malignant
SMO          0.971  0.054  0.971       0.971    Benign
             0.946  0.029  0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows = actual class, columns = predicted class):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

importance of the input variables

Variable                     Chi-squared   Info Gain   Gain Ratio   Average      Importance
Clump Thickness              378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size      539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape     523.07097     0.677       0.272        174.673323   2
Marginal Adhesion            390.0595      0.464       0.210        130.2445     7
Single Epithelial Cell Size  447.86118     0.534       0.233        149.542726   5
Bare Nuclei                  489.00953     0.603       0.303        163.305176   3
Bland Chromatin              453.20971     0.555       0.201        151.321903   4
Normal Nucleoli              416.63061     0.487       0.237        139.118203   6
Mitoses                      191.9682      0.212       0.212        64.122733    9
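The "Average" column above is simply the mean of the three criteria for each attribute; a short sketch (values copied from the table) that reproduces the importance ranking:

```python
# Chi-squared, information gain and gain ratio per attribute,
# as reported by the WEKA attribute evaluators.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.0595,  0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.9682,  0.212, 0.212),
}

# Average the three criteria and rank attributes from most to least important.
avg = {name: sum(vals) / 3 for name, vals in scores.items()}
ranking = sorted(avg, key=avg.get, reverse=True)
print(ranking[0])   # Uniformity of Cell Size
print(ranking[-1])  # Mitoses
```

Note that the unweighted average is dominated by the chi-squared term because of its much larger scale, so the ranking here essentially follows the chi-squared column.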

CONCLUSION

The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
The performance of SMO is the highest compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of WEKA.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

comparison

Compared with "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive & descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 33: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes.
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
• The set of tuples used for model construction is the training set.
• The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: for classifying future or unknown objects.
• Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model.
• The test set is independent of the training set, otherwise over-fitting will occur.
• If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction

Training data:

Name   Rank            Years  Tenured
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

The classification algorithm learns the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

Name     Rank            Years  Tenured
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). The rule predicts tenured = 'yes'.
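The two steps above can be sketched directly: the rule learned from the training set is applied to the testing data to estimate accuracy, then to the unseen tuple (names and values are taken from the slides):

```python
# Classifier (model) from step 1: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def predict(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2a: estimate accuracy on the independent test set.
testing_data = [("Tom",     "Assistant Prof", 2, "no"),
                ("Merlisa", "Associate Prof", 7, "no"),
                ("George",  "Professor",      5, "yes"),
                ("Joseph",  "Assistant Prof", 7, "yes")]
correct = sum(predict(rank, years) == label for _, rank, years, label in testing_data)
accuracy = correct / len(testing_data)
print(accuracy)  # 0.75 -- Merlisa is misclassified

# Step 2b: classify unseen data (Jeff, Professor, 4).
print(predict("Professor", 4))  # yes
```

The 75% accuracy on the held-out test set is exactly the kind of estimate the slide describes before the model is trusted on unseen data.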

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal).
• constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.

Quality of a classifier
• Quality is calculated with respect to the lowest computing time.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method on new entries.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research.
The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification techniques include: Naïve Bayes, SVM, C4.5, KNN, BF Tree, and IBK.

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)
SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications.
Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data.

Tennis example

(Figure: a scatter plot of Humidity vs. Temperature, with points marked "play tennis" or "do not play tennis".)

Linear classifiers: Which Hyperplane?
• Lots of possible solutions for a, b, c in the decision boundary ax + by - c = 0.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
• One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a Good Hyper-Plane
Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane "far" from the data.

SVM - Support Vector Machines

(Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors are the training points closest to the boundary.)

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

Non-Separable Case
The non-separable case is handled with the Lagrangian trick: the constrained optimization is solved through its Lagrangian dual.

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, it can learn very complex functions.
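The paper trains SMO, WEKA's SVM implementation, on the Wisconsin data. As a rough analogue, a maximum-margin linear SVM can be sketched in Python with scikit-learn; note that scikit-learn's built-in breast cancer data is the diagnostic WDBC set (569 instances, 30 features), not the original 683-instance dataset used in the paper:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# WDBC: 569 instances, 30 real-valued features, classes malignant/benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Scale the features, then fit a linear-kernel (maximum-margin) SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"test accuracy: {acc:.3f}")
```

The held-out accuracy lands in the same mid-90s range the paper reports for SMO, which is consistent with the margin-maximization argument above.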

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

(Figure: a scatter of "response" / "no response" points; the class assigned to the query point is Response.)

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples: for X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn),

D(X, Y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: the same "response" / "no response" scatter; the class assigned to the query point is Respond.)

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distances from David (income in thousands):
John:   sqrt[(35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2] = 15.16
Rachel: sqrt[(22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2] = 15.00
Hannah: sqrt[(63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2] = 152.23
Tom:    sqrt[(59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2] = 122.00
Nellie: sqrt[(25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2] = 15.74

The three nearest neighbors are Rachel (Yes), John (No) and Nellie (Yes), so David's predicted response is Yes.
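The 3-NN prediction for David can be reproduced with a few lines of Python (training values taken from the slide's table):

```python
import math
from collections import Counter

# Training examples: (age, income in K, no. of credit cards) -> response.
training = [("John",   (35, 35, 3),  "No"),
            ("Rachel", (22, 50, 2),  "Yes"),
            ("Hannah", (63, 200, 1), "No"),
            ("Tom",    (59, 170, 1), "No"),
            ("Nellie", (25, 40, 4),  "Yes")]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, k=3):
    # Sort training examples by distance to the query and vote among the k nearest.
    nearest = sorted(training, key=lambda ex: euclidean(ex[1], query))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
print(knn_predict(david))  # Yes  (neighbors: Rachel, John, Nellie)
```

Note that no model is fitted: all the work happens at query time, which is exactly the "instance based learning" behavior described above.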

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree

Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts: small box office, $200,000; medium box office, $1,000,000; large box office, $3,000,000.
• TV network payout: flat rate, $900,000.
• Probabilities: P(small box office) = 0.3; P(medium box office) = 0.6; P(large box office) = 0.1.

Jenny Lind - Payoff Table

Decision                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior probabilities      0.3               0.6                0.1

Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
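The expected-return computation above is a one-liner per decision; a sketch using the slide's payoffs (probabilities kept in tenths so the arithmetic stays exact):

```python
# Prior probabilities in tenths (0.3, 0.6, 0.1) to keep the arithmetic exact.
prob_tenths = {"small": 3, "medium": 6, "large": 1}
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

def expected_value(decision):
    # Probability-weighted sum of payoffs over the states of nature.
    return sum(prob_tenths[s] * payoffs[decision][s] for s in prob_tenths) // 10

best = max(payoffs, key=expected_value)
print(expected_value("movie"), expected_value("tv"), best)  # 960000 900000 movie
```

This is also exactly what "solving the tree from right to left" computes at each chance node.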

Decision Trees
• Three types of nodes: decision nodes, represented by squares; chance nodes, represented by circles; terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Diagram: a decision node with branches Decision 1 and Decision 2; Decision 1 leads to a chance node with branches Event 1, Event 2, and Event 3.)

Jenny Lind Decision Tree

(Diagram: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with branches Small, Medium, and Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 on every branch for the TV contract.)

Jenny Lind Decision Tree

(Same diagram, with probabilities 0.3, 0.6, and 0.1 attached to the Small, Medium, and Large Box Office branches, and expected-return (ER) labels at the chance nodes.)

Jenny Lind Decision Tree - Solved

(Solving right to left: ER = $960,000 at the movie chance node and ER = $900,000 for the TV branch, so the movie contract is chosen with ER = $960,000.)

Results

(Diagram: the performance evaluation cycle of the selected data mining tool: dataset, data preprocessing, feature selection, classification, performance evaluation.)

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation
• Correctly classified instances: 143 (95.3%); incorrectly classified instances: 7 (4.67%).
• Default 10-fold cross validation, i.e.:
  - Split the data into 10 equal-sized pieces.
  - Train on 9 pieces and test on the remainder.
  - Do this for all possibilities and average.
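The 10-fold split can be sketched without any library: partition the instance indices into 10 pieces, train on 9 and test on the held-out piece, rotating through all folds (150 instances assumed here, matching the slide's 143 + 7 counts):

```python
def kfold_indices(n, k=10):
    # Partition indices 0..n-1 into k (nearly) equal-sized folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(150, k=10)
print(len(folds), len(folds[0]))  # 10 15

# One round: test on fold 0, train on the remaining 9 folds.
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
print(len(train_idx), len(test_idx))  # 135 15
# In practice: fit the classifier on train_idx, score it on test_idx,
# repeat for every fold, and average the 10 accuracies.
```

(Real implementations also shuffle, and often stratify, before splitting; this sketch only shows the fold mechanics.)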

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed and usually don't grow back.
Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed, but sometimes grow back.

Risk factors
Gender; age; genetic risk factors; family history; personal history of breast cancer; race (white or black); dense breast tissue (denser breast tissue carries a higher risk); certain benign (not cancer) breast problems; lobular carcinoma in situ; menstrual periods.

Risk factors (cont.)
Breast radiation early in life; treatment with the drug DES (diethylstilbestrol) during pregnancy; not having children, or having them later in life; certain kinds of birth control; using hormone therapy after menopause; not breastfeeding; alcohol; being overweight or obese.

BACKGROUND

Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.
Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and a decision table (DT) to predict the survivability of heart disease patients.
Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.
Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiotocography1 and cardiotocography2) and other datasets not related to the medical domain.
B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances are among those excluded, hence those percentages refer to the full dataset; for the 683 instances the distribution is Benign 444 (65%) and Malignant 239 (35%).

Attribute                    Domain
Sample Code Number           id number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant
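The cleaning step described above (dropping records whose attribute values contain the missing-value marker '?', and mapping class 2/4 to benign/malignant) can be sketched as follows; the three rows are illustrative samples in the UCI file format, and the real file comes from the UCI repository:

```python
import csv, io

# Illustrative rows in the breast-cancer-wisconsin.data format:
# id, 9 attribute values (1-10, '?' for missing), class (2 = benign, 4 = malignant).
raw = """\
1000025,5,1,1,1,2,1,3,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
"""

rows = list(csv.reader(io.StringIO(raw)))
# Drop instances with missing values, as done to obtain the 683-instance dataset.
complete = [r for r in rows if "?" not in r]
labels = ["benign" if r[-1] == "2" else "malignant" for r in complete]
print(len(complete), labels)  # 2 ['benign', 'malignant']
```

Applied to the full 699-instance file, the same filter removes the 16 incomplete records mentioned in the summary.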

  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 34: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023 AAST-Comp eng 34

Classification Process (1): Model Construction

Training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

The classification algorithm learns a classifier (model) from this set, e.g.:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
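The two-step process on these slides can be sketched in a few lines of Python. This is only an illustrative sketch of the idea (the "model" is the rule from the slide, applied by hand rather than learned by an algorithm):

```python
# Step 1: a model is built from the training set.
train = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]

def model(rank, years):
    # The rule induced on the slide:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# The model fits the training data it was built from.
train_acc = sum(model(r, y) == t for _, r, y, t in train) / len(train)

# Step 2: the model is used on unseen data.
jeff = model("Professor", 4)
print(train_acc, jeff)  # -> 1.0 yes
```

Here the rule classifies all six training records correctly and predicts "yes" for the unseen record (Jeff, Professor, 4).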

Classification
• is a data mining (machine learning) technique used to predict group membership for data instances.
• Classification analysis is the organization of data into given classes.
• These approaches normally use a training set where all objects are already associated with known class labels.
• The classification algorithm learns from the training set and builds a model.
• The model is then used to classify new objects.

Classification
• predicts categorical class labels (discrete or nominal).
• constructs a model from the training set and the values (class labels) of a classifying attribute, and uses it to classify unseen data.

Quality of a classifier
• Quality is evaluated with respect to computing time (lower is better) and predictive ability.
• The quality of a model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method: each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the off-diagonal elements represent misclassified compounds.

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research. The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data.

Classification Techniques

Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBk

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example

(Figure: games plotted over Humidity vs. Temperature; one marker class = play tennis, the other = do not play tennis.)

Linear classifiers: Which Hyperplane?

• There are lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• A Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
• One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

(Figure: the line ax + by - c = 0 represents the decision boundary.)

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data. Intuition (Vapnik, 1965), assuming linear separability: (i) separate the data; (ii) place the hyper-plane "far" from the data.

SVM - Support Vector Machines

(Figure: two separating hyperplanes with their support vectors, one with a small margin and one with a large margin.)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving an SVM is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(Figure: support vectors lying on the maximized margin, with a narrower margin shown for contrast.)

Non-Separable Case

In the non-separable case the problem is still solved through its Lagrangian dual (the "Lagrangian trick").

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, SVMs can learn very complex functions.
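The margin idea can be illustrated with a tiny linear SVM trained by subgradient descent on the hinge loss. This is only a toy stand-in for the quadratic-programming and SMO solvers mentioned on these slides; the 2-D points and hyperparameters are made up for illustration:

```python
# Minimal linear SVM via subgradient descent on the regularized hinge loss.
def train_linear_svm(data, epochs=200, lr=0.1, lam=0.01):
    """data: list of ((x1, x2), y) pairs with y in {-1, +1}."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1] + b) < 1:
                # Inside the margin: step toward classifying (x, y) correctly.
                w[0] += lr * (y * x[0] - lam * w[0])
                w[1] += lr * (y * x[1] - lam * w[1])
                b += lr * y
            else:
                # Outside the margin: only the regularizer shrinks w,
                # which corresponds to widening the margin.
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

points = [((2, 2), 1), ((3, 3), 1), ((3, 1), 1),
          ((0, 0), -1), ((-1, 1), -1), ((0, -1), -1)]
w, b = train_linear_svm(points)
separated = all(y * (w[0] * x[0] + w[1] * x[1] + b) > 0 for x, y in points)
print(separated)
```

On this linearly separable toy set the learned hyperplane classifies every training point on the correct side of the decision boundary.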

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are." A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set.
2. Select the K examples nearest to E in the training set.
3. Assign E to the most common class among its K nearest neighbors.

(Figure: points labelled "Response" / "No response"; the new point is assigned the class "Response".)

Distance Between Neighbors

Each example is represented by a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

    D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )

Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

    Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]
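The distance formula translates directly to code; a minimal sketch, taking income in thousands as on the slide:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in $K, number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)
print(round(d, 2))  # -> 120.15
```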

Instance Based Learning

• No model is built: all training examples are stored.
• Any processing is delayed until a new instance must be classified.

(Figure: the new point is assigned the class "Response" from its stored neighbors.)

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15.00
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.00
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74

The three nearest neighbors of David are Rachel (Yes), John (No) and Nellie (Yes), so David is predicted to respond: Yes.
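The 3-NN prediction for David can be reproduced with a short sketch (plain Euclidean distance and a majority vote, as in the worked example above):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (attribute_vector, label) pairs; query: attribute vector."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbors = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

customers = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]
david = knn_predict(customers, (37, 50, 2))
print(david)  # -> Yes
```

The three nearest stored customers are Rachel, John and Nellie, and the majority of their labels is "Yes".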

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible: the prediction is easy to explain.
• Robust to noisy data, by averaging over the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.
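Tree learners such as C4.5 choose which attribute to split on by information gain, the reduction in entropy when the training set is partitioned into subsets. A minimal sketch of that computation, with hypothetical labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Entropy reduction when `labels` is partitioned into `subsets`."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

labels = ["yes", "yes", "yes", "no", "no", "no"]
# A split into pure subsets removes all uncertainty:
gain = information_gain(labels, [["yes"] * 3, ["no"] * 3])
print(entropy(labels), gain)  # -> 1.0 1.0
```

A 50/50 class mix has entropy 1 bit, and a perfectly separating split recovers all of it as gain; C4.5 actually uses the related gain-ratio criterion to penalize many-valued attributes.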

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

                         States of Nature
Decision                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior probabilities      0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (the best expected value)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

Decision Trees

• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branching into Decision 1 and Decision 2; Decision 1 leads to a chance node with Event 1, Event 2 and Event 3.)

Jenny Lind Decision Tree

(Figure: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with Small, Medium and Large box-office outcomes, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 / $900,000 / $900,000 for the TV contract.)

Jenny Lind Decision Tree

(Figure: the same tree with probabilities 0.3, 0.6 and 0.1 attached to the Small, Medium and Large box-office branches of each chance node, whose expected returns (ER) are to be computed.)

Jenny Lind Decision Tree - Solved

(Figure: with probabilities 0.3 / 0.6 / 0.1, the movie chance node has ER = $960,000 and the TV chance node has ER = $900,000; the best decision at the root, ER = $960,000, is to sign with the movie company.)

Results

Performance evaluation cycle: dataset -> data preprocessing -> feature selection -> classification (with the selected data mining tool) -> performance evaluation.

Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBk and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and to lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors

• Gender, age
• Genetic risk factors, family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)

• Bellaachi et al. used naive Bayes, a decision tree and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years, and the other for those patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)

S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms were accuracy and error rate. The results show that the efficiency of the logistic classification function is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)

• C. Kaewchinporn presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography 1 and cardiotocography 2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence these percentages are wrong; the right ones are benign 444 (65%) and malignant 239 (35%).
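The corrected class distribution noted above is easy to verify:

```python
# After dropping the 16 instances with missing values (14 benign, 2 malignant).
benign, malignant = 458 - 14, 241 - 2
total = benign + malignant
pct_benign = round(100 * benign / total)
pct_malignant = round(100 * malignant / total)
print(benign, malignant, total, pct_benign, pct_malignant)  # -> 444 239 683 65 35
```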

Attribute                    Domain
Sample Code Number           id number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

(Figures: result charts.)

Importance of the input variables

Distribution of attribute values (count of instances taking each value 1-10):

Attribute                    1    2    3    4    5    6    7    8    9    10   Sum
Clump Thickness              139  50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size      373  45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape     346  58   53   43   32   29   30   27   7    58   683
Marginal Adhesion            393  58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size  44   376  71   48   39   40   11   21   2    31   683
Bare Nuclei                  402  30   28   19   30   4    8    21   9    132  683
Bland Chromatin              150  160  161  39   34   9    71   28   11   20   683
Normal Nucleoli              432  36   42   18   19   22   16   23   15   60   683
Mitoses                      563  35   33   12   6    3    9    8    0    14   683
Sum                          2843 850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

Evaluation criteria                Classifiers
                                   BF Tree  IBk    SMO
Time to build model (in sec)       0.97     0.02   0.33
Correctly classified instances     652      655    657
Incorrectly classified instances   31       28     26
Accuracy (%)                       95.46    95.90  96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
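These definitions can be applied to SMO's confusion matrix on this dataset (431 benign and 226 malignant correctly classified, 13 misclassified each way; benign taken as the positive class):

```python
# Metrics from SMO's confusion matrix, with benign as the positive class.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
# -> 0.971 0.946 0.9619
```

These values match the per-class recalls and the 96.19% accuracy reported for SMO in the tables.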

EXPERIMENTAL RESULTS

Classifier  Class      TP     FP     Precision  Recall
BF Tree     Benign     0.971  0.075  0.960      0.971
            Malignant  0.925  0.029  0.944      0.925
IBk         Benign     0.980  0.079  0.958      0.980
            Malignant  0.921  0.020  0.961      0.921
SMO         Benign     0.971  0.054  0.971      0.971
            Malignant  0.946  0.029  0.946      0.946

EXPERIMENTAL RESULTS

Classifier  Predicted Benign  Predicted Malignant  Actual class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBk         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant

Importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Importance (rank)
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.05950    0.464      0.210       130.244500  7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.96820    0.212      0.212       64.122733   9

CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBk and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] IARC, Lyon. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory. 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 35: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023 AAST-Comp eng 35

Classification Process (2) Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff Professor 4)

Tenured

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifier
• Quality is calculated with respect to lowest computing time.
• The quality of a given model can be described by a confusion matrix.
• The confusion matrix shows the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds,
• and the cross-diagonal elements represent misclassified compounds.

AAST-Comp eng 3804072023
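Under this convention (rows = predicted class, columns = actual class), a confusion matrix can be tallied directly from label pairs. A small stdlib sketch with made-up labels, purely for illustration:

```python
from collections import Counter

def confusion_matrix(predicted, actual, labels):
    """Row = predicted class, column = actual class (the slide's convention).
    Diagonal entries are correctly classified; off-diagonal, misclassified."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]

# Hypothetical labels, not from the paper's experiments.
actual    = ["benign", "benign", "malignant", "benign", "malignant"]
predicted = ["benign", "malignant", "malignant", "benign", "malignant"]
m = confusion_matrix(predicted, actual, ["benign", "malignant"])
print(m)  # -> [[2, 0], [1, 2]]
```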

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

40 04072023AAST-Comp eng

Classification Model: Support Vector Machine Classifier (V. Vapnik)

04072023 AAST-Comp eng 41

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc.,
• due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.

04072023AAST-Comp eng42
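The margin being maximized can be made concrete: for a hyperplane w·x + b = 0, the geometric margin of a dataset is the smallest value of |w·x + b| / ‖w‖ over all points, and SVM picks the separator that makes it largest. A sketch comparing two candidate separators on toy 2-D points (illustrative values, not from the paper):

```python
import math

def margin(w, b, points):
    """Geometric margin: smallest distance from any point to w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)

# Toy, linearly separable points (two classes: near origin vs. upper right).
points = [(0, 0), (0, 1), (3, 3), (4, 3)]

wide = margin((1, 1), -3.5, points)    # separator x + y = 3.5
narrow = margin((1, 0), -1.5, points)  # separator x = 1.5 also separates them
print(wide > narrow)  # -> True: SVM would prefer the wider-margin separator
```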

Support Vector Machine (SVM)

04072023AAST-Comp eng43


Tennis example

Humidity

Temperature

= play tennis
= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by − c = 0

(Ch. 15)

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane 'far' from the data.

04072023 AAST-Comp eng 46

SVM – Support Vector Machines

Support vectors: small margin vs. large margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(Figure: support vectors on the maximized margin, with a narrower margin shown for comparison; Sec. 15.1)

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn – learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

04072023 AAST-Comp eng 51

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the (K) examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

(Figure: points labeled "Response" / "No response"; class of the new example: Response)

Distance Between Neighbors

• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

  D(X, Y) = sqrt( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]

04072023 AAST-Comp eng 55
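The John/Rachel computation can be checked in a couple of lines (note how income, being on a much larger scale than the other attributes, dominates the distance — one reason attributes are often normalized before KNN):

```python
import math

# The slide's two customers as attribute vectors: (age, income in K, cards).
john = (35, 95, 3)
rachel = (41, 215, 2)

# math.dist computes exactly D(X, Y) = sqrt(sum_i (x_i - y_i)^2).
d = math.dist(john, rachel)
print(round(d, 2))  # -> 120.15 (the income term contributes almost all of it)
```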

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: points labeled "Response" / "No response"; class of the new example: Response)

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Customer  Age  Income (K)  No. cards  Response  Distance from David (37, 50, 2)
John      35   35          3          No        sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel    22   50          2          Yes       sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah    63   200         1          No        sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom       59   170         1          No        sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie    25   40          4          Yes       sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74

Prediction for David: Yes (the 3 nearest neighbors are Rachel, John, and Nellie; majority vote Yes)
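The whole 3-NN vote for David can be reproduced from the table above with a minimal stdlib sketch:

```python
import math
from collections import Counter

# The slide's training examples: (age, income in K, no. of cards) -> response.
train = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]

def knn_predict(query, train, k=3):
    """Majority vote among the k training examples nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
# The 3 nearest are Rachel (15.0), John (~15.17) and Nellie (~15.75):
# two "Yes" votes against one "No".
print(knn_predict(david, train))  # -> Yes
```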

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible – easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

04072023 AAST-Comp eng 62

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind – Payoff Table

Decisions                Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior Probabilities      0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 (= EV_UII, or EV_Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
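The two expected values can be checked in a few lines of Python, using the payoffs and probabilities from the slides above:

```python
# Expected-value comparison for the Jenny Lind example.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoff_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoff_tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payoff):
    """Sum of payoff x probability over the states of nature."""
    return sum(payoff[state] * p for state, p in probs.items())

ev_movie = expected_value(payoff_movie)   # ~ $960,000
ev_tv = expected_value(payoff_tv)         # ~ $900,000
print("movie" if ev_movie > ev_tv else "tv")  # -> movie
```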

Decision Trees
• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

0.3

0.6

0.1

0.3

0.6

0.1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

0.3

0.6

0.1

0.3

0.6

0.1

ER = $900,000

ER = $960,000

ER = $960,000

04072023 AAST-Comp eng 70

Results

Data mining cycle: dataset → data preprocessing → feature selection → selection of data mining tool → classification → performance evaluation

Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      TP                    FN
Actual not healthy  FP                    TN

AAST-Comp eng 7204072023

Cross-validation

• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average

04072023 AAST-Comp eng 73
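The fold/train/test/average procedure can be sketched with the standard library, using a toy 1-nearest-neighbor classifier in place of a Weka classifier (illustrative data, not the paper's):

```python
import math
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k (nearly) equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nn_predict(x, train):
    """Toy 1-nearest-neighbor classifier standing in for a real classifier."""
    return min(train, key=lambda t: math.dist(x, t[0]))[1]

def cross_validate(data, k=10):
    accs = []
    for fold in k_fold_indices(len(data), k):
        test = [data[i] for i in fold]                            # hold out one fold
        train = [d for i, d in enumerate(data) if i not in fold]  # train on the other 9
        hits = sum(nn_predict(x, train) == y for x, y in test)
        accs.append(hits / len(test))
    return sum(accs) / k                                          # average over all folds

# Two well-separated toy classes; 20 points -> 10 folds of 2.
data = [((i, i), "a") for i in range(10)] + [((i + 100, i), "b") for i in range(10)]
print(cross_validate(data))  # -> 1.0
```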

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

AAST-Comp eng

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: women with denser breast tissue have a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

04072023AAST-Comp eng

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, due to the fact that they divided the data set into two groups: one for patients who survived more than 5 years, and the other for patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository: data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the right ones are benign 444 (65%) and malignant 239 (35%).

04072023AAST-Comp eng85
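The cleaning step (dropping the 16 instances whose Bare Nuclei value is the missing-value marker "?") can be sketched with stdlib csv. The sample lines below follow the UCI file's format (id, 9 attributes, class) and are illustrative only:

```python
import csv
import io

# A few lines in the UCI file's format; "?" marks a missing Bare Nuclei value.
SAMPLE = """\
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
"""

def load_clean(text):
    """Parse CSV rows, dropping any instance with a missing value ('?')."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if "?" in row:
            continue
        rows.append([int(v) for v in row])
    return rows

data = load_clean(SAMPLE)
print(len(data))  # -> 3 (one row with '?' dropped, mirroring 699 -> 683)
```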

04072023 AAST-Comp eng 86

Attribute                    Domain
Sample Code Number           Id Number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for Benign, 4 for Malignant

0407202387

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Attribute                    1     2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50   104  79   128  33   23   44   14  69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28   6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25   4   55   683
Single Epithelial Cell Size  44    376  71   48   39   40   11   21   2   31   683
Bare Nuclei                  402   30   28   19   30   4    8    21   9   132  683
Bland Chromatin              150   160  161  39   34   9    71   28   11  20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15  60   683
Mitoses                      563   35   33   12   6    3    9    8    0   14   683
Sum                          2843  850  605  333  346  192  207  233  77  516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria               BF Tree  IBK    SMO
Time to Build Model (in sec)      0.97     0.02   0.33
Correctly Classified Instances    652      655    657
Incorrectly Classified Instances  31       28     26
Accuracy (%)                      95.46    95.90  96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
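Plugging the paper's SMO confusion matrix (431, 13, 13, 226, with benign taken as the positive class) into these definitions reproduces the reported figures:

```python
# SMO confusion-matrix counts from the results table (benign = positive class).
TP, FN, FP, TN = 431, 13, 13, 226

sensitivity = TP / (TP + FN)                # true positive rate
specificity = TN / (TN + FP)                # true negative rate
accuracy = (TP + TN) / (TP + FP + TN + FN)

print(round(sensitivity, 3))  # -> 0.971 (SMO's benign recall)
print(round(specificity, 3))  # -> 0.946 (SMO's malignant recall)
print(round(accuracy, 4))     # -> 0.9619 (the reported 96.19%)
```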

EXPERIMENTAL RESULTS

Classifier  Class      TP     FP     Precision  Recall
BF Tree     Benign     0.971  0.075  0.960      0.971
BF Tree     Malignant  0.925  0.029  0.944      0.925
IBK         Benign     0.980  0.079  0.958      0.980
IBK         Malignant  0.921  0.020  0.961      0.921
SMO         Benign     0.971  0.054  0.971      0.971
SMO         Malignant  0.946  0.029  0.946      0.946

EXPERIMENTAL RESULTS

Classifier  Benign  Malignant  Class
BF Tree     431     13         Benign
            18      221        Malignant
IBK         435     9          Benign
            19      220        Malignant
SMO         431     13         Benign
            13      226        Malignant

importance of the input variables

04072023AAST-Comp eng95

Variable                     Chi-squared  Info Gain  Gain Ratio  Average     Importance Rank
Clump Thickness              378.08158    0.464      0.152       126.232526  8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026  1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323  2
Marginal Adhesion            390.05950    0.464      0.210       130.244500  7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726  5
Bare Nuclei                  489.00953    0.603      0.303       163.305176  3
Bland Chromatin              453.20971    0.555      0.201       151.321903  4
Normal Nucleoli              416.63061    0.487      0.237       139.118203  6
Mitoses                      191.96820    0.212      0.212       64.122733   9

04072023AAST-Comp eng96

0407202397

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

04072023AAST-Comp eng99

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

04072023AAST-Comp eng100

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] International Agency for Research on Cancer, Lyon. World Cancer Report. IARC Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905:861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1–3): 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory. 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 36: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classificationbull is a data mining (machine learning) technique used to

predict group membership for data instances bull Classification analysis is the organization of data in

given classbull These approaches normally use a training set where

all objects are already associated with known class labels

bull The classification algorithm learns from the training set and builds a model

bull Many classification models are used to classify new objects

AAST-Comp eng 3604072023

Classification

bull predicts categorical class labels (discrete or nominal)

bull constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying unseen data

AAST-Comp eng 3704072023

Quality of a classifierbull Quality will be calculated with respect to lowest

computing timebull Quality of certain model one can describe by confusion

matrix bull Confusion matrix shows a new entry properties

predictive ability of the method bull Row of the matrix represents the instances in a

predicted class while each column represents the instances in an actual class

bull Thus the diagonal elements represent correctly classified compounds

bull the cross-diagonal elements represent misclassified compounds

AAST-Comp eng 3804072023

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

classification

Techniques

Naiumlve Bays

SVM

C45

KNN

BF tree

IBK

40 04072023AAST-Comp eng

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain a prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network she will receive a single lump sum, but if she signs with the movie company the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  Small box office: $200,000
  Medium box office: $1,000,000
  Large box office: $3,000,000
• TV network payout:
  Flat rate: $900,000
• Probabilities:
  P(Small Box Office) = 0.3
  P(Medium Box Office) = 0.6
  P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000 | $1,000,000 | $3,000,000
Sign with TV Network | $900,000 | $900,000 | $900,000
Prior probabilities | 0.3 | 0.6 | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV_UII or EV_Best
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

Decision Trees
• Three types of "nodes":
  Decision nodes - represented by squares
  Chance nodes - represented by circles
  Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Diagram: a decision node with branches Decision 1 and Decision 2; Decision 1 leads to a chance node with outcomes Event 1, Event 2 and Event 3.)

Jenny Lind Decision Tree

(Diagram: a decision node chooses between "Sign with Movie Co." and "Sign with TV Network". The movie branch leads to a chance node with payoffs $200,000 for a small box office, $1,000,000 for a medium one and $3,000,000 for a large one; the TV branch pays $900,000 in every state.)

Jenny Lind Decision Tree

(Diagram: the same tree with probabilities 0.3, 0.6 and 0.1 attached to the small, medium and large box-office branches; the expected return ER of each chance node is still to be computed.)

Jenny Lind Decision Tree - Solved

(Diagram: rolling the tree back, the movie chance node has ER = $960,000 and the TV branch has ER = $900,000, so the decision node selects the movie contract with ER = $960,000.)
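The roll-back computation on this tree can be sketched in a few lines of Python (payoffs and probabilities as given above):

```python
# Expected value at the movie chance node; the TV branch is a certain payout
movie_outcomes = {200_000: 0.3, 1_000_000: 0.6, 3_000_000: 0.1}
tv_payout = 900_000

ev_movie = sum(payoff * p for payoff, p in movie_outcomes.items())
ev_tv = tv_payout

# At the decision node, prune everything but the best branch
best = "Sign with Movie Co." if ev_movie > ev_tv else "Sign with TV Network"
print(int(ev_movie), ev_tv, best)  # 960000 900000 Sign with Movie Co.
```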

Results

(Diagram, performance evaluation cycle: dataset -> data preprocessing -> feature selection -> data mining tool selection -> classification -> performance evaluation.)

Evaluation Metrics

 | Predicted as healthy | Predicted as unhealthy
Actual healthy | tp | fn
Actual not healthy | fp | tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  Split the data into 10 equal-sized pieces.
  Train on 9 pieces and test on the remainder.
  Do this for all possibilities and average.
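The fold bookkeeping behind 10-fold cross-validation can be illustrated without any ML library (a minimal sketch, not Weka's implementation; 150 instances as in the run above):

```python
# Split n instance indices into k folds; each fold serves once as the test set
# while the remaining k-1 folds form the training set.
def k_fold_indices(n, k=10):
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(150, 10)
print(len(folds), len(folds[0]))  # 10 15
```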


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  Are usually not harmful.
  Rarely invade the tissues around them.
  Don't spread to other parts of the body.
  Can be removed and usually don't grow back.
• Malignant tumors:
  May be a threat to life.
  Can invade nearby organs and tissues (such as the chest wall).
  Can spread to other parts of the body.
  Often can be removed but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (Cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representive Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachi et al. used naive Bayes, decision trees and a back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analysing the efficiency of the algorithms are classification accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiocography1, cardiocography2) and other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Taken from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution as stated in the source: benign 458 (65.5%), malignant 241 (34.5%). Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
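The corrected class distribution is easy to verify:

```python
# After removing the 16 instances with missing values (14 benign, 2 malignant)
benign = 458 - 14      # 444
malignant = 241 - 2    # 239
total = benign + malignant
print(total, round(benign / total, 2), round(malignant / total, 2))  # 683 0.65 0.35
```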

Attribute | Domain
Sample Code Number | id number
Clump Thickness | 1 - 10
Uniformity of Cell Size | 1 - 10
Uniformity of Cell Shape | 1 - 10
Marginal Adhesion | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei | 1 - 10
Bland Chromatin | 1 - 10
Normal Nucleoli | 1 - 10
Mitoses | 1 - 10
Class | 2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables

Attribute | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum
Clump Thickness | 139 | 50 | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69 | 683
Uniformity of Cell Size | 373 | 45 | 52 | 38 | 30 | 25 | 19 | 28 | 6 | 67 | 683
Uniformity of Cell Shape | 346 | 58 | 53 | 43 | 32 | 29 | 30 | 27 | 7 | 58 | 683
Marginal Adhesion | 393 | 58 | 58 | 33 | 23 | 21 | 13 | 25 | 4 | 55 | 683
Single Epithelial Cell Size | 44 | 376 | 71 | 48 | 39 | 40 | 11 | 21 | 2 | 31 | 683
Bare Nuclei | 402 | 30 | 28 | 19 | 30 | 4 | 8 | 21 | 9 | 132 | 683
Bland Chromatin | 150 | 160 | 161 | 39 | 34 | 9 | 71 | 28 | 11 | 20 | 683
Normal Nucleoli | 432 | 36 | 42 | 18 | 19 | 22 | 16 | 23 | 15 | 60 | 683
Mitoses | 563 | 35 | 33 | 12 | 6 | 3 | 9 | 8 | 0 | 14 | 683
Sum | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria | BF Tree | IBK | SMO
Time to build model (sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46 | 95.90 | 96.19
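The accuracy row follows directly from the instance counts over the 683-instance dataset:

```python
# Accuracy (%) = 100 * correctly classified instances / total instances
total = 683
accuracies = {name: round(100 * correct / total, 2)
              for name, correct in [("BF Tree", 652), ("IBK", 655), ("SMO", 657)]}
print(accuracies)  # {'BF Tree': 95.46, 'IBK': 95.9, 'SMO': 96.19}
```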

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
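Applying these definitions to the BF Tree confusion matrix from the experiments (431 benign correct, 13 benign misclassified, 18 malignant misclassified, 221 malignant correct; benign taken as the positive class) reproduces the reported figures:

```python
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

sens, spec, acc = metrics(tp=431, fn=13, fp=18, tn=221)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.971 0.925 0.9546
```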

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree | 0.971 | 0.075 | 0.96 | 0.971 | Benign
BF Tree | 0.925 | 0.029 | 0.944 | 0.925 | Malignant
IBK | 0.98 | 0.079 | 0.958 | 0.98 | Benign
IBK | 0.921 | 0.02 | 0.961 | 0.921 | Malignant
SMO | 0.971 | 0.054 | 0.971 | 0.971 | Benign
SMO | 0.946 | 0.029 | 0.946 | 0.946 | Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows = actual class):

Classifier | Predicted Benign | Predicted Malignant | Class
BF Tree | 431 | 13 | Benign
BF Tree | 18 | 221 | Malignant
IBK | 435 | 9 | Benign
IBK | 19 | 220 | Malignant
SMO | 431 | 13 | Benign
SMO | 13 | 226 | Malignant

Importance of the input variables

Variable | Chi-squared | Info Gain | Gain Ratio | Average | Importance rank
Clump Thickness | 378.08158 | 0.464 | 0.152 | 126.23252 | 8
Uniformity of Cell Size | 539.79308 | 0.702 | 0.3 | 180.265026 | 1
Uniformity of Cell Shape | 523.07097 | 0.677 | 0.272 | 174.67332 | 2
Marginal Adhesion | 390.0595 | 0.464 | 0.21 | 130.2445 | 7
Single Epithelial Cell Size | 447.86118 | 0.534 | 0.233 | 149.542726 | 5
Bare Nuclei | 489.00953 | 0.603 | 0.303 | 163.305176 | 3
Bland Chromatin | 453.20971 | 0.555 | 0.201 | 151.32190 | 4
Normal Nucleoli | 416.63061 | 0.487 | 0.237 | 139.11820 | 6
Mitoses | 191.9682 | 0.212 | 0.212 | 64.122733 | 9
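The average column in this ranking is just the mean of the Chi-squared, Info Gain and Gain Ratio scores; two rows from the table as a check:

```python
# (Chi-squared, Info Gain, Gain Ratio) for two attributes, values from the table
scores = {
    "Uniformity of Cell Size": (539.79308, 0.702, 0.300),
    "Bare Nuclei": (489.00953, 0.603, 0.303),
}
averages = {name: round(sum(vals) / 3, 3) for name, vals in scores.items()}
print(averages)  # {'Uniformity of Cell Size': 180.265, 'Bare Nuclei': 163.305}
```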

CONCLUSION
• The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, building a fusion of classifiers.

References

[1] US Cancer Statistics Working Group, "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, pp. 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Set", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD, "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., Mangasarian, O. L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3), 305-313.
[17] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V. N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachia et al. used naïve Bayes, decision tree and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died within 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
• S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analysing the efficiency of the algorithms are accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
• C. Kaewchinporn presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution as published: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong for the reduced dataset; the correct distribution is benign 444 (65%) and malignant 239 (35%).
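The 16 rows with missing values can be dropped with a few lines of standard Python. A minimal sketch; in the UCI file missing values appear as '?', and the three sample rows below are illustrative stand-ins rather than a reading of the real file:

```python
# Filter out instances with missing values (marked '?' in the UCI file).
# The sample rows below are illustrative, not loaded from the real dataset.
sample_rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",   # complete instance (class 2 = benign)
    "1057013,8,4,5,1,2,?,7,3,1,4",   # Bare Nuclei missing -> dropped
    "1018561,2,1,2,1,2,1,3,1,1,2",   # complete instance
]

clean = [row for row in sample_rows if "?" not in row]
print(len(clean))  # 2 of the 3 sample rows survive
```

Applied to the full 699-instance file, the same filter leaves the 683 instances used in the experiments.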

Attribute                     Domain
Sample Code Number            id number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS
[Two chart slides showing the Weka runs; the charts are not recoverable in text form.]

Importance of the input variables

Attribute value distribution (number of instances with each value 1-10):

Domain                         1    2    3    4    5    6    7    8   9   10  Sum
Clump Thickness              139   50  104   79  128   33   23   44  14   69  683
Uniformity of Cell Size      373   45   52   38   30   25   19   28   6   67  683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7   58  683
Marginal Adhesion            393   58   58   33   23   21   13   25   4   55  683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21   2   31  683
Bare Nuclei                  402   30   28   19   30    4    8   21   9  132  683
Bland Chromatin              150  160  161   39   34    9   71   28  11   20  683
Normal Nucleoli              432   36   42   18   19   22   16   23  15   60  683
Mitoses                      563   35   33   12    6    3    9    8   0   14  683
Sum                         2843  850  605  333  346  192  207  233  77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree    IBK     SMO
Time to build model (sec)            0.97     0.02    0.33
Correctly classified instances        652      655     657
Incorrectly classified instances       31       28      26
Accuracy (%)                        95.46    95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
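Taking benign as the positive class and plugging in SMO's counts from the confusion matrix reported on the results slides (TP = 431, FN = 13, FP = 13, TN = 226), these formulas reproduce the reported 96.19% accuracy. A minimal sketch:

```python
# Sensitivity, specificity and accuracy from a confusion matrix.
# Counts are SMO's results on the 683-instance Wisconsin dataset,
# with benign treated as the positive class.
tp, fn = 431, 13   # benign correctly / wrongly predicted
fp, tn = 13, 226   # malignant wrongly / correctly predicted

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
# 0.971 0.946 0.9619
```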

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075     0.96       0.971    Benign
              0.925     0.029     0.944      0.925    Malignant
IBK           0.98      0.079     0.958      0.98     Benign
              0.921     0.02      0.961      0.921    Malignant
SMO           0.971     0.054     0.971      0.971    Benign
              0.946     0.029     0.946      0.946    Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Benign   Malignant   Class
BF Tree        431       13       Benign
                18      221       Malignant
IBK            435        9       Benign
                19      220       Malignant
SMO            431       13       Benign
                13      226       Malignant

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Rank
Clump Thickness                378.08158      0.464       0.152      126.232526    8
Uniformity of Cell Size        539.79308      0.702       0.300      180.265026    1
Uniformity of Cell Shape       523.07097      0.677       0.272      174.673323    2
Marginal Adhesion              390.05950      0.464       0.210      130.244500    7
Single Epithelial Cell Size    447.86118      0.534       0.233      149.542726    5
Bare Nuclei                    489.00953      0.603       0.303      163.305176    3
Bland Chromatin                453.20971      0.555       0.201      151.321903    4
Normal Nucleoli                416.63061      0.487       0.237      139.118203    6
Mitoses                        191.96820      0.212       0.212       64.122733    9
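The Average column is just the mean of the three scores per attribute; a quick sketch confirming the ranking, with the values copied from the table:

```python
# Mean of the three feature-importance scores per attribute,
# using the chi-squared / info-gain / gain-ratio values from the table.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

# Sort attributes by mean score, highest first.
ranked = sorted(scores, key=lambda a: sum(scores[a]) / 3, reverse=True)
print(ranked[0])  # Uniformity of Cell Size, the top-ranked attribute
```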

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows a high level of performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


Page 38: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Quality of a classifier
• Quality is calculated with respect to lowest computing time.
• The quality of a certain model can be described by its confusion matrix.
• The confusion matrix shows the predictive ability of the method.
• Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class.
• Thus the diagonal elements represent correctly classified compounds, and the cross-diagonal elements represent misclassified compounds.
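For example, with rows as predicted classes and columns as actual classes, the sum of the diagonal gives the correctly classified count. Using BF Tree's confusion matrix from the experimental results slides (431, 13 / 18, 221):

```python
# Correct vs. misclassified counts from a 2x2 confusion matrix
# (rows = predicted class, columns = actual class).
# Numbers are BF Tree's results on the Wisconsin dataset.
matrix = [[431, 13],   # predicted benign:    431 actually benign, 13 malignant
          [18, 221]]   # predicted malignant:  18 actually benign, 221 malignant

correct = matrix[0][0] + matrix[1][1]        # diagonal elements
total = sum(sum(row) for row in matrix)
print(correct, total - correct)  # 652 correctly classified, 31 misclassified
```

These match the 652 correctly and 31 incorrectly classified instances reported for BF Tree.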

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine (SVM) Classifier (V. Vapnik)

Support Vector Machine (SVM)
• SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and has found a great deal of success in many applications.
• Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound of the generalization error by maximizing the margin between the separating hyper-plane and the data set.


Tennis example
[Scatter plot of Temperature vs. Humidity; points labeled "play tennis" or "do not play tennis".]

Linear classifiers: which hyperplane?
This line represents the decision boundary: ax + by - c = 0.
• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
• One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Selection of a good hyper-plane
Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane 'far' from the data.
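A linear boundary ax + by - c = 0 classifies a point by the sign of ax + by - c. A minimal sketch; the coefficients and sample points are made-up illustrative values, not taken from the slides:

```python
# Classify points with a linear decision boundary ax + by - c = 0.
# Coefficients and sample points are illustrative assumptions.
a, b, c = 1.0, 1.0, 100.0   # boundary: humidity + temperature = 100

def classify(humidity, temperature):
    """Return +1 on one side of the boundary, -1 on the other."""
    return 1 if a * humidity + b * temperature - c > 0 else -1

print(classify(80, 35))  # 1  (above the line)
print(classify(30, 20))  # -1 (below the line)
```

SVM training chooses a, b, c so that the margin around this boundary is as wide as possible.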

SVM - Support Vector Machines
[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the points lying on the margin are the support vectors.]

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
[Figure: the maximum-margin hyperplane, its support vectors, and a narrower-margin alternative.]

Non-separable case
[Equation slides: the non-separable formulation and "the Lagrangian trick"; the formulas are not recoverable from the transcript.]

SVM
• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
[Figure: a query point among "response" / "no response" neighbors; predicted class: response.]

Distance between neighbors
• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn) is defined as

    D(X, Y) = sqrt( sum over i of (xi - yi)^2 )

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) vs. Rachel (Age = 41, Income = 215K, No. of credit cards = 2), with income in thousands:

    Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95 - 215)^2 + (3 - 2)^2 ]

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Figure: a query point among "response" / "no response" neighbors; predicted class: response.]

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John        35     35K           3            No
Rachel      22     50K           2            Yes
Hannah      63    200K           1            No
Tom         59    170K           1            No
Nellie      25     40K           4            Yes
David       37     50K           2            ?

Customer   Age   Income (K)   No. cards   Response   Distance from David
John        35       35           3          No      sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel      22       50           2          Yes     sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah      63      200           1          No      sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom         59      170           1          No      sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.00
Nellie      25       40           4          Yes     sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David       37       50           2          ?

The three nearest neighbors are Rachel (Yes), John (No) and Nellie (Yes), so David's predicted response is Yes.
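The worked example above can be reproduced in a few lines; a minimal sketch of the 3-NN vote, with the data copied from the table:

```python
# 3-nearest-neighbor prediction for David, using the customers table.
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# (age, income in K, number of credit cards) -> response
training = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

# Sort records by distance to David and keep the 3 closest.
nearest = sorted(training.values(), key=lambda rec: dist(rec[0], david))[:3]
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)  # Yes (neighbors: Rachel, John, Nellie -> Yes, No, Yes)
```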

Strengths and weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small box office) = 0.3
  - P(Medium box office) = 0.6
  - P(Large box office) = 0.1

Jenny Lind - Payoff Table

Decisions                 Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company      $200,000          $1,000,000          $3,000,000
Sign with TV Network         $900,000            $900,000            $900,000
Prior probabilities             0.3                 0.6                 0.1

Using expected return criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
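The expected-value computation is one weighted sum per decision; a minimal sketch:

```python
# Expected value of each contract: sum of payout * probability.
probs = [0.3, 0.6, 0.1]                       # small, medium, large box office
movie_payouts = [200_000, 1_000_000, 3_000_000]
tv_payouts = [900_000, 900_000, 900_000]      # flat rate regardless of outcome

ev_movie = sum(p * x for p, x in zip(probs, movie_payouts))
ev_tv = sum(p * x for p, x in zip(probs, tv_payouts))
print(round(ev_movie), round(ev_tv))  # 960000 900000 -> sign with the movie company
```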

Decision trees
• Three types of nodes:
  - Decision nodes: represented by squares
  - Chance nodes: represented by circles
  - Terminal nodes: represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example decision tree
[Diagram: a decision node branching to Decision 1 and Decision 2; a chance node branching to Event 1, Event 2 and Event 3.]

Jenny Lind decision tree
[Diagram: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with outcomes Small ($200,000 / $900,000), Medium ($1,000,000 / $900,000) and Large ($3,000,000 / $900,000) box office, with probabilities 0.3, 0.6 and 0.1.]

Jenny Lind decision tree - solved
[Same diagram with expected returns filled in: ER = $960,000 for the movie branch and ER = $900,000 for the TV branch, so the movie contract is selected.]

Results
[Diagram: performance-evaluation cycle linking Dataset, Data preprocessing, Feature selection, Data mining tool selection, Classification and Performance evaluation.]

Evaluation metrics

                      Predicted as healthy   Predicted as unhealthy
Actual healthy                tp                      fn
Actual not healthy            fp                      tn

Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default is 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
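The fold bookkeeping is easy to get wrong; a minimal sketch of the 10-fold split using indices only (no classifier), with the size chosen to match the 150-instance example above:

```python
# Split n examples into k folds; each fold serves once as the test set.
n, k = 150, 10
indices = list(range(n))
folds = [indices[i::k] for i in range(k)]   # round-robin assignment

for test_fold in folds:
    held_out = set(test_fold)
    train = [j for j in indices if j not in held_out]
    # ...train a classifier on `train`, evaluate on `test_fold`,
    # then average the k accuracies...

print(len(folds), len(folds[0]), len(train))  # 10 15 135
```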

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance

Clump Thickness              378.08158    0.464      0.152       126.23252     8
Uniformity of Cell Size      539.79308    0.702      0.3         180.265026    1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.67332     2
Marginal Adhesion            390.0595     0.464      0.21        130.2445      7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
Bland Chromatin              453.20971    0.555      0.201       151.32190     4
Normal Nucleoli              416.63061    0.487      0.237       139.11820     6
Mitoses                      191.9682     0.212      0.212       64.122733     9
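The "Average Rank" column above is simply the average of each attribute's three scores, and the importance rank orders attributes by that average. A sketch that reproduces the ordering from the table's own numbers:

```python
# Average the chi-squared, info-gain and gain-ratio scores per attribute,
# then rank attributes by that average (1 = most important).

scores = {  # attribute: (chi-squared, info gain, gain ratio)
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

avg = {attr: sum(vals) / 3 for attr, vals in scores.items()}
ranking = sorted(avg, key=avg.get, reverse=True)
print(ranking[0], round(avg[ranking[0]], 2))    # Uniformity of Cell Size 180.27
print(ranking[-1], round(avg[ranking[-1]], 2))  # Mitoses 64.13
```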

CONCLUSION

The accuracy of the classification techniques was evaluated for each selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree. SMO shows the highest performance compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

Compared with "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] IARC. World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods." International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets." 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. "Transductive inference for text classification using support vector machines." Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), pp. 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates." Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. "Nuclear feature extraction for breast tumor diagnosis." Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree." Journal of Neurocomputing 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V.N. "The Nature of Statistical Learning Theory." 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). "C4.5: Programs for Machine Learning." Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 39: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classification Techniques

Building accurate and efficient classifiers for large databases is one of the essential tasks of data mining and machine learning research

The ultimate reason for doing classification is to increase understanding of the domain or to improve predictions compared to unclassified data

04072023AAST-Comp eng39

Classification Techniques

Classification techniques covered: Naïve Bayes, SVM, C4.5, KNN, BF Tree, IBK.

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., due to its generalization ability, and it has found a great deal of success in many applications. Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.

Support Vector Machine (SVM)

04072023AAST-Comp eng43


Tennis example

(scatter plot of points by Humidity and Temperature; one marker = play tennis, the other = do not play tennis)

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  - It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  - One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by - c = 0
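The margin idea above can be made concrete with the point-to-hyperplane distance. An illustrative sketch (the coefficients and the point are made up): the signed distance from (x, y) to the boundary ax + by - c = 0 is (ax + by - c) / sqrt(a^2 + b^2), and an SVM picks a, b, c so the smallest absolute distance over the training points is as large as possible.

```python
import math

def signed_distance(a, b, c, x, y):
    # Distance from (x, y) to the line a*x + b*y - c = 0,
    # positive on one side of the boundary and negative on the other.
    return (a * x + b * y - c) / math.hypot(a, b)

# Example boundary 3x + 4y - 10 = 0 and the point (2, 4):
print(signed_distance(3.0, 4.0, 10.0, 2.0, 4.0))  # 2.4
```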

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane "far" from the data.

SVM - Support Vector Machines

(figure: support vectors shown for a small margin vs. a large margin)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

(figure: support vectors on the maximized margin vs. a narrower margin)

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn - learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

04072023 AAST-Comp eng 51

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
1. Calculate the distance between E and all examples in the training set.
2. Select the K nearest examples to E in the training set.
3. Assign E to the most common class among its K nearest neighbors.

(figure: points labeled "Response" / "No response"; E is assigned the class Response)

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum_{i=1}^{n} (xi - yi)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

04072023 AAST-Comp eng 55
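The John-vs-Rachel distance above can be computed directly. A small sketch; note how the income term dominates the sum, which is why k-NN usually needs feature scaling in practice:

```python
import math

def euclidean(x, y):
    # Euclidean distance between two equal-length attribute tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95_000, 3)     # age, income, number of credit cards
rachel = (41, 215_000, 2)
print(round(euclidean(john, rachel), 2))  # 120000.0
```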

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

Example: 3-Nearest Neighbors

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David     37   50          2          ?         Predicted: Yes (majority of the 3 nearest: Rachel, John, Nellie)

04072023 AAST-Comp eng 58
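The worked example above can be sketched as code, using the same toy table (age, income in K, number of cards, response):

```python
import math

train = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort neighbours by distance to David and vote among the 3 nearest.
nearest = sorted(train, key=lambda row: dist(row[1], david))[:3]
votes = [label for _, _, label in nearest]
prediction = max(set(votes), key=votes.count)
print([name for name, _, _ in nearest], prediction)
# ['Rachel', 'John', 'Nellie'] Yes
```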

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible - easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilities

• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decision                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company  $200,000          $1,000,000         $3,000,000
Sign with TV Network     $900,000          $900,000           $900,000
Prior Probabilities      0.3               0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII) or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
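The expected-return computation above, as code (amounts in dollars):

```python
payoffs = {
    "movie": (200_000, 1_000_000, 3_000_000),
    "tv":    (900_000, 900_000, 900_000),
}
probs = (0.3, 0.6, 0.1)  # small, medium, large box office

# Expected value of each decision: sum of probability * payoff.
ev = {d: round(sum(p * x for p, x in zip(probs, payoffs[d])))
      for d in payoffs}
print(ev)                   # {'movie': 960000, 'tv': 900000}
print(max(ev, key=ev.get))  # movie
```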

Decision Trees

• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

(figure: a decision node branching into Decision 1 and Decision 2; each leads to a chance node branching into Events 1-3)

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Sign with Movie Co. -> chance node:
  Small Box Office (0.3):  $200,000
  Medium Box Office (0.6): $1,000,000
  Large Box Office (0.1):  $3,000,000

Sign with TV Network -> chance node:
  Small Box Office (0.3):  $900,000
  Medium Box Office (0.6): $900,000
  Large Box Office (0.1):  $900,000

Jenny Lind Decision Tree - Solved

ER(Sign with Movie Co.)  = 0.3($200,000) + 0.6($1,000,000) + 0.1($3,000,000) = $960,000
ER(Sign with TV Network) = $900,000

Best decision (ER = $960,000): sign with the movie company.

Results

Pipeline: Dataset -> Data preprocessing -> Feature selection -> Data mining tool selection -> Classification -> Performance evaluation (cycle)

Evaluation Metrics

                    Predicted as healthy  Predicted as unhealthy
Actual healthy      tp                    fn
Actual not healthy  fp                    tn

Cross-validation

• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross validation, i.e.
  - Split data into 10 equal sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
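The splitting scheme above can be sketched in a few lines. Assuming 150 instances as in the quoted run (143 + 7), this builds the 10 disjoint folds and rechecks the quoted 95.3%:

```python
n, k = 150, 10
indices = list(range(n))
folds = [indices[i::k] for i in range(k)]  # 10 disjoint pieces of 15

for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [j for j in indices if j not in held_out]
    # ...fit on train_idx, evaluate on test_fold, then average the scores.

print(len(folds), len(folds[0]))  # 10 15
print(round(100 * 143 / 150, 1))  # 95.3
```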

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract

The aim of this paper is to investigate the performance of different classification techniques: the goal is to develop accurate prediction models for breast cancer using data mining techniques. Three classification techniques are compared in the Weka software, and the comparison results show that Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

04072023AAST-Comp eng

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.

Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.

Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)

Bellaachia et al. used Naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those who died before 5 years.

Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)

Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)

Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are classification accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

04072023AAST-Comp eng

BACKGROUND (cont.)

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.

B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository: data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
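The percentage correction in the note above is easy to verify. A quick sketch recomputing both the corrected 683-instance split and the original 699-instance figures:

```python
benign, malignant = 444, 239
total = benign + malignant
pct_benign = round(100 * benign / total)       # 65
pct_malignant = round(100 * malignant / total)  # 35
print(total, pct_benign, pct_malignant)  # 683 65 35

# The quoted 458 / 241 split was computed on the full 699 rows
# (458/699 = 65.5%, 241/699 = 34.5%), before the 16 rows with
# missing values were removed -- the inconsistency the slide notes.
print(round(100 * 458 / 699, 1), round(100 * 241 / 699, 1))  # 65.5 34.5
```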

Attribute                    Domain
Sample Code Number           Id number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant

EVALUATION METHODS

We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9. WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection. It is also well suited for developing new machine learning schemes. WEKA is open source software issued under the GNU General Public License.


importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 40: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classification Techniques

The classification techniques considered include:
• Naïve Bayes
• SVM
• C4.5
• KNN
• BF Tree
• IBK

Classification Model: Support Vector Machine Classifier (V. Vapnik)

Support Vector Machine (SVM)

SVM is a state-of-the-art learning machine that has been extensively used as a tool for data classification, function approximation, etc., owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data set.

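The margin-maximization idea can be illustrated with a small self-contained sketch (this is not the paper's implementation; the toy points and candidate hyperplanes are made up): score each candidate hyperplane w·x + b = 0 by the smallest distance from any training point to it, and keep the candidate whose smallest distance is largest.

```python
import math

# Toy linearly separable points: ((x1, x2), label) with labels +1 / -1.
points = [((1.0, 3.0), +1), ((2.0, 4.0), +1), ((3.0, 1.0), -1), ((4.0, 2.0), -1)]

def min_margin(w, b):
    """Smallest signed distance from any training point to the hyperplane w.x + b = 0.

    A positive value means the hyperplane separates the two classes; the SVM
    objective is to make this smallest distance (the margin) as large as possible.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in points)

# A few hand-picked candidate hyperplanes; an actual SVM solver optimizes over all of them.
candidates = [((-1.0, 1.0), 0.0), ((-1.0, 0.5), 1.0), ((-2.0, 1.0), 2.0)]
best = max(candidates, key=lambda cand: min_margin(*cand))
print(best)  # -> ((-1.0, 1.0), 0.0), the maximum-margin candidate
```

Of the three candidates, all separate the data, but only the first keeps every point at distance √2 from the boundary, which is exactly the "place the hyper-plane far from the data" intuition.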

Tennis example

(Figure: a scatter plot of Humidity vs. Temperature, with points marked "play tennis" or "do not play tennis".)

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

The line ax + by - c = 0 represents the decision boundary. [Ch. 15]

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data.
(ii) Place the hyper-plane "far" from the data.

SVM - Support Vector Machines

(Figure: two separating hyperplanes through the same support vectors, one with a small margin and one with a large margin.)

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method. [Sec. 15.1]

Non-Separable Case

(Figures: the non-separable case and the Lagrangian trick.)

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

(Figure: a query point surrounded by "response" and "no response" neighbors; predicted class: Response.)

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum over i = 1..n of (xi - yi)^2 )

Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

Instance Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: as above; predicted class: Response.)

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35         | 3         | No       | sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel   | 22  | 50         | 2         | Yes      | sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah   | 63  | 200        | 1         | No       | sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom      | 59  | 170        | 1         | No       | sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie   | 25  | 40         | 4         | Yes      | sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes, so David is predicted: Yes.
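The worked example can be reproduced with a minimal from-scratch 3-NN in plain Python, with the data taken directly from the table above:

```python
import math
from collections import Counter

# Training examples from the slide: (age, income in K, number of credit cards) -> response.
training = [
    ((35, 35, 3), "No"),   # John
    ((22, 50, 2), "Yes"),  # Rachel
    ((63, 200, 1), "No"),  # Hannah
    ((59, 170, 1), "No"),  # Tom
    ((25, 40, 4), "Yes"),  # Nellie
]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, examples, k=3):
    # Sort the training examples by distance to the query and vote among the k nearest.
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
print(knn_predict(david, training))  # -> Yes (Rachel, John and Nellie are the 3 nearest)
```

The majority vote among the three nearest neighbors (Yes, No, Yes) matches the table's conclusion.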

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain a prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all other examples).

Decision Tree

• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.
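As a rough sketch of how such a learner picks its splits, the following computes information gain, the entropy-reduction criterion used by ID3/C4.5-style learners, on a toy discretisation of the earlier customer table; the binary attributes "young" (age < 30) and "many_cards" (3 or more cards) are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target="response"):
    """How much splitting on `attr` reduces the entropy of the target class."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# The customer table from the k-NN example, discretised into two hypothetical
# binary attributes ("young" = age < 30, "many_cards" = 3 or more cards).
rows = [
    {"young": "no",  "many_cards": "yes", "response": "No"},   # John
    {"young": "yes", "many_cards": "no",  "response": "Yes"},  # Rachel
    {"young": "no",  "many_cards": "no",  "response": "No"},   # Hannah
    {"young": "no",  "many_cards": "no",  "response": "No"},   # Tom
    {"young": "yes", "many_cards": "yes", "response": "Yes"},  # Nellie
]

best = max(["young", "many_cards"], key=lambda a: information_gain(rows, a))
print(best)  # -> young (this split separates the two classes perfectly)
```

Splitting on "young" drives the entropy of both subsets to zero, so the tree would place that test at the root; "many_cards" barely reduces the entropy at all.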

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1

Jenny Lind - Payoff Table

Decisions \ States of Nature | Small Box Office | Medium Box Office | Large Box Office
Sign with movie company      | $200,000         | $1,000,000        | $3,000,000
Sign with TV network         | $900,000         | $900,000          | $900,000
Prior probabilities          | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
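The expected-value calculation above can be written out as a short sketch:

```python
# Expected value of each contract, reproducing the slide's arithmetic.
payoff_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoff_tv = 900_000
prob = {"small": 0.3, "medium": 0.6, "large": 0.1}

ev_movie = sum(prob[s] * payoff_movie[s] for s in prob)
ev_tv = sum(prob[s] * payoff_tv for s in prob)

print(round(ev_movie), round(ev_tv))  # -> 960000 900000
best = "movie" if ev_movie > ev_tv else "tv"  # movie: $960,000 beats $900,000
```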

Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares.
  – Chance nodes, represented by circles.
  – Terminal nodes, represented by triangles (optional).
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branching into Decision 1 and Decision 2; each decision leads to a chance node over Events 1, 2 and 3.)

Jenny Lind Decision Tree

(Figure: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node over Small, Medium and Large box office, with payoffs $200,000, $1,000,000 and $3,000,000 on the movie branch and $900,000 on every TV outcome.)

Jenny Lind Decision Tree (with probabilities)

(Figure: the same tree with probabilities 0.3, 0.6 and 0.1 on the chance branches and expected-return (ER) labels at the chance nodes.)

Jenny Lind Decision Tree - Solved

(Figure: the solved tree; the movie chance node has ER = $960,000 and the TV node ER = $900,000, so the $960,000 movie branch is selected.)

Results

(Figure: the performance evaluation cycle: dataset, data preprocessing, feature selection, classification, and data mining tool selection.)

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | TP                   | FN
Actual not healthy | FP                   | TN

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces.
  – Train on 9 pieces and test on the remainder.
  – Do this for all possibilities and average.
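The fold mechanics described above can be sketched as follows; `evaluate` is a placeholder for training and testing any classifier on the given index sets.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle the n sample indices once, then deal them into k equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, evaluate):
    """Average the per-fold score of `evaluate(train_idx, test_idx)`."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds, test on the held-out fold.
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k
```

With 150 instances and k = 10, each fold holds 15 instances; a classifier that gets 143 of the 150 right corresponds to the 95.3% quoted on the slide.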

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed and usually don't grow back.

Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed, but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND

• Bellaachia et al. used naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were applied and tested in this work. The performance factors used to analyze the efficiency of the algorithms were classification accuracy and error rate. The results show that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

BACKGROUND

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine Machine Learning Repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset, to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
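The cleaning step above (dropping the rows with missing values, which the UCI file marks with "?") might be sketched like this; the local filename is an assumption, not part of the paper.

```python
import csv

def load_clean(path="breast-cancer-wisconsin.data"):  # assumed local copy of the UCI file
    """Read the UCI CSV and drop every row containing a '?' (missing value)."""
    with open(path, newline="") as f:
        rows = [row for row in csv.reader(f) if "?" not in row]
    # Row layout: sample id, nine 1-10 attributes, class (2 = benign, 4 = malignant).
    benign = sum(1 for r in rows if r[-1] == "2")
    malignant = sum(1 for r in rows if r[-1] == "4")
    return rows, benign, malignant
```

On the full file this keeps 683 of the 699 instances, giving the corrected class counts of 444 benign and 239 malignant.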

Attribute                   | Domain
Sample code number          | ID number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables

Value counts per attribute (domain values 1 - 10):

Attribute                   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness             | 139 | 50  | 104 | 79  | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size     | 373 | 45  | 52  | 38  | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape    | 346 | 58  | 53  | 43  | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion           | 393 | 58  | 58  | 33  | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size | 44  | 376 | 71  | 48  | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                 | 402 | 30  | 28  | 19  | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin             | 150 | 160 | 161 | 39  | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli             | 432 | 36  | 42  | 18  | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                     | 563 | 35  | 33  | 12  | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                         | 2843| 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation criteria              | BF Tree | IBK   | SMO
Time to build model (sec)        | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
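Applying these definitions to SMO's confusion matrix from the results (431 benign classified correctly and 13 not; 226 malignant correct and 13 not, with benign taken as the positive class) reproduces the reported figures:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO's confusion matrix, taking benign as the positive class.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # -> 0.971 0.946 0.9619
```

The 96.19% accuracy matches the comparison table, and 0.971 / 0.946 match SMO's per-class TP rates.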

EXPERIMENTAL RESULTS

Classifier | TP rate | FP rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
IBK        | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class):

Classifier | Predicted Benign | Predicted Malignant | Class
BF Tree    | 431              | 13                  | Benign
BF Tree    | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
IBK        | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
SMO        | 13               | 226                 | Malignant

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.210      | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9
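The "Average" column appears to be the plain mean of the three criteria (so the unscaled chi-squared value dominates it); recomputing that mean from the table reproduces the importance ranking, with Uniformity of Cell Size first:

```python
# Scores per attribute from the table: (chi-squared, information gain, gain ratio).
scores = {
    "Clump Thickness": (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size": (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape": (523.07097, 0.677, 0.272),
    "Marginal Adhesion": (390.0595, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei": (489.00953, 0.603, 0.303),
    "Bland Chromatin": (453.20971, 0.555, 0.201),
    "Normal Nucleoli": (416.63061, 0.487, 0.237),
    "Mitoses": (191.9682, 0.212, 0.212),
}

# Mean of the three criteria per attribute; rank attributes by it, descending.
average = {name: sum(vals) / 3 for name, vals in scores.items()}
ranking = sorted(average, key=average.get, reverse=True)
print(ranking[0])  # -> Uniformity of Cell Size
```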

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.

That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T / SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 185.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 41: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classification ModelSupport vector machine

Classifier

V Vapnik

04072023 AAST-Comp eng 41

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind – Payoff Table

Decisions                | States of Nature: Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company  | $200,000                           | $1,000,000        | $3,000,000
Sign with TV Network     | $900,000                           | $900,000          | $900,000
Prior Probabilities      | 0.3                                | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII) or EV(Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
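The two expected values above can be checked with a few lines of code (a sketch; the dictionary names are illustrative, the probabilities and payoffs are the slide's):

```python
# Prior probabilities and payoffs for each state of nature, from the slides.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000, "large": 900_000},
}

def expected_value(payoff):
    # Weight each payoff by the prior probability of its state of nature.
    return sum(probs[s] * payoff[s] for s in probs)

evs = {d: expected_value(p) for d, p in payoffs.items()}
best = max(evs, key=evs.get)
print({d: round(v) for d, v in evs.items()}, best)
# -> {'movie': 960000, 'tv': 900000} movie
```

Maximizing expected return picks the movie contract, matching the hand calculation.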

04072023 AAST-Comp eng 65

Decision Trees

• Three types of "nodes":
  – Decision nodes – represented by squares (□)
  – Chance nodes – represented by circles (○)
  – Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

[Diagram: a decision node (square) branches into Decision 1 and Decision 2; each decision leads to a chance node (circle) with Events 1, 2 and 3.]

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

[Diagram: a decision node with two branches. "Sign with Movie Co." leads to a chance node with Small ($200,000), Medium ($1,000,000) and Large ($3,000,000) box office outcomes; "Sign with TV Network" leads to a chance node with the same three outcomes, each paying $900,000.]

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

[Diagram: the same tree with the prior probabilities attached to each chance branch – Small 0.3, Medium 0.6, Large 0.1 – and an expected return (ER) to be computed at each chance node.]

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree – Solved

[Diagram: with the probabilities applied, the movie chance node has ER = $960,000 and the TV chance node has ER = $900,000; the best decision at the root is therefore ER = $960,000 – sign with the movie company.]

04072023 AAST-Comp eng 70

Results

[Diagram: performance evaluation cycle – dataset → data preprocessing → feature selection → selection of data mining tool → classification → performance evaluation.]

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

AAST-Comp eng 7204072023

Cross-validation

• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average the results

04072023 AAST-Comp eng 73
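The 10-fold procedure described above can be sketched in plain Python. The classifier here is a deliberately trivial majority-class stand-in (the 143/150 figures on the slide come from a Weka run, not from this sketch):

```python
import random

def k_fold_split(n, k=10, seed=0):
    # Shuffle the indices, then deal them into k (nearly) equal folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit_fn, predict_fn, k=10):
    folds = k_fold_split(len(xs), k)
    correct = 0
    for i in range(k):
        # Train on the other k-1 folds, test on fold i.
        train = [j for f in range(k) if f != i for j in folds[f]]
        model = fit_fn([xs[j] for j in train], [ys[j] for j in train])
        correct += sum(predict_fn(model, xs[j]) == ys[j] for j in folds[i])
    return correct / len(xs)  # accuracy averaged over all held-out folds

# Toy stand-in classifier: always predict the majority training label.
def majority_fit(xs, ys):
    return max(set(ys), key=ys.count)

def majority_predict(model, x):
    return model

xs = list(range(100))
ys = ["a"] * 70 + ["b"] * 30
print(cross_validate(xs, ys, majority_fit, majority_predict))  # -> 0.7
```

Every instance is tested exactly once, which is why the slide's 143 + 7 adds up to the full 150-instance set.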

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract
 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) shows higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction
 Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

AAST-Comp eng

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

78

0407202379

BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND
 Bellaachi et al. used naïve Bayes, a decision tree and a back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
 Vikas Chaurasia et al. used Naïve Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and a decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND
 Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

 Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-Wisconsin has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
 Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
 Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; for the cleaned dataset the distribution is benign 444 (65%) and malignant 239 (35%).
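The cleaning step described above (699 → 683 instances) amounts to dropping rows that contain the '?' placeholder the UCI file uses for missing values. A sketch on a few illustrative rows in the same layout (on the real file these rows would come from `csv.reader`):

```python
# Each row: sample id, nine 1-10 attribute values ('?' when missing), class label
# (2 = benign, 4 = malignant). These three rows are illustrative, not the real file.
rows = [
    ["1000025", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["1057013", "8", "4", "5", "1", "2", "?", "7", "3", "1", "4"],  # missing Bare Nuclei
    ["1018561", "2", "1", "2", "1", "2", "1", "3", "1", "1", "2"],
]

# Keep only complete instances, as the paper does for its 683-instance dataset.
complete = [row for row in rows if "?" not in row]
benign = sum(row[-1] == "2" for row in complete)
malignant = sum(row[-1] == "4" for row in complete)
print(len(complete), benign, malignant)  # -> 2 2 0
```

On the real file the same filter removes exactly the 16 incomplete instances, leaving 444 benign and 239 malignant.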

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute                   | Domain
Sample Code Number          | Id number
Clump Thickness             | 1–10
Uniformity of Cell Size     | 1–10
Uniformity of Cell Shape    | 1–10
Marginal Adhesion           | 1–10
Single Epithelial Cell Size | 1–10
Bare Nuclei                 | 1–10
Bland Chromatin             | 1–10
Normal Nucleoli             | 1–10
Mitoses                     | 1–10
Class                       | 2 for benign, 4 for malignant

0407202387

EVALUATION METHODS
 We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain                      |    1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             |  139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     |  373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    |  346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           |  393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |   44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 |  402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             |  150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             |  432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     |  563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in sec)     | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.

04072023 AAST-Comp eng 92

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.96      | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.98    | 0.079   | 0.958     | 0.98   | Benign
IBK        | 0.921   | 0.02    | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
BF Tree    | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
IBK        | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
SMO        | 13     | 226       | Malignant

94 04072023AAST-Comp eng
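Plugging SMO's confusion matrix from the table above into the sensitivity, specificity and accuracy definitions (benign taken as the positive class) reproduces the reported 96.19% accuracy:

```python
# SMO confusion matrix (benign = positive class): 431 benign correct, 13 benign
# misclassified, 13 malignant misclassified, 226 malignant correct.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
# -> 0.971 0.946 0.9619
```

The sensitivity and specificity also match the SMO TP rates (0.971 benign, 0.946 malignant) reported in the precision/recall table.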

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Importance Rank
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9
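The Info Gain column above comes from Weka's attribute evaluator; the underlying quantity is easy to compute by hand. A sketch of the entropy/information-gain arithmetic on the cleaned class distribution (444 benign, 239 malignant); the two-branch split at the end is hypothetical, for illustration only:

```python
import math

def entropy(counts):
    # Shannon entropy (in bits) of a class distribution given as raw counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, branches):
    # Class entropy minus the size-weighted entropy of each branch's distribution.
    total = sum(parent)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted

h = entropy([444, 239])          # class entropy of the cleaned dataset
print(round(h, 3))               # -> 0.934

# Hypothetical binary split of the 683 instances, for illustration only.
print(round(info_gain([444, 239], [[400, 39], [44, 200]]), 3))
```

An attribute whose values separate benign from malignant well yields branches with low entropy, and therefore high information gain, which is what ranks Uniformity of Cell Size first.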

04072023AAST-Comp eng96

0407202397

CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work
 Using an updated version of Weka
 Using another data mining tool
 Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper
 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions

04072023AAST-Comp eng99

comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (ISSN 2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea, making a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).

[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

AAST-Comp eng 102

[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.

04072023

[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3), 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA, 185.

04072023

04072023105

Thank you

AAST-Comp eng

Page 42: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Support Vector Machine (SVM) SVM is a state-of-the-art learning machine

which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

04072023AAST-Comp eng42

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification function approximation etc

due to its generalization ability and has found a great deal of success in many applications

Unlike traditional methods which minimizing the empirical training error a noteworthy feature of SVM is that it minimize an upper bound of the generalization error through maximizing the margin between the separating hyper-plane and a data set

Tennis example

Humidity

Temperature

= play tennis= do not play tennis

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 43: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Support Vector Machine (SVM)

04072023AAST-Comp eng43

SVM is a state-of-the-art learning machine which has been extensively used as a tool for data classification, function approximation, etc.,

owing to its generalization ability, and it has found a great deal of success in many applications.

Unlike traditional methods, which minimize the empirical training error, a noteworthy feature of SVM is that it minimizes an upper bound on the generalization error by maximizing the margin between the separating hyper-plane and the data.
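The margin idea can be made concrete with a small sketch (plain Python; the hyperplane coefficients here are made-up illustration values, not learned from data). For a separating hyperplane ax + by − c = 0 scaled so that |ax + by − c| = 1 on the closest points (the support vectors), the margin width is 2 / ||(a, b)||, so maximizing the margin amounts to minimizing the weight norm:

```python
import math

# Hypothetical hyperplane a*x + b*y - c = 0 in canonical form:
# |a*x + b*y - c| = 1 holds for the support vectors.
a, b, c = 3.0, 4.0, 1.0

def decision(x, y):
    # Signed score: positive on one side of the boundary, negative on the other.
    return a * x + b * y - c

def classify(x, y):
    return "play tennis" if decision(x, y) >= 0 else "do not play tennis"

# Distance between the supporting hyperplanes a*x + b*y - c = +1 and -1:
margin = 2.0 / math.hypot(a, b)
print(margin)  # 0.4 -- a smaller weight norm gives a wider margin
```

Training an SVM amounts to choosing a, b, c from the data so that this margin is as large as possible while the classes stay on their own sides.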

Tennis example

(Scatter plot of Humidity vs. Temperature; each point is labeled either "play tennis" or "do not play tennis".)

04072023 AAST-Comp eng 44

Linear classifiers Which Hyperplane

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• Support Vector Machine (SVM) finds an optimal solution:
  – Maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

This line represents the decision boundary: ax + by − c = 0

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective: select a 'good' hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) Separate the data
(ii) Place the hyper-plane 'far' from the data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

48

(Figure: support vectors on the margin; the maximal-margin hyperplane vs. one with a narrower margin. Sec. 15.1.)

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions

04072023 AAST-Comp eng 51

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the (K) examples that are most similar to it.

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

(Figure: training points labeled "Response" / "No response"; the new example is assigned the majority label, Response, of its K nearest neighbors.)

Distance Between Neighbors

Each example is represented with a set of numerical attributes.
"Closeness" is defined in terms of the Euclidean distance between two examples: the Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is

D(X, Y) = sqrt( (x1 - y1)² + (x2 - y2)² + ... + (xn - yn)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 - 41)² + (95K - 215K)² + (3 - 2)² ]

04072023 AAST-Comp eng 55

Instance-Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: stored examples labeled "Response" / "No response"; the new point takes the majority label of its neighbors.)

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John | 35 | 35K | 3 | No
Rachel | 22 | 50K | 2 | Yes
Hannah | 63 | 200K | 1 | No
Tom | 59 | 170K | 1 | No
Nellie | 25 | 40K | 4 | Yes
David | 37 | 50K | 2 | ?

Customer | Age | Income (K) | No. cards | Response | Distance from David
John | 35 | 35 | 3 | No | sqrt[(35-37)² + (35-50)² + (3-2)²] = 15.17
Rachel | 22 | 50 | 2 | Yes | sqrt[(22-37)² + (50-50)² + (2-2)²] = 15.00
Hannah | 63 | 200 | 1 | No | sqrt[(63-37)² + (200-50)² + (1-2)²] = 152.24
Tom | 59 | 170 | 1 | No | sqrt[(59-37)² + (170-50)² + (1-2)²] = 122.00
Nellie | 25 | 40 | 4 | Yes | sqrt[(25-37)² + (40-50)² + (4-2)²] = 15.75
David | 37 | 50 | 2 | ? | 3 nearest: Rachel, John, Nellie, so the majority class is Yes

04072023 AAST-Comp eng 58
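The 3-NN classification of David can be reproduced with a short, self-contained sketch (plain Python; the names and figures come from the example table above):

```python
import math
from collections import Counter

# (age, income in K, number of credit cards) -> response
training = {
    "John":   ((35, 35, 3),  "No"),
    "Rachel": ((22, 50, 2),  "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4),  "Yes"),
}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, k=3):
    # Rank all stored examples by distance to the query point...
    ranked = sorted(training.values(), key=lambda ex: euclidean(query, ex[0]))
    # ...and take a majority vote among the k nearest labels.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
print(knn_predict(david))  # Yes: nearest are Rachel (15.0), John (15.17), Nellie (15.75)
```

Note that no model is fitted in advance; all work happens at query time, which is exactly the instance-based behavior described above.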

Strengths and Weaknesses
Strengths:
• Simple to implement and use
• Comprehensible – easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

– Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decision \ State of Nature | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000 | $1,000,000 | $3,000,000
Sign with TV Network | $900,000 | $900,000 | $900,000
Prior Probabilities | 0.3 | 0.6 | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.

04072023 AAST-Comp eng 65
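The expected-value comparison above is easy to check in a few lines (plain Python; the figures come from the payoff table):

```python
# Expected-value check for the Jenny Lind example (figures from the payoff table).
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoffs = {
    "movie": {"small": 200_000, "medium": 1_000_000, "large": 3_000_000},
    "tv":    {"small": 900_000, "medium": 900_000,   "large": 900_000},
}

def expected_value(decision):
    # Probability-weighted payoff over the states of nature.
    return sum(probs[s] * payoffs[decision][s] for s in probs)

best = max(payoffs, key=expected_value)
for d in payoffs:
    print(d, expected_value(d))   # movie ~ 960000, tv ~ 900000
print("choose:", best)            # choose: movie
```

This is exactly what "solving the tree from right to left" does: compute expected values at the chance nodes, then keep the best branch at the decision node.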

Decision Trees
• Three types of "nodes":
  – Decision nodes: represented by squares (□)
  – Chance nodes: represented by circles (○)
  – Terminal nodes: represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

(Each chance branch carries its probability, 0.3, 0.6 or 0.1; the expected returns ER at the chance nodes are still to be computed.)

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

(With branch probabilities 0.3, 0.6 and 0.1: ER = $900,000 at the TV node and ER = $960,000 at the movie node, so the solved tree selects the movie contract with ER = $960,000.)

04072023 AAST-Comp eng 70

Results

(Cycle: dataset → data preprocessing → feature selection → data mining tool selection → classification → performance evaluation.)

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

75

0407202376

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

78

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.

• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

AAST-Comp eng

04072023

BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.

• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

80 AAST-Comp eng

04072023

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability of heart disease patients.

• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.

• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

81 AAST-Comp eng

04072023

BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results illustrate that the logistic classification function's efficiency is better than that of multilayer perceptron and sequential minimal optimization.

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.

• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages no longer hold; for the 683-instance dataset the correct distribution is benign 444 (65%) and malignant 239 (35%).

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute | Domain
Sample Code Number | Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bland Chromatin 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria | BF Tree | IBK | SMO
Time to build model (in sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46 | 95.90 | 96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
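Applying these definitions to the SMO confusion matrix reported below (treating benign as the positive class, which is an assumption of this sketch) reproduces the reported 96.19% accuracy:

```python
# SMO confusion matrix from the experiments (positive = benign, an assumption):
# 431 benign correctly predicted, 13 benign misclassified,
# 13 malignant misclassified, 226 malignant correctly predicted.
tp, fn = 431, 13   # actual benign row
fp, tn = 13, 226   # actual malignant row

sensitivity = tp / (tp + fn)               # TPR
specificity = tn / (tn + fp)               # TNR
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3))  # 0.971
print(round(specificity, 3))  # 0.946
print(round(accuracy, 4))     # 0.9619
```

These match the SMO rows of the per-class results table and the 96.19% accuracy figure.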

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree | 0.971 | 0.075 | 0.96 | 0.971 | Benign
BF Tree | 0.925 | 0.029 | 0.944 | 0.925 | Malignant
IBK | 0.98 | 0.079 | 0.958 | 0.98 | Benign
IBK | 0.921 | 0.02 | 0.961 | 0.921 | Malignant
SMO | 0.971 | 0.054 | 0.971 | 0.971 | Benign
SMO | 0.946 | 0.029 | 0.946 | 0.946 | Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTS (confusion matrices; rows = actual class, columns = predicted class)

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree | 431 | 13 | Benign
BF Tree | 18 | 221 | Malignant
IBK | 435 | 9 | Benign
IBK | 19 | 220 | Malignant
SMO | 431 | 13 | Benign
SMO | 13 | 226 | Malignant

94 04072023AAST-Comp eng
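As a cross-check, the BF Tree precision and recall figures in the earlier table follow directly from its confusion matrix (plain Python):

```python
# BF Tree confusion matrix: rows = actual class, (predicted benign, predicted malignant).
conf = {"benign":    (431, 13),
        "malignant": (18, 221)}

benign_correct, benign_wrong = conf["benign"]
malig_wrong, malig_correct = conf["malignant"]

precision_benign = benign_correct / (benign_correct + malig_wrong)
recall_benign = benign_correct / (benign_correct + benign_wrong)
precision_malig = malig_correct / (malig_correct + benign_wrong)
recall_malig = malig_correct / (malig_correct + malig_wrong)

print(round(precision_benign, 2))  # 0.96
print(round(recall_benign, 3))     # 0.971
print(round(precision_malig, 3))   # 0.944
print(round(recall_malig, 3))      # 0.925
```

Precision divides by a column of the matrix (all predictions of a class), recall by a row (all actual members of a class).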

importance of the input variables

04072023AAST-Comp eng95

Variable | Chi-squared | Info Gain | Gain Ratio | Average | Importance (rank)
Clump Thickness | 378.08158 | 0.464 | 0.152 | 126.2325 | 8
Uniformity of Cell Size | 539.79308 | 0.702 | 0.3 | 180.2650 | 1
Uniformity of Cell Shape | 523.07097 | 0.677 | 0.272 | 174.6733 | 2
Marginal Adhesion | 390.0595 | 0.464 | 0.21 | 130.2445 | 7
Single Epithelial Cell Size | 447.86118 | 0.534 | 0.233 | 149.5427 | 5
Bare Nuclei | 489.00953 | 0.603 | 0.303 | 163.3052 | 3
Bland Chromatin | 453.20971 | 0.555 | 0.201 | 151.3219 | 4
Normal Nucleoli | 416.63061 | 0.487 | 0.237 | 139.1182 | 6
Mitoses | 191.9682 | 0.212 | 0.212 | 64.1227 | 9

04072023AAST-Comp eng96
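The "Average" column in the ranking table appears to be the plain mean of the three merit scores; this is an inference from the numbers, not something the paper states. For Uniformity of Cell Size:

```python
# Merit scores for Uniformity of Cell Size (from the ranking table).
# Assumption: the table's "Average" is the mean of the three merits.
chi_squared, info_gain, gain_ratio = 539.79308, 0.702, 0.3

average_merit = (chi_squared + info_gain + gain_ratio) / 3
print(round(average_merit, 4))  # 180.265, matching the table's "Average" entry
```

The attribute with the largest average merit, Uniformity of Cell Size, is ranked 1, consistent with the conclusion slide.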

0407202397

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

AAST-Comp eng

0407202398

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

AAST-Comp eng

Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

04072023AAST-Comp eng99

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (ISSN 2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, in Egypt (30 March - 1 April 2005).

[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

AAST-Comp eng 102

[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.

04072023

[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting in Palm Desert, California, November 14, 1994.

AAST-Comp eng 103

AAST-Comp eng 104

[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 44: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Tennis example

Figure: a scatter plot of Temperature vs. Humidity, in which each point is labeled either "play tennis" or "do not play tennis".

Linear classifiers: Which Hyperplane?

• Lots of possible solutions for a, b, c.
• Some methods find a separating hyperplane, but not the optimal one.
• The Support Vector Machine (SVM) finds an optimal solution:
  – It maximizes the distance between the hyperplane and the "difficult points" close to the decision boundary.
  – One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Figure: two classes of points separated by a line; the line ax + by − c = 0 represents the decision boundary.

Selection of a Good Hyper-Plane

Objective: select a "good" hyper-plane using only the data.
Intuition (Vapnik, 1965), assuming linear separability:
(i) separate the data;
(ii) place the hyper-plane "far" from the data.

SVM – Support Vector Machines

Figure: two separating hyperplanes, one with a small margin and one with a large margin; the points lying on the margins are the support vectors.

Support Vector Machine (SVM)

• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples: the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.

Figure: the support vectors sit on the maximized margin; any other separating hyperplane has a narrower margin.

Non-Separable Case

Figure: two classes that cannot be separated by a single line; handled via the Lagrangian trick.

SVM

• Relatively new concept.
• Nice generalization properties.
• Hard to learn: learned in batch mode using quadratic programming techniques.
• Using kernels, SVMs can learn very complex functions.
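The margin-maximization idea above can be sketched on a toy 2-D dataset. This is a minimal illustration using scikit-learn, which is an assumption here — the paper itself runs its SVM (SMO) inside Weka:

```python
from sklearn.svm import SVC

# Four 2-D points, linearly separable into two classes.
X = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]]
y = [0, 0, 1, 1]

# A linear-kernel SVM solves a quadratic program that maximizes the
# margin around the separating hyperplane.
clf = SVC(kernel="linear", C=1000.0).fit(X, y)

# The decision function is fully specified by the support vectors --
# here the two "difficult" points closest to the boundary, (1,1) and (3,3).
print(clf.support_vectors_)
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))  # -> [0 1]
```

Points on either side of the maximum-margin boundary (which passes through (2, 2) for this data) are classified accordingly.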

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

• Learning by analogy: "Tell me who your friends are and I'll tell you who you are."
• A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples nearest to E in the training set.
• Assign E to the most common class among its K nearest neighbors.

Figure: a new point surrounded by "Response" and "No response" neighbors is assigned the class "Response".

Distance Between Neighbors

• Each example is represented by a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

  D(X, Y) = sqrt( sum_{i=1}^{n} (x_i − y_i)^2 )

• Example: John (Age = 35, Income = 95K, No. of credit cards = 3) and Rachel (Age = 41, Income = 215K, No. of credit cards = 2):

  Distance(John, Rachel) = sqrt[ (35 − 41)^2 + (95 − 215)^2 + (3 − 2)^2 ]

Instance-Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

Figure: a new point is classified as "Response" based on its stored neighbors.

Example: 3-Nearest Neighbors

Customer | Age | Income | No. of credit cards | Response
John     | 35  | 35K    | 3 | No
Rachel   | 22  | 50K    | 2 | Yes
Hannah   | 63  | 200K   | 1 | No
Tom      | 59  | 170K   | 1 | No
Nellie   | 25  | 40K    | 4 | Yes
David    | 37  | 50K    | 2 | ?

Customer | Age | Income (K) | No. of cards | Response | Distance from David
John     | 35  | 35  | 3 | No  | sqrt[(35−37)^2 + (35−50)^2 + (3−2)^2] = 15.16
Rachel   | 22  | 50  | 2 | Yes | sqrt[(22−37)^2 + (50−50)^2 + (2−2)^2] = 15
Hannah   | 63  | 200 | 1 | No  | sqrt[(63−37)^2 + (200−50)^2 + (1−2)^2] = 152.23
Tom      | 59  | 170 | 1 | No  | sqrt[(59−37)^2 + (170−50)^2 + (1−2)^2] = 122
David    | 37  | 50  | 2 | Yes (predicted) |
Nellie   | 25  | 40  | 4 | Yes | sqrt[(25−37)^2 + (40−50)^2 + (4−2)^2] = 15.74
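The worked 3-NN vote for David can be reproduced directly; a minimal sketch in plain Python (no library assumed beyond the standard one):

```python
from math import sqrt
from collections import Counter

# Training examples from the table: (age, income in K, no. of credit cards) -> response
train = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]
david = (37, 50, 2)

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Instance-based: no model is built; just rank the stored examples by distance.
nearest = sorted(train, key=lambda ex: euclidean(ex[0], david))[:3]
# Three nearest: Rachel (15.0), John (~15.17), Nellie (~15.74) -> votes Yes, No, Yes.
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)  # -> Yes
```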

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain a prediction.
• Robust to noisy data, by averaging over the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to every stored example must be calculated and compared).

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decisions               | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV_Best
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
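The two expected values above can be checked in a few lines (a minimal sketch; all figures come from the payoff table):

```python
# Payoffs and prior probabilities taken from the Jenny Lind payoff table.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payoff):
    # EV = sum over states of nature of P(state) * payoff(state)
    return sum(probs[s] * payoff[s] for s in probs)

ev_movie = expected_value(movie)   # ~ $960,000
ev_tv = expected_value(tv)         # ~ $900,000
best = max([("movie", ev_movie), ("tv", ev_tv)], key=lambda kv: kv[1])
print(best[0])  # -> movie
```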

Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

Figure: a decision node (square) branches into Decision 1 and Decision 2; each leads to a chance node (circle) branching into Event 1, Event 2, and Event 3.

Jenny Lind Decision Tree

Figure: a decision node with two branches. "Sign with Movie Co." leads to a chance node with outcomes Small Box Office ($200,000), Medium Box Office ($1,000,000), and Large Box Office ($3,000,000). "Sign with TV Network" leads to a chance node paying the flat $900,000 for every outcome.

Jenny Lind Decision Tree

Figure: the same tree annotated with probabilities 0.3, 0.6, and 0.1 on the Small, Medium, and Large Box Office branches of each chance node, and an expected return (ER) to be computed at each chance node.

Jenny Lind Decision Tree – Solved

Figure: the solved tree. The movie chance node has ER = $960,000 and the TV chance node has ER = $900,000, so the decision node carries ER = $960,000: sign with the movie company.

Results

Performance evaluation cycle: Dataset → Data preprocessing → Feature selection → Classification (with a data mining selection tool) → Performance evaluation.

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces.
  – Train on 9 pieces and test on the remainder.
  – Do this for all possibilities and average.
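The 10-fold procedure above can be sketched as follows. This is a hedged illustration in scikit-learn rather than Weka (which the paper uses), and it loads the related Wisconsin *diagnostic* breast cancer dataset that ships with scikit-learn, not the original 9-attribute breast-cancer-wisconsin file:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 10-fold cross-validation: split into 10 equal pieces, train on 9,
# test on the remainder, repeat for every fold, then average.
model = make_pipeline(StandardScaler(), SVC())  # an SVM, analogous to Weka's SMO
scores = cross_val_score(model, X, y, cv=10)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```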

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  – Are usually not harmful.
  – Rarely invade the tissues around them.
  – Don't spread to other parts of the body.
  – Can be removed and usually don't grow back.
• Malignant tumors:
  – May be a threat to life.
  – Can invade nearby organs and tissues (such as the chest wall).
  – Can spread to other parts of the body.
  – Often can be removed, but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)

• Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)

• Dr. S. Vijayarani et al. analyzed the performance of different classification-function techniques in data mining for predicting heart disease from a heart disease dataset. The classification-function algorithms were used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository. Data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the right ones are benign 444 (65%) and malignant 239 (35%).
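The corrected class percentages in the note above are easy to verify (a quick check, not taken from the paper's own code):

```python
# Class counts after removing the 16 instances with missing values.
benign, malignant = 444, 239
total = benign + malignant

print(total)                              # 683
print(round(100 * benign / total, 1))     # 65.0 (not 65.5)
print(round(100 * malignant / total, 1))  # 35.0 (not 34.5)
```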

Attribute                   | Domain
Sample Code Number          | ID number
Clump Thickness             | 1–10
Uniformity of Cell Size     | 1–10
Uniformity of Cell Shape    | 1–10
Marginal Adhesion           | 1–10
Single Epithelial Cell Size | 1–10
Bare Nuclei                 | 1–10
Bland Chromatin             | 1–10
Normal Nucleoli             | 1–10
Mitoses                     | 1–10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

• We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables: value frequencies per attribute

Attribute                   | 1   | 2   | 3   | 4  | 5   | 6  | 7  | 8  | 9  | 10  | Sum
Clump Thickness             | 139 | 50  | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69  | 683
Uniformity of Cell Size     | 373 | 45  | 52  | 38 | 30  | 25 | 19 | 28 | 6  | 67  | 683
Uniformity of Cell Shape    | 346 | 58  | 53  | 43 | 32  | 29 | 30 | 27 | 7  | 58  | 683
Marginal Adhesion           | 393 | 58  | 58  | 33 | 23  | 21 | 13 | 25 | 4  | 55  | 683
Single Epithelial Cell Size | 44  | 376 | 71  | 48 | 39  | 40 | 11 | 21 | 2  | 31  | 683
Bare Nuclei                 | 402 | 30  | 28  | 19 | 30  | 4  | 8  | 21 | 9  | 132 | 683
Bland Chromatin             | 150 | 160 | 161 | 39 | 34  | 9  | 71 | 28 | 11 | 20  | 683
Normal Nucleoli             | 432 | 36  | 42  | 18 | 19  | 22 | 16 | 23 | 15 | 60  | 683
Mitoses                     | 563 | 35  | 33  | 12 | 6   | 3  | 9  | 8  | 0  | 14  | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

(The original slide listed "Bare Nuclei" twice; the second row is Bland Chromatin, the seventh attribute of the dataset.)

EXPERIMENTAL RESULTS

Evaluation criteria              | BF Tree | IBK   | SMO
Time to build model (s)          | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS (cont.)

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS (cont.)

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree    | 0.971   | 0.075   | 0.960     | 0.971  | Benign
BF Tree    | 0.925   | 0.029   | 0.944     | 0.925  | Malignant
IBK        | 0.980   | 0.079   | 0.958     | 0.980  | Benign
IBK        | 0.921   | 0.020   | 0.961     | 0.921  | Malignant
SMO        | 0.971   | 0.054   | 0.971     | 0.971  | Benign
SMO        | 0.946   | 0.029   | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS: confusion matrices (rows = actual class, columns = predicted class)

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
BF Tree    | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
IBK        | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
SMO        | 13     | 226       | Malignant
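Plugging the SMO confusion matrix into the definitions of sensitivity, specificity, and accuracy given earlier (treating Benign as the positive class) reproduces the reported figures:

```python
# SMO confusion matrix, Benign taken as the positive class.
tp, fn = 431, 13   # benign correctly / wrongly classified
fp, tn = 13, 226   # malignant wrongly / correctly classified

sensitivity = tp / (tp + fn)                # TPR = TP / (TP + FN)
specificity = tn / (tn + fp)                # TNR = TN / (TN + FP)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # (TP + TN) / all samples

print(round(sensitivity, 3))      # 0.971
print(round(specificity, 3))      # 0.946
print(round(100 * accuracy, 2))   # 96.19
```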

Importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Rank (importance)
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.300      | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.67332  | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.210      | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.32190  | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.11820  | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9
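As a small aside on how the Info Gain column is grounded: an attribute's information gain is measured against the entropy of the class distribution. A sketch of that baseline entropy, assuming the corrected 444/239 benign/malignant split reported earlier:

```python
from math import log2

# Class distribution of the cleaned dataset: 444 benign, 239 malignant.
counts = [444, 239]
total = sum(counts)

# Entropy of the class label; an attribute's information gain is this
# value minus the weighted entropy after splitting on that attribute.
entropy = -sum((c / total) * log2(c / total) for c in counts)
print(round(entropy, 2))  # 0.93 bits
```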

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on the paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277–0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA. 185.

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 45: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Linear classifiers Which Hyperplane

bull Lots of possible solutions for a b cbull Some methods find a separating hyperplane

but not the optimal one bull Support Vector Machine (SVM) finds an

optimal solutionndash Maximizes the distance between the

hyperplane and the ldquodifficult pointsrdquo close to decision boundary

ndash One intuition if there are no points near the decision surface then there are no very uncertain classification decisions

45

This line represents the

decision boundary

ax + by minus c = 0

Ch 15

04072023 AAST-Comp eng

Selection of a Good Hyper-Plane

Objective Select a `good hyper-plane usingonly the dataIntuition (Vapnik 1965) - assuming linear separability(i) Separate the data(ii) Place hyper-plane `far from data

04072023 AAST-Comp eng 46

SVM ndash Support Vector Machines

Support VectorsSmall Margin Large Margin

04072023 AAST-Comp eng 47

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Trees
• Three types of "nodes":
  - Decision nodes: represented by squares (□)
  - Chance nodes: represented by circles (○)
  - Terminal nodes: represented by triangles (optional)

• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.

• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

[Diagram: a decision node (square) branching into Decision 1 and Decision 2; Decision 1 leads to a chance node (circle) branching into Event 1, Event 2, and Event 3.]

Jenny Lind Decision Tree

[Diagram: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with Small / Medium / Large Box Office branches, ending at payoffs of $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 on every branch of the TV contract.]

Jenny Lind Decision Tree

[Diagram: the same tree, annotated with branch probabilities 0.3 / 0.6 / 0.1 on the Small / Medium / Large Box Office branches, and an expected return (ER) to be computed at each chance node.]

Jenny Lind Decision Tree - Solved

[Diagram: the solved tree - ER = $960,000 at the movie chance node and ER = $900,000 at the TV chance node, so the decision node takes the movie branch with ER = $960,000.]

Results

[Diagram: the data mining performance-evaluation cycle - dataset, data preprocessing, feature selection, classification, data mining tool selection, performance evaluation.]

Evaluation Metrics

                      Predicted as healthy   Predicted as unhealthy
Actual healthy        tp                     fn
Actual not healthy    fp                     tn

Cross-validation

• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques and to develop accurate prediction models for breast cancer using data mining.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.

Introduction
• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  - Are usually not harmful
  - Rarely invade the tissues around them
  - Don't spread to other parts of the body
  - Can be removed and usually don't grow back
• Malignant tumors:
  - May be a threat to life
  - Can invade nearby organs and tissues (such as the chest wall)
  - Can spread to other parts of the body
  - Often can be removed, but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
• Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work; the performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was evaluated on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign: 458 (65.5%); Malignant: 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
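The corrected class distribution in the note can be verified with a few lines of arithmetic, using only the counts stated on the slide:

```python
total = 699
benign, malignant = 458, 241
assert benign + malignant == total

# After removing the 16 instances with missing values (14 benign, 2 malignant):
benign_clean, malignant_clean = benign - 14, malignant - 2
total_clean = total - 16
print(total_clean)                                    # 683
print(round(100 * benign_clean / total_clean, 1))     # 65.0
print(round(100 * malignant_clean / total_clean, 1))  # 35.0
```

This confirms the corrected figures of benign 444 (65%) and malignant 239 (35%) on the 683-instance dataset.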

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Slides 88 and 89: experimental result figures.]

Importance of the input variables

Domain                         1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

(Note: the original table listed "Bare Nuclei" twice; the second of those rows corresponds to Bland Chromatin, the attribute otherwise missing from the list.)

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (sec)          0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
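The metric definitions above can be checked directly against the SMO confusion matrix (431 / 13 / 13 / 226, taking Benign as the positive class); a small sketch:

```python
# SMO confusion matrix, with Benign as the positive class
tp, fn = 431, 13   # actual Benign:    correctly / wrongly predicted
fp, tn = 13, 226   # actual Malignant: wrongly / correctly predicted

sensitivity = tp / (tp + fn)                 # TPR = TP / (TP + FN)
specificity = tn / (tn + fp)                 # TNR = TN / (TN + FP)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # (TP + TN) / total

print(round(sensitivity, 3))  # 0.971
print(round(specificity, 3))  # 0.946
print(round(accuracy, 4))     # 0.9619
```

These values reproduce SMO's Benign recall (0.971), Malignant recall (0.946), and overall accuracy (96.19%) from the result tables above.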

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average       Rank
Clump Thickness               378.08158     0.464       0.152        126.2325266    8
Uniformity of Cell Size       539.79308     0.702       0.300        180.2650266    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.6733233    2
Marginal Adhesion             390.05950     0.464       0.210        130.2445000    7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.5427267    5
Bare Nuclei                   489.00953     0.603       0.303        163.3051767    3
Bland Chromatin               453.20971     0.555       0.201        151.3219033    4
Normal Nucleoli               416.63061     0.487       0.237        139.1182033    6
Mitoses                       191.96820     0.212       0.212        64.1227333     9

(The Average column is the mean of the three scores; Rank orders the attributes by that average, with Uniformity of Cell Size ranked most important.)
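Of the three ranking criteria in the table, information gain is the easiest to sketch; below is a toy Python version for a binary class and a single discrete attribute (illustrative only - the paper's numbers come from Weka's attribute evaluators on the full dataset):

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_c p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    # IG(Y; A) = H(Y) - sum_v P(A = v) * H(Y | A = v)
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: an attribute value that perfectly separates the two classes
attr = [1, 1, 1, 10, 10, 10]
cls = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
print(info_gain(attr, cls))  # 1.0 (one full bit of information)
```

A perfectly separating attribute yields the maximum gain of 1 bit for a balanced binary class; real attributes such as Uniformity of Cell Size (info gain 0.702 in the table) fall below that ceiling.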

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, performing a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003, 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, S.P. Rajagopalan, and L.V. Nandakishore (2011). "Knowledge based analysis of various statistical tools in detecting breast cancer."
[5] Angeline Christobel Y. and Sivaprakasam (2011). "An Empirical Comparison of Data Mining Classification Methods." International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya and K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009). "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets." 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. "Transductive inference for text classification using support vector machines." Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. and Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, and Olvi L. Mangasarian. "Computerized breast cancer diagnosis and prognosis from fine needle aspirates." Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W.N., Wolberg, W.H., and Mangasarian, O.L. "Nuclear feature extraction for breast tumor diagnosis." Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., and Yang, B. (2006). "Feature Selection and Classification using Flexible Neural Tree." Journal of Neurocomputing, 70(1-3), 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York, 1999.
[19] Vapnik, V.N. "The Nature of Statistical Learning Theory." 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). "C4.5: Programs for Machine Learning." Morgan Kaufmann Publishers, San Mateo, CA.

Thank you



Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree   IBK     SMO
Time to Build Model (sec)           0.97      0.02    0.33
Correctly Classified Instances      652       655     657
Incorrectly Classified Instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19

EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
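These definitions can be applied directly to the BF Tree confusion matrix reported in the results (431, 13, 18, 221, taking benign as the positive class); a minimal sketch:

```python
# BF Tree confusion matrix from the paper, benign as the positive class.
tp, fn, fp, tn = 431, 13, 18, 221

sensitivity = tp / (tp + fn)               # true positive rate
specificity = tn / (tn + fp)               # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)
```

The computed values reproduce the paper's tables: sensitivity 0.971, specificity 0.925, accuracy 95.46%.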

EXPERIMENTAL RESULTS

Classifier  TP Rate  FP Rate  Precision  Recall  Class
BF Tree     0.971    0.075    0.96       0.971   Benign
            0.925    0.029    0.944      0.925   Malignant
IBK         0.98     0.079    0.958      0.98    Benign
            0.921    0.02     0.961      0.921   Malignant
SMO         0.971    0.054    0.971      0.971   Benign
            0.946    0.029    0.946      0.946   Malignant

EXPERIMENTAL RESULTS

Classifier  Predicted Benign  Predicted Malignant  Actual Class
BF Tree     431               13                   Benign
            18                221                  Malignant
IBK         435               9                    Benign
            19                220                  Malignant
SMO         431               13                   Benign
            13                226                  Malignant

Importance of the input variables

Variable                     Chi-squared  Info Gain  Gain Ratio  Average Rank  Importance
Clump Thickness              378.08158    0.464      0.152       126.232526    8
Uniformity of Cell Size      539.79308    0.702      0.300       180.265026    1
Uniformity of Cell Shape     523.07097    0.677      0.272       174.673323    2
Marginal Adhesion            390.0595     0.464      0.210       130.2445      7
Single Epithelial Cell Size  447.86118    0.534      0.233       149.542726    5
Bare Nuclei                  489.00953    0.603      0.303       163.305176    3
Bland Chromatin              453.20971    0.555      0.201       151.32190     4
Normal Nucleoli              416.63061    0.487      0.237       139.11820     6
Mitoses                      191.9682     0.212      0.212       64.122733     9
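The "Average Rank" column is the mean of the three scores, and taking that mean reproduces the importance ordering; a quick check (scores as read from the table):

```python
# Feature-importance scores from the paper's table:
# (chi-squared, info gain, gain ratio) per attribute.
scores = {
    "Clump Thickness": (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size": (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape": (523.07097, 0.677, 0.272),
    "Marginal Adhesion": (390.0595, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei": (489.00953, 0.603, 0.303),
    "Bland Chromatin": (453.20971, 0.555, 0.201),
    "Normal Nucleoli": (416.63061, 0.487, 0.237),
    "Mitoses": (191.9682, 0.212, 0.212),
}

# "Average Rank" is the mean of the three scores.
avg = {name: sum(vals) / 3 for name, vals in scores.items()}
# Rank attributes from most to least important.
ranking = sorted(avg, key=avg.get, reverse=True)
```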

CONCLUSION
The accuracy of classification techniques is evaluated based on the selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree. The performance of SMO shows a high level compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, I.A.f.R.o.C. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of computer vision and pattern recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the international conference on engineering applications of neural networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 47: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

SVM - Support Vector Machines
(Figure: support vectors, showing a small margin vs. a large margin.)

Support Vector Machine (SVM)
• SVMs maximize the margin around the separating hyperplane.
• The decision function is fully specified by a subset of the training samples, the support vectors.
• Solving SVMs is a quadratic programming problem.
• Seen by many as the most successful current text classification method.
(Figure: support vectors on a maximized margin vs. a narrower margin.)

Non-Separable Case
(Figure: the Lagrangian trick.)

SVM
• A relatively new concept.
• Nice generalization properties.
• Hard to learn - learned in batch mode using quadratic programming techniques.
• Using kernels, can learn very complex functions.
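As an illustrative stand-in for the quadratic-programming training described above, a soft-margin linear SVM can be sketched with plain hinge-loss subgradient descent on a toy 2-D dataset (the data, learning rate, and iteration count here are arbitrary choices, not from the paper):

```python
# Toy soft-margin linear SVM trained by hinge-loss subgradient descent.
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]  # linearly separable labels

w, b = [0.0, 0.0], 0.0
lr, lam = 0.01, 0.01  # learning rate, L2 regularization strength
for _ in range(2000):
    for (x1, x2), t in zip(X, y):
        margin = t * (w[0] * x1 + w[1] * x2 + b)
        if margin < 1:  # point inside the margin: hinge loss is active
            w[0] += lr * (t * x1 - lam * w[0])
            w[1] += lr * (t * x2 - lam * w[1])
            b += lr * t
        else:           # point outside the margin: only shrink the weights
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]

preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for x1, x2 in X]
```

On this separable toy set the learned hyperplane classifies every training point correctly; a real solver (such as the SMO algorithm used later in the paper) solves the same objective as a constrained QP instead.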

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier
Learning by analogy: "Tell me who your friends are and I'll tell you who you are." A new example is assigned to the most common class among the (K) examples that are most similar to it.

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
(Figure: points labeled "Response" / "No response"; E is classified as Response.)

Distance Between Neighbors
Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum over i of (xi - yi)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2
Distance(John, Rachel) = sqrt[ (35-41)^2 + (95K-215K)^2 + (3-2)^2 ]
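The distance formula above translates directly into code; a minimal sketch using the John/Rachel attribute vectors from the slide (income in K):

```python
import math

def euclidean(x, y):
    # Euclidean distance between two equal-length attribute vectors.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in K, number of credit cards
rachel = (41, 215, 2)
d = euclidean(john, rachel)  # sqrt(36 + 14400 + 1)
```

Note that the raw income attribute dominates the sum; in practice attributes are usually normalized before computing distances.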

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
(Figure: the new instance is classified as Response.)

Example: 3-Nearest Neighbors

Customer  Age  Income  No. credit cards  Response
John      35   35K     3                 No
Rachel    22   50K     2                 Yes
Hannah    63   200K    1                 No
Tom       59   170K    1                 No
Nellie    25   40K     4                 Yes
David     37   50K     2                 ?

Distances from David:
John:   sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2]  = 15.16
Rachel: sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2]  = 15
Hannah: sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom:    sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie: sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2]  = 15.74

The three nearest neighbors are Rachel (Yes), John (No), and Nellie (Yes), so David's predicted response is Yes.
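The worked example above can be reproduced with a few lines of stdlib Python (a sketch of the 3-NN vote, not Weka's IBK implementation):

```python
import math
from collections import Counter

# Training examples: (age, income in K, no. of credit cards) -> response.
customers = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# The three nearest neighbours to David, by Euclidean distance.
nearest = sorted(customers, key=lambda n: dist(customers[n][0], david))[:3]
votes = Counter(customers[n][1] for n in nearest)
prediction = votes.most_common(1)[0][0]
```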

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible - easy to explain the prediction.
• Robust to noisy data by averaging the k nearest neighbors.
Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples).

Decision Tree
Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
The decision tree can be thought of as a set of sentences written in propositional logic.

Example
Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
Movie company payouts:
• Small box office: $200,000
• Medium box office: $1,000,000
• Large box office: $3,000,000
TV network payout:
• Flat rate: $900,000
Probabilities:
• P(Small Box Office) = 0.3
• P(Medium Box Office) = 0.6
• P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions                 Small Box Office  Medium Box Office  Large Box Office
Sign with Movie Company   $200,000          $1,000,000         $3,000,000
Sign with TV Network      $900,000          $900,000           $900,000
Prior Probabilities       0.3               0.6                0.1

Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000
Therefore, using this criterion, Jenny should select the movie contract.
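The expected-return arithmetic above can be checked directly:

```python
# Expected monetary value of each contract.
p = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie_payoff = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}

ev_movie = sum(p[s] * movie_payoff[s] for s in p)
ev_tv = sum(p[s] * 900_000 for s in p)  # flat payout regardless of outcome
best = "movie" if ev_movie > ev_tv else "tv"
```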

Decision Trees
Three types of "nodes":
• Decision nodes - represented by squares.
• Chance nodes - represented by circles (Ο).
• Terminal nodes - represented by triangles (optional).
Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes. Create the tree from left to right; solve the tree from right to left.

Example Decision Tree
(Figure: a decision node (square) branching to Decision 1 and Decision 2; a chance node (circle) branching to Event 1, Event 2, and Event 3.)

Jenny Lind Decision Tree
(Figure: "Sign with Movie Co." leads to a chance node with Small / Medium / Large box office outcomes paying $200,000 / $1,000,000 / $3,000,000; "Sign with TV Network" leads to Small / Medium / Large outcomes each paying $900,000.)

Jenny Lind Decision Tree
(Figure: the same tree with branch probabilities 0.3 / 0.6 / 0.1 and expected-return (ER) labels at each chance node.)

Jenny Lind Decision Tree - Solved
(Figure: the solved tree with branch probabilities 0.3 / 0.6 / 0.1; ER = $960,000 for "Sign with Movie Co." and ER = $900,000 for "Sign with TV Network", so the movie branch is chosen.)

Results
(Figure: the performance evaluation cycle - dataset → data preprocessing → feature selection → classification → performance evaluation, with a data mining tool selection step.)

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  - Split the data into 10 equal-sized pieces.
  - Train on 9 pieces and test on the remainder.
  - Do this for all possibilities and average.
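The 10-fold procedure described above can be sketched as a plain index split (an illustration, not Weka's stratified implementation; the 143-of-150 count comes from the slide):

```python
# Minimal 10-fold cross-validation split over example indices.
def kfold_indices(n, k):
    folds = []
    for i in range(k):
        test = list(range(i, n, k))  # every k-th index -> equal-sized folds
        train = [j for j in range(n) if j % k != i]
        folds.append((train, test))
    return folds

folds = kfold_indices(150, 10)   # 10 folds of 15 examples each
accuracy = 143 / 150             # 143 of 150 instances correctly classified
```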

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
The aim of this paper is to investigate the performance of different classification techniques: developing accurate prediction models for breast cancer using data mining techniques, comparing three classification techniques in the Weka software, and comparing the results. Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed and usually don't grow back.
Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed but sometimes grow back.

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.
Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant due to the fact that they divided the data set into two groups: one for the patients who survived more than 5 years and the other for those patients who died before 5 years.
Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.
Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and decision table (DT) to predict the survivability of heart disease patients.
Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.
Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analyzing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.
Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiocography1, cardiocography2) and other datasets not related to the medical domain.
B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 48: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Support Vector Machine (SVM)

bull SVMs maximize the margin around the separating hyperplane

bull The decision function is fully specified by a subset of training samples the support vectors

bull Solving SVMs is a quadratic programming problem

bull Seen by many as the most successful current text classification method

48

Support vectors

Maximizesmargin

Sec 151

Narrowermargin

04072023 AAST-Comp eng

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM

• Relatively new concept
• Nice generalization properties
• Hard to learn – learned in batch mode using quadratic programming techniques
• Using kernels, can learn very complex functions
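As a concrete sketch of these ideas (not part of the paper's Weka experiments), a linear SVM can be trained on a toy 2-D dataset with scikit-learn; the fitted model exposes the support vectors that fully specify its decision function:

```python
# Minimal sketch (assumes scikit-learn): a maximum-margin linear SVM on
# a toy 2-D dataset. Only the support vectors matter for the decision
# function.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],    # class 0
              [4, 4], [5, 4], [4, 5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)  # very large C approximates a hard margin
clf.fit(X, y)

print(clf.support_vectors_)              # a subset of the training points
print(clf.predict([[1.5, 1.5], [4.5, 4.5]]))  # one point near each cluster
```

Solving this small quadratic program picks out only the boundary points of each cluster as support vectors; the remaining training examples could be discarded without changing the classifier.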

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are." A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
 Calculate the distance between E and all examples in the training set.
 Select the K nearest examples to E in the training set.
 Assign E to the most common class among its K nearest neighbors.

(Figure: a query point among "Response" and "No response" examples; its class is Response.)

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ i=1..n (xi − yi)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)² + (95K − 215K)² + (3 − 2)²]
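The slide's distance formula translates directly into code; this small sketch reproduces the John/Rachel computation, with income measured in thousands:

```python
# Euclidean distance D(X, Y) = sqrt(sum_i (x_i - y_i)^2), as on the slide.
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

john = (35, 95, 3)     # age, income (K), no. of credit cards
rachel = (41, 215, 2)
print(round(euclidean(john, rachel), 2))  # sqrt(36 + 14400 + 1) ≈ 120.15
```

Note how the income term dominates the sum; in practice attributes are usually normalized before computing distances so that no single scale swamps the others.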

Instance Based Learning

 No model is built: store all training examples.
 Any processing is delayed until a new instance must be classified.

(Figure: a query point among "Response" and "No response" examples; its class is Response.)

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Distance from David:
 John:   sqrt[(35 − 37)² + (35 − 50)² + (3 − 2)²] = 15.16
 Rachel: sqrt[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15
 Hannah: sqrt[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.23
 Tom:    sqrt[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122
 Nellie: sqrt[(25 − 37)² + (40 − 50)² + (4 − 2)²] = 15.74

The three nearest neighbors (Rachel, John, Nellie) vote Yes, No, Yes, so David's predicted response is Yes.
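The worked example above can be sketched in a few lines of plain Python, computing each distance from David and taking the majority vote of the three closest customers:

```python
# Sketch: 3-NN classification of David over the slide's toy customer
# data (age, income in K, number of credit cards).
import math
from collections import Counter

train = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort the training set by distance from David and keep the 3 closest.
nearest = sorted(train, key=lambda row: dist(row[1], david))[:3]
label = Counter(resp for _, _, resp in nearest).most_common(1)[0][0]
print([name for name, _, _ in nearest], label)  # ['Rachel', 'John', 'Nellie'] Yes
```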

Strengths and Weaknesses

Strengths:
 Simple to implement and use.
 Comprehensible – easy to explain the prediction.
 Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
 Needs a lot of space to store all examples.
 Takes more time to classify a new example than with a model (must calculate and compare the distance from the new example to all stored examples).

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.
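As an illustrative sketch of this recursive splitting (using scikit-learn rather than the Weka tooling the paper itself relies on), a shallow tree can be induced on scikit-learn's bundled Wisconsin diagnostic breast cancer dataset and its learned rules printed as if/else sentences:

```python
# Sketch (assumes scikit-learn; the paper uses Weka): induce a small
# decision tree and print its rules, illustrating how the training set
# is recursively broken into smaller subsets.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)  # a related Wisconsin dataset
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # the tree rendered as nested if/else rules
```

Even at depth 2 the printed rules read like propositional sentences: each root-to-leaf path is a conjunction of threshold tests ending in a class.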

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
 – Small box office: $200,000
 – Medium box office: $1,000,000
 – Large box office: $3,000,000
• TV network payout:
 – Flat rate: $900,000
• Probabilities:
 – P(Small Box Office) = 0.3
 – P(Medium Box Office) = 0.6
 – P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decisions               | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior Probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EVUII, or EVBest
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
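The expected-return computation above is easy to check in code:

```python
# Sketch: expected value of each deal in the Jenny Lind example.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv    = {"small": 900_000, "medium": 900_000, "large": 900_000}

def expected_value(payoff):
    # EV = sum over states of P(state) * payoff(state)
    return sum(probs[s] * payoff[s] for s in probs)

# Movie EV ($960,000) exceeds the TV EV ($900,000), so take the movie deal.
print(expected_value(movie), expected_value(tv))
```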

Decision Trees

• Three types of "nodes":
 – Decision nodes – represented by squares (□)
 – Chance nodes – represented by circles (Ο)
 – Terminal nodes – represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branching into Decision 1 and Decision 2; a chance node branching into Event 1, Event 2, and Event 3.)

Jenny Lind Decision Tree

(Figure: a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with outcomes Small/Medium/Large Box Office at probabilities 0.3/0.6/0.1, paying $200,000, $1,000,000, and $3,000,000 on the movie branch and $900,000 on every network outcome.)

Jenny Lind Decision Tree – Solved

(Figure: the same tree solved from right to left; ER(movie) = $960,000 and ER(TV) = $900,000, so the network branch is pruned and the movie branch is kept, giving ER = $960,000.)

Results

(Figure: the data mining cycle – dataset → data preprocessing → feature selection → classification with a selected data mining tool → performance evaluation.)

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation

• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
 – Split the data into 10 equal-sized pieces
 – Train on 9 pieces and test on the remainder
 – Do this for all possibilities and average
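The same 10-fold protocol can be sketched outside Weka; this example (an assumption of ours, using scikit-learn and its bundled Wisconsin diagnostic dataset rather than the paper's exact data) cross-validates an SMO-like linear SVM:

```python
# Sketch (assumes scikit-learn rather than Weka): 10-fold
# cross-validation of a linear SVM, analogous to Weka's default
# evaluation described above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # a related Wisconsin dataset
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(model, X, y, cv=10)  # one score per held-out fold
print(scores.mean())                          # average accuracy over 10 folds
```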

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software, and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

 Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
 Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors

 Gender
 Age
 Genetic risk factors
 Family history
 Personal history of breast cancer
 Race: white or black
 Dense breast tissue: denser breast tissue carries a higher risk
 Certain benign (not cancer) breast problems
 Lobular carcinoma in situ
 Menstrual periods

Risk factors

 Breast radiation early in life
 Treatment with the drug DES (diethylstilbestrol) during pregnancy
 Not having children, or having them later in life
 Certain kinds of birth control
 Using hormone therapy after menopause
 Not breastfeeding
 Alcohol
 Being overweight or obese

BACKGROUND

 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND

 Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND

 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. Classification function algorithms were used and tested in that work; the performance factors used for analysing the efficiency of the algorithms are classification accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND

 C. Kaewchinporn presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.
 B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

 Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 breast-cancer-wisconsin has 699 instances.
 We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
 Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, hence the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
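The removal of the 16 incomplete instances can be reproduced with pandas. The sketch below uses two genuine rows of the UCI file plus its missing-value marker "?" in place of the full download; the short column names are our own shorthand, not the repository's:

```python
# Sketch: drop rows with missing "Bare Nuclei" values ("?" in the UCI
# file), as the paper does to go from 699 to 683 instances. Shown here
# on a tiny inline sample of the file format.
import io
import pandas as pd

sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"   # missing Bare Nuclei -> dropped
    "1056784,3,1,1,1,2,1,2,1,1,2\n"
)
cols = ["id", "clump_thickness", "cell_size", "cell_shape",
        "marginal_adhesion", "epithelial_size", "bare_nuclei",
        "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
df = pd.read_csv(sample, names=cols, na_values="?").dropna()
print(len(df))  # 2 of the 3 sample rows survive
```

Pointing `read_csv` at the repository's breast-cancer-wisconsin.data file instead of the inline sample would yield the paper's 683-row dataset.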

Attribute                   | Domain
Sample Code Number          | Id Number
Clump Thickness             | 1–10
Uniformity of Cell Size     | 1–10
Uniformity of Cell Shape    | 1–10
Marginal Adhesion           | 1–10
Single Epithelial Cell Size | 1–10
Bare Nuclei                 | 1–10
Bland Chromatin             | 1–10
Normal Nucleoli             | 1–10
Mitoses                     | 1–10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

 We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks.
 The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables

Attribute                   | 1    | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10  | Sum
Clump Thickness             | 139  | 50  | 104 | 79  | 128 | 33  | 23  | 44  | 14 | 69  | 683
Uniformity of Cell Size     | 373  | 45  | 52  | 38  | 30  | 25  | 19  | 28  | 6  | 67  | 683
Uniformity of Cell Shape    | 346  | 58  | 53  | 43  | 32  | 29  | 30  | 27  | 7  | 58  | 683
Marginal Adhesion           | 393  | 58  | 58  | 33  | 23  | 21  | 13  | 25  | 4  | 55  | 683
Single Epithelial Cell Size | 44   | 376 | 71  | 48  | 39  | 40  | 11  | 21  | 2  | 31  | 683
Bare Nuclei                 | 402  | 30  | 28  | 19  | 30  | 4   | 8   | 21  | 9  | 132 | 683
Bland Chromatin             | 150  | 160 | 161 | 39  | 34  | 9   | 71  | 28  | 11 | 20  | 683
Normal Nucleoli             | 432  | 36  | 42  | 18  | 19  | 22  | 16  | 23  | 15 | 60  | 683
Mitoses                     | 563  | 35  | 33  | 12  | 6   | 3   | 9   | 8   | 0  | 14  | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

(The source labels the seventh row "Bare Nuclei" a second time; by the attribute list it is Bland Chromatin.)

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (sec)        | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS

 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
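These definitions can be checked against the reported SMO confusion matrix (taking benign as the positive class: TP = 431, FN = 13, FP = 13, TN = 226):

```python
# Sketch: sensitivity, specificity and accuracy computed from SMO's
# confusion-matrix counts as reported in the paper.
tp, fn, fp, tn = 431, 13, 13, 226

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

# 0.971 0.946 0.9619 -- consistent with the 96.19 % accuracy reported.
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
```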

EXPERIMENTAL RESULTS

Classifier | TP    | FP    | Precision | Recall | Class
BF Tree    | 0.971 | 0.075 | 0.96      | 0.971  | Benign
           | 0.925 | 0.029 | 0.944     | 0.925  | Malignant
IBK        | 0.98  | 0.079 | 0.958     | 0.98   | Benign
           | 0.921 | 0.02  | 0.961     | 0.921  | Malignant
SMO        | 0.971 | 0.054 | 0.971     | 0.971  | Benign
           | 0.946 | 0.029 | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
           | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
           | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
           | 13     | 226       | Malignant

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Rank (importance)
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9

CONCLUSION

 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

 Use an updated version of Weka.
 Use another data mining tool.
 Use alternative algorithms and techniques.

Notes on paper

 Spelling mistakes
 No point of contact (e-mail)
 Wrong percentage calculation
 Copying from old papers
 Charts not clear
 No contributions

comparison

 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (ISSN 2277–0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1–3): 305–313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 49: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Non-Separable Case

04072023 AAST-Comp eng 49

The Lagrangian trick

SVM SVM

Relatively new concept Nice Generalization properties Hard to learn ndash learned in batch mode

using quadratic programming techniques Using kernels can learn very complex

functions

04072023 AAST-Comp eng 51

Classification ModelK-Nearest Neighbor

Classifier04072023 AAST-Comp eng 52

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were applied and tested in this work; the performance factors used for analyzing their efficiency are accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets, cardiotocography1 and cardiotocography2, as well as other datasets not related to the medical domain.

B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.

2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances.

We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.

Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).

Note: since 2 malignant and 14 benign instances were excluded, these percentages are wrong; the correct figures are benign 444 (65%) and malignant 239 (35%).
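The corrected class distribution is easy to verify; a quick stdlib-Python check, with the counts taken from the note above:

```python
# Verify the corrected class distribution of the 683-instance dataset.
benign, malignant = 444, 239        # counts after removing the 16 instances with missing values
total = benign + malignant

assert total == 683                 # 699 original instances minus 16 removed
benign_pct = round(100 * benign / total, 1)       # 65.0
malignant_pct = round(100 * malignant / total, 1)  # 35.0
```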


Attribute: Domain
Sample Code Number: id number
Clump Thickness: 1 - 10
Uniformity of Cell Size: 1 - 10
Uniformity of Cell Shape: 1 - 10
Marginal Adhesion: 1 - 10
Single Epithelial Cell Size: 1 - 10
Bare Nuclei: 1 - 10
Bland Chromatin: 1 - 10
Normal Nucleoli: 1 - 10
Mitoses: 1 - 10
Class: 2 for benign, 4 for malignant


EVALUATION METHODS

We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.

WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.

WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection. It is also well suited for developing new machine learning schemes.

WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables

Domain:                       1   2   3   4   5   6   7   8   9  10  Sum
Clump Thickness:             139  50 104  79 128  33  23  44  14  69  683
Uniformity of Cell Size:     373  45  52  38  30  25  19  28   6  67  683
Uniformity of Cell Shape:    346  58  53  43  32  29  30  27   7  58  683
Marginal Adhesion:           393  58  58  33  23  21  13  25   4  55  683
Single Epithelial Cell Size:  44 376  71  48  39  40  11  21   2  31  683
Bare Nuclei:                 402  30  28  19  30   4   8  21   9 132  683
Bland Chromatin:             150 160 161  39  34   9  71  28  11  20  683
Normal Nucleoli:             432  36  42  18  19  22  16  23  15  60  683
Mitoses:                     563  35  33  12   6   3   9   8   0  14  683
Sum:                        2843 850 605 333 346 192 207 233  77 516
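Every attribute's value counts should total the 683 instances; a small stdlib-Python sanity check, with the counts transcribed from the table above:

```python
# Attribute-value counts (domain values 1..10) transcribed from the table above.
counts = {
    "Clump Thickness":             [139, 50, 104, 79, 128, 33, 23, 44, 14, 69],
    "Uniformity of Cell Size":     [373, 45, 52, 38, 30, 25, 19, 28, 6, 67],
    "Uniformity of Cell Shape":    [346, 58, 53, 43, 32, 29, 30, 27, 7, 58],
    "Marginal Adhesion":           [393, 58, 58, 33, 23, 21, 13, 25, 4, 55],
    "Single Epithelial Cell Size": [44, 376, 71, 48, 39, 40, 11, 21, 2, 31],
    "Bare Nuclei":                 [402, 30, 28, 19, 30, 4, 8, 21, 9, 132],
    "Bland Chromatin":             [150, 160, 161, 39, 34, 9, 71, 28, 11, 20],
    "Normal Nucleoli":             [432, 36, 42, 18, 19, 22, 16, 23, 15, 60],
    "Mitoses":                     [563, 35, 33, 12, 6, 3, 9, 8, 0, 14],
}
row_sums = {name: sum(vals) for name, vals in counts.items()}
assert all(s == 683 for s in row_sums.values())  # each attribute covers all 683 instances
```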

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (sec)        | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19
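The accuracy row follows directly from correctly classified instances over the 683 total; a quick check in Python, with the counts from the table above:

```python
# Accuracy = correctly classified / total instances (683), as a percentage.
correct = {"BF Tree": 652, "IBK": 655, "SMO": 657}
total = 683
accuracy = {clf: round(100 * n / total, 2) for clf, n in correct.items()}
# Matches the table: 95.46, 95.90, 96.19
```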

EXPERIMENTAL RESULTS

The sensitivity or true positive rate (TPR) is defined by TP / (TP + FN).
The specificity or true negative rate (TNR) is defined by TN / (TN + FP).
The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).

True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS

Classifier | TP    | FP    | Precision | Recall | Class
BF Tree    | 0.971 | 0.075 | 0.96      | 0.971  | Benign
           | 0.925 | 0.029 | 0.944     | 0.925  | Malignant
IBK        | 0.98  | 0.079 | 0.958     | 0.98   | Benign
           | 0.921 | 0.02  | 0.961     | 0.921  | Malignant
SMO        | 0.971 | 0.054 | 0.971     | 0.971  | Benign
           | 0.946 | 0.029 | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree    | 431              | 13                  | Benign
           | 18               | 221                 | Malignant
IBK        | 435              | 9                   | Benign
           | 19               | 220                 | Malignant
SMO        | 431              | 13                  | Benign
           | 13               | 226                 | Malignant
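Applying the sensitivity, specificity, and accuracy definitions to the SMO confusion matrix recovers the reported figures; a stdlib-Python sketch, treating malignant as the positive class:

```python
# SMO confusion matrix from the table above (rows = actual class).
# Actual benign:    431 predicted benign, 13 predicted malignant
# Actual malignant:  13 predicted benign, 226 predicted malignant
tp, fn = 226, 13        # malignant taken as the positive class
tn, fp = 431, 13

sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

assert round(100 * accuracy, 2) == 96.19  # matches the reported SMO accuracy
assert round(sensitivity, 3) == 0.946     # recall for the malignant class
assert round(specificity, 3) == 0.971     # recall for the benign class
```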

Importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average    | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526 | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026 | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323 | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445   | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726 | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176 | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903 | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203 | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733  | 9
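The Average column in the importance table is simply the mean of the Chi-squared, Info Gain, and Gain Ratio scores; e.g. for Uniformity of Cell Size (decimal points inferred from the flattened figures):

```python
# Average score = mean of Chi-squared, Info Gain, and Gain Ratio.
chi2, info_gain, gain_ratio = 539.79308, 0.702, 0.3   # Uniformity of Cell Size row
average = (chi2 + info_gain + gain_ratio) / 3
assert abs(average - 180.265026) < 1e-5               # agrees with the Average column
```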

CONCLUSION

The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.

We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.

The performance of SMO is high compared with the other classifiers.

The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work: use an updated version of Weka; use another data mining tool; use alternative algorithms and techniques.

Notes on paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.

Comparison: "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.

That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, I.A.f.R.o.C., World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 50: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

SVM

A relatively new concept with nice generalization properties. Hard to learn: learned in batch mode using quadratic programming techniques. Using kernels, it can learn very complex functions.

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are." A new example is assigned to the most common class among the (K) examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
- Calculate the distance between E and all examples in the training set.
- Select the K nearest examples to E in the training set.
- Assign E to the most common class among its K nearest neighbors.

[Figure: points labelled Response / No response; the new example is assigned the class Response]

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( sum for i = 1..n of (xi - yi)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[ (35 - 41)^2 + (95K - 215K)^2 + (3 - 2)^2 ]

Instance Based Learning

No model is built: store all training examples. Any processing is delayed until a new instance must be classified.

[Figure: points labelled Response / No response; the new example is assigned the class Response]

Example: 3-Nearest Neighbors

Customer | Age | Income | No. credit cards | Response
John     | 35  | 35K    | 3                | No
Rachel   | 22  | 50K    | 2                | Yes
Hannah   | 63  | 200K   | 1                | No
Tom      | 59  | 170K   | 1                | No
Nellie   | 25  | 40K    | 4                | Yes
David    | 37  | 50K    | 2                | ?

Customer | Age | Income (K) | No. cards | Response | Distance from David
John     | 35  | 35         | 3         | No       | sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel   | 22  | 50         | 2         | Yes      | sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah   | 63  | 200        | 1         | No       | sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom      | 59  | 170        | 1         | No       | sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122.0
Nellie   | 25  | 40         | 4         | Yes      | sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David    | 37  | 50         | 2         | Yes (majority class of the 3 nearest: Rachel, John, Nellie)
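The worked example can be reproduced end to end; a minimal stdlib-Python sketch of the 3-NN rule, with the customer table from the slides (income in K):

```python
import math
from collections import Counter

# Training examples: (age, income in K, number of credit cards) -> response
training = [
    ((35, 35, 3), "No"),    # John
    ((22, 50, 2), "Yes"),   # Rachel
    ((63, 200, 1), "No"),   # Hannah
    ((59, 170, 1), "No"),   # Tom
    ((25, 40, 4), "Yes"),   # Nellie
]

def euclidean(x, y):
    # Straight-line distance between two attribute vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, examples, k=3):
    # Take the majority class of the k examples nearest to the query.
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

david = (37, 50, 2)
prediction = knn_predict(david, training)   # nearest three: Rachel, John, Nellie
```

With these numbers the three nearest neighbors vote Yes, Yes, No, so David is classified as "Yes", matching the slide.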

Strengths and Weaknesses

Strengths: simple to implement and use; comprehensible (easy to explain the prediction); robust to noisy data by averaging the k nearest neighbors.

Weaknesses: needs a lot of space to store all examples; takes more time to classify a new example than with a model (the distance from the new example to all other examples must be calculated and compared).

Decision Tree

Decision tree induction is a simple but powerful learning paradigm. In this method a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network she will receive a single lump sum, but if she signs with the movie company the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

Movie company payouts:
- Small box office: $200,000
- Medium box office: $1,000,000
- Large box office: $3,000,000

TV network payout:
- Flat rate: $900,000

Probabilities:
- P(Small Box Office) = 0.3
- P(Medium Box Office) = 0.6
- P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decision                | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII) or EV(Best)
EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
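The expected-return arithmetic is easy to sketch in Python, with the payoffs and probabilities taken from the payoff table:

```python
# Expected value of each decision = sum of payoff * probability over box-office outcomes.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie_payoff = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv_payoff = {"small": 900_000, "medium": 900_000, "large": 900_000}

ev_movie = sum(movie_payoff[s] * p for s, p in probs.items())
ev_tv = sum(tv_payoff[s] * p for s, p in probs.items())

assert ev_movie == 960_000 and ev_tv == 900_000
best = "movie" if ev_movie > ev_tv else "tv"   # Jenny should sign with the movie company
```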


Decision Trees

Three types of "nodes":
- Decision nodes, represented by squares
- Chance nodes, represented by circles
- Terminal nodes, represented by triangles (optional)

Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes.

Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

[Diagram: a decision node (square) branching into Decision 1 and Decision 2; each decision leads to a chance node (circle) with branches Event 1, Event 2, Event 3]

Jenny Lind Decision Tree

[Diagram: a decision node with branches "Sign with Movie Co." and "Sign with TV Network". The movie branch reaches a chance node with Small / Medium / Large box office outcomes paying $200,000 / $1,000,000 / $3,000,000; the TV branch pays $900,000 in every outcome]

Jenny Lind Decision Tree

[Diagram: the same tree annotated with branch probabilities 0.3 / 0.6 / 0.1 and an expected return (ER) label at each chance node]

Jenny Lind Decision Tree - Solved

[Diagram: the solved tree; ER = $960,000 for the movie branch, ER = $900,000 for the TV branch, so the best expected return is $960,000]

Results

[Diagram: the performance evaluation cycle - dataset, data preprocessing, feature selection, classification with the selected data mining tool, performance evaluation]

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation

- Correctly Classified Instances: 143 (95.33%)
- Incorrectly Classified Instances: 7 (4.67%)
- Default 10-fold cross validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
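The 10-fold procedure can be sketched with the standard library alone; `k_fold_splits` below is a hypothetical helper, not part of Weka, that partitions instance indices into k test folds:

```python
# Split n instance indices into k folds; each fold serves once as the test set
# while the remaining k-1 folds form the training set.
def k_fold_splits(n, k=10):
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]   # k roughly equal-sized pieces
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

splits = list(k_fold_splits(150, k=10))
assert len(splits) == 10
assert all(len(train) + len(test) == 150 for train, test in splits)
# 150 instances -> 15 test instances per fold, consistent with the 143 + 7 = 150 run above
```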


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract

The aim of this paper is to investigate the performance of different classification techniques and to develop accurate prediction models for breast cancer using data mining techniques.

Three classification techniques are compared in the Weka software and the comparison results are reported.

Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
- Are usually not harmful
- Rarely invade the tissues around them
- Don't spread to other parts of the body
- Can be removed and usually don't grow back

Malignant tumors:
- May be a threat to life
- Can invade nearby organs and tissues (such as the chest wall)
- Can spread to other parts of the body
- Often can be removed but sometimes grow back

Risk factors

Gender. Age. Genetic risk factors. Family history. Personal history of breast cancer. Race (white or black). Dense breast tissue (denser breast tissue carries a higher risk). Certain benign (not cancer) breast problems. Lobular carcinoma in situ. Menstrual periods.

Risk factors

Breast radiation early in life. Treatment with the drug DES (diethylstilbestrol) during pregnancy. Not having children, or having them later in life. Certain kinds of birth control. Using hormone therapy after menopause. Not breastfeeding. Alcohol. Being overweight or obese.

Page 51: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Classification Model: K-Nearest Neighbor Classifier

K-Nearest Neighbor Classifier

Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

A new example is assigned to the most common class among the K examples that are most similar to it.

K-Nearest Neighbor Algorithm

To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K examples in the training set nearest to E.
• Assign E to the most common class among its K nearest neighbors.
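The three steps above can be sketched in a few lines of Python (an illustrative stand-alone sketch, not the Weka implementation used later in the deck; the function and data names are hypothetical):

```python
import math
from collections import Counter

def knn_classify(new_example, training_set, k=3):
    """Assign new_example to the most common class among its k nearest
    neighbors in training_set (a list of (attributes, label) pairs)."""
    # 1. Calculate the distance between E and all training examples.
    distances = [(math.dist(attrs, new_example), label)
                 for attrs, label in training_set]
    # 2. Select the K nearest examples.
    nearest = sorted(distances)[:k]
    # 3. Assign E to the most common class among them.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify((0, 0), [((0, 1), "A"), ((0, 2), "A"), ((5, 5), "B")]))  # A
```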

(Diagram: training points labeled "Response" and "No response"; the new example's class is "Response".)

Distance Between Neighbors

Each example is represented with a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, …, xn) and Y = (y1, y2, y3, …, yn) is defined as:

D(X, Y) = sqrt( Σ_{i=1..n} (xi − yi)² )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)² + (95K − 215K)² + (3 − 2)²]
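A quick check of the John/Rachel computation (income in thousands, as on the slide; `euclidean_distance` is a hypothetical helper name):

```python
import math

def euclidean_distance(x, y):
    # D(X, Y) = sqrt(sum over i of (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in K, number of credit cards
rachel = (41, 215, 2)
print(round(euclidean_distance(john, rachel), 2))  # 120.15
```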

Instance-Based Learning

• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Diagram: training points labeled "Response" and "No response"; the new example's class is "Response".)

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.17
Rachel     22    50           2           Yes        sqrt[(22−37)² + (50−50)² + (2−2)²] = 15.00
Hannah     63    200          1           No         sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.24
Tom        59    170          1           No         sqrt[(59−37)² + (170−50)² + (1−2)²] = 122.00
Nellie     25    40           4           Yes        sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.75
David      37    50           2           Yes (predicted from the 3 nearest: Rachel, John, Nellie)
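Running 3-NN on this table reproduces the prediction for David (an illustrative sketch; attribute order is age, income in K, number of cards):

```python
import math
from collections import Counter

customers = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

# Distance from David to every stored example, smallest first.
dists = sorted(
    (math.dist(attrs, david), label) for attrs, label in customers.values()
)
# Majority class among the 3 nearest: Rachel (15.0, Yes),
# John (15.17, No), Nellie (15.75, Yes).
prediction = Counter(label for _, label in dists[:3]).most_common(1)[0][0]
print(prediction)  # Yes
```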

Strengths and Weaknesses

Strengths:
• Simple to implement and use.
• Comprehensible: easy to explain a prediction.
• Robust to noisy data by averaging the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than a model-based method (the distance from the new example to every stored example must be calculated and compared).

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. A set of training examples is broken down into smaller and smaller subsets while an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small box office) = 0.3
  – P(Medium box office) = 0.6
  – P(Large box office) = 0.1

Jenny Lind - Payoff Table

Decisions                   States of Nature
                            Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company     $200,000           $1,000,000          $3,000,000
Sign with TV Network        $900,000           $900,000            $900,000
Prior Probabilities         0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV_Best
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
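The arithmetic can be checked exactly (using `Fraction` to avoid floating-point noise; the names are illustrative):

```python
from fractions import Fraction

probs = {"small": Fraction(3, 10), "medium": Fraction(6, 10), "large": Fraction(1, 10)}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = 900_000

ev_movie = sum(probs[s] * movie[s] for s in probs)  # exact expected values
ev_tv = sum(probs[s] * tv for s in probs)
print(ev_movie, ev_tv)  # 960000 900000
```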


Decision Trees

• Three types of "nodes":
  – Decision nodes, represented by squares (□)
  – Chance nodes, represented by circles (○)
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.
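The right-to-left solving rule (expected value at chance nodes, best branch at decision nodes) can be sketched as a small recursive function; the tree encoding below is a hypothetical illustration using the Jenny Lind numbers:

```python
from fractions import Fraction

# A node is either ("decision", {name: subtree}), ("chance", [(prob, subtree)]),
# or a terminal payoff (a plain number).
tree = ("decision", {
    "movie": ("chance", [(Fraction(3, 10), 200_000),
                         (Fraction(6, 10), 1_000_000),
                         (Fraction(1, 10), 3_000_000)]),
    "tv":    ("chance", [(Fraction(1, 1), 900_000)]),
})

def solve(node):
    """Return the expected value of a (sub)tree, solving right to left."""
    if not isinstance(node, tuple):      # terminal node: its payoff
        return node
    kind, body = node
    if kind == "chance":                 # expected value over outcomes
        return sum(p * solve(child) for p, child in body)
    # decision node: keep only the best branch
    return max(solve(child) for child in body.values())

print(solve(tree))  # 960000
```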


Example Decision Tree

(Diagram: a decision node (square) branches into Decision 1 and Decision 2; each leads to a chance node (circle) with branches Event 1, Event 2, and Event 3.)

Jenny Lind Decision Tree

(Diagram: the decision node branches into "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with Small, Medium, and Large Box Office branches paying $200,000 / $1,000,000 / $3,000,000 for the movie and $900,000 on every branch for the network.)

Jenny Lind Decision Tree

(Diagram: the same tree annotated with branch probabilities 0.3, 0.6, and 0.1 and an expected-return (ER) placeholder at each chance node.)

Jenny Lind Decision Tree - Solved

(Diagram: with probabilities 0.3, 0.6, and 0.1, the movie-contract chance node has ER = $960,000 and the TV-contract node ER = $900,000, so the movie branch is chosen with ER = $960,000.)

Results

(Diagram: the performance-evaluation cycle: dataset → data preprocessing → feature selection → data-mining tool selection → classification → performance evaluation.)

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation

• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default is 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces.
  – Train on 9 pieces and test on the remainder.
  – Do this for all possibilities and average.
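The fold-splitting procedure can be sketched in plain Python; `train` and `evaluate` below are toy stand-ins (a majority-class learner and plain accuracy), not the Weka classifiers used in the paper:

```python
def cross_validate(data, train, evaluate, folds=10):
    """Split data into `folds` equal pieces, train on folds-1 pieces,
    test on the remainder, and average the fold accuracies."""
    size = len(data) // folds
    accuracies = []
    for i in range(folds):
        test = data[i * size:(i + 1) * size]
        rest = data[:i * size] + data[(i + 1) * size:]
        model = train(rest)
        accuracies.append(evaluate(model, test))
    return sum(accuracies) / folds

# Toy stand-ins: "train" learns the majority label, "evaluate" is accuracy.
def train(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def evaluate(model, rows):
    return sum(label == model for _, label in rows) / len(rows)

data = [(x, "benign" if x % 3 else "malignant") for x in range(100)]
print(round(cross_validate(data, train, evaluate), 2))  # 0.66
```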


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.


Introduction

Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful.
• Rarely invade the tissues around them.
• Don't spread to other parts of the body.
• Can be removed, and usually don't grow back.

Malignant tumors:
• May be a threat to life.
• Can invade nearby organs and tissues (such as the chest wall).
• Can spread to other parts of the body.
• Often can be removed, but sometimes grow back.

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict survivability for breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)

• Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict survivability for heart disease patients.

BACKGROUND (cont.)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in chemometrics for the pharmaceutical industry.

BACKGROUND (cont.)

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analyzing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-Wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so the corrected distribution is benign 444 (65%) and malignant 239 (35%).
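The corrected counts and percentages are easy to verify:

```python
benign, malignant = 458, 241               # original 699 instances
benign_missing, malignant_missing = 14, 2  # the 16 removed instances
b = benign - benign_missing
m = malignant - malignant_missing
total = b + m
print(total, b, m)                                     # 683 444 239
print(round(100 * b / total), round(100 * m / total))  # 65 35
```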

Attribute                      Domain
Sample Code Number             ID number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for benign, 4 for malignant

EVALUATION METHODS

• We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

(Screenshots of the Weka experiments.)

importance of the input variables

Value distribution of each attribute (counts per value 1-10):

Attribute                    1     2    3    4    5    6    7    8    9    10   Sum
Clump Thickness              139   50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27   7    58   683
Marginal Adhesion            393   58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size  44    376  71   48   39   40   11   21   2    31   683
Bare Nuclei                  402   30   28   19   30   4    8    21   9    132  683
Bland Chromatin              150   160  161  39   34   9    71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12   6    3    9    8    0    14   683
Sum                          2843  850  605  333  346  192  207  233  77   516

(The original table labels the seventh row "Bare Nuclei" a second time; from the attribute list it is Bland Chromatin.)

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (sec)          0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
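Applying these definitions to SMO's confusion matrix (431 / 13 / 13 / 226, with benign taken as the positive class) reproduces the reported rates:

```python
tp, fn = 431, 13   # benign samples: correctly predicted / missed
fp, tn = 13, 226   # malignant predicted as benign / correctly predicted

sensitivity = tp / (tp + fn)               # TPR = TP / (TP + FN)
specificity = tn / (tn + fp)               # TNR = TN / (TN + FP)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
# 0.971 0.946 0.9619
```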

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
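The accuracy column of the earlier results table follows directly from these matrices:

```python
# (correct benign, benign->malignant, malignant->benign, correct malignant)
matrices = {
    "BF Tree": (431, 13, 18, 221),
    "IBK":     (435, 9, 19, 220),
    "SMO":     (431, 13, 13, 226),
}
accs = {name: round(100 * (tp + tn) / (tp + fn + fp + tn), 2)
        for name, (tp, fn, fp, tn) in matrices.items()}
for name, acc in accs.items():
    print(name, acc)
# BF Tree 95.46
# IBK 95.9
# SMO 96.19
```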

importance of the input variables

Variable                     Chi-squared   Info Gain   Gain Ratio   Average    Importance Rank
Clump Thickness              378.08158     0.464       0.152        126.2325   8
Uniformity of Cell Size      539.79308     0.702       0.300        180.2650   1
Uniformity of Cell Shape     523.07097     0.677       0.272        174.6733   2
Marginal Adhesion            390.05950     0.464       0.210        130.2445   7
Single Epithelial Cell Size  447.86118     0.534       0.233        149.5427   5
Bare Nuclei                  489.00953     0.603       0.303        163.3052   3
Bland Chromatin              453.20971     0.555       0.201        151.3219   4
Normal Nucleoli              416.63061     0.487       0.237        139.1182   6
Mitoses                      191.96820     0.212       0.212        64.1227    9

("Average" is the mean of the three scores; rank 1 is the most important attribute.)

CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• SMO shows a high level of performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM – Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 52: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

K-Nearest Neighbor Classifier

Learning by analogyTell me who your friends are and Irsquoll

tell you who you areA new example is assigned to the

most common class among the (K) examples that are most similar to it

04072023 AAST-Comp eng 53

K-Nearest Neighbor Algorithm To determine the class of a new example

E Calculate the distance between E and all

examples in the training set Select K-nearest examples to E in the training

set Assign E to the most common class among its

K-nearest neighbors

Response

ResponseNo response

No response

No response

Class Response04072023 AAST-Comp eng 54

Each example is represented with a set of numerical attributes

ldquoClosenessrdquo is defined in terms of the Euclidean distance between two examples The Euclidean distance between X=(x1 x2 x3hellipxn) and

Y =(y1y2 y3hellipyn) is defined as

Distance (John Rachel)=sqrt [(35-41)2+(95K-215K)2 +(3-2)2]

n

iii yxYXD

1

2)()(

JohnAge=35Income=95KNo of credit cards=3

Rachel Age=41Income=215KNo of credit cards=2

Distance Between Neighbors

04072023 AAST-Comp eng 55

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.


BACKGROUND (cont.)
• Bellaachi et al. used naïve Bayes, a decision tree and a back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into only two groups: patients who survived more than 5 years, and patients who died within 5 years.
• Vikas Chaurasia et al. used Naïve Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.


BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.


BACKGROUND (cont.)
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were applied and tested in that work; the performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.


BACKGROUND (cont.)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.


BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• The breast-cancer-Wisconsin dataset has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
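The corrected percentages can be verified directly from the counts on this slide; a quick sketch (not from the paper):

```python
# Recompute the class distribution after dropping the 16 instances
# with missing values (14 benign, 2 malignant), as noted above.
benign = 458 - 14       # 444
malignant = 241 - 2     # 239
total = benign + malignant

pct_benign = round(100 * benign / total)        # 65
pct_malignant = round(100 * malignant / total)  # 35
print(total, pct_benign, pct_malignant)  # 683 65 35
```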


Attribute                      Domain
Sample Code Number             Id Number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for Benign, 4 for Malignant
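Removing the 16 incomplete instances amounts to dropping every row that contains a missing value; in the raw UCI file a missing entry is encoded as '?'. A minimal sketch, with made-up rows in the layout above (id, 9 attributes, class):

```python
# Hypothetical sample rows; '?' marks a missing value (in the real file
# these occur in the Bare Nuclei attribute).
rows = [
    ["1000001", 5, 1, 1, 1, 2, 1, 3, 1, 1, 2],
    ["1000002", 8, 4, 5, 1, 2, "?", 7, 3, 1, 4],  # incomplete: dropped
    ["1000003", 4, 1, 1, 3, 2, 1, 3, 1, 1, 2],
]

clean = [r for r in rows if "?" not in r]
print(len(clean))  # 2
```

Applied to the full file, the same filter takes 699 instances down to 683.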


EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS
[Two chart slides comparing the classifiers; the charts are not reproduced in this transcript.]

importance of the input variables


Domain                         1    2    3   4   5   6   7   8   9   10  Sum
Clump Thickness              139   50  104  79 128  33  23  44  14   69  683
Uniformity of Cell Size      373   45   52  38  30  25  19  28   6   67  683
Uniformity of Cell Shape     346   58   53  43  32  29  30  27   7   58  683
Marginal Adhesion            393   58   58  33  23  21  13  25   4   55  683
Single Epithelial Cell Size   44  376   71  48  39  40  11  21   2   31  683
Bare Nuclei                  402   30   28  19  30   4   8  21   9  132  683
Bland Chromatin              150  160  161  39  34   9  71  28  11   20  683
Normal Nucleoli              432   36   42  18  19  22  16  23  15   60  683
Mitoses                      563   35   33  12   6   3   9   8   0   14  683
Sum                         2843  850  605 333 346 192 207 233  77  516

EXPERIMENTAL RESULTS


Evaluation Criteria                   BF Tree    IBK     SMO
Time to build model (in sec)            0.97     0.02    0.33
Correctly classified instances           652      655     657
Incorrectly classified instances          31       28      26
Accuracy (%)                           95.46    95.90   96.19
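The accuracy row follows directly from the counts over the 683 instances; a quick check (not part of the paper):

```python
# Correctly classified instances per classifier, out of 683.
total = 683
correct = {"BF Tree": 652, "IBK": 655, "SMO": 657}

accuracy = {name: round(100 * c / total, 2) for name, c in correct.items()}
print(accuracy)  # {'BF Tree': 95.46, 'IBK': 95.9, 'SMO': 96.19}
```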

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS
Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075     0.96        0.971    Benign
              0.925     0.029     0.944       0.925    Malignant
IBK           0.98      0.079     0.958       0.98     Benign
              0.921     0.02      0.961       0.921    Malignant
SMO           0.971     0.054     0.971       0.971    Benign
              0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS
Confusion matrices (rows = actual class):
Classifier   Benign   Malignant   Class
BF Tree        431        13      Benign
                18       221      Malignant
IBK            435         9      Benign
                19       220      Malignant
SMO            431        13      Benign
                13       226      Malignant
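Taking benign as the positive class, the definitions above applied to the SMO confusion matrix reproduce the reported figures; a short sketch:

```python
# SMO confusion matrix, benign treated as the positive class.
tp, fn = 431, 13   # actual benign: 431 correct, 13 predicted malignant
fp, tn = 13, 226   # actual malignant: 13 predicted benign, 226 correct

sensitivity = tp / (tp + fn)                # TPR
specificity = tn / (tn + fp)                # TNR
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3), round(specificity, 3), round(100 * accuracy, 2))
# 0.971 0.946 96.19
```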

importance of the input variables


Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness                378.08158      0.464       0.152       126.232526        8
Uniformity of Cell Size        539.79308      0.702       0.3         180.265026        1
Uniformity of Cell Shape       523.07097      0.677       0.272       174.673323        2
Marginal Adhesion              390.0595       0.464       0.21        130.2445          7
Single Epithelial Cell Size    447.86118      0.534       0.233       149.542726        5
Bare Nuclei                    489.00953      0.603       0.303       163.305176        3
Bland Chromatin                453.20971      0.555       0.201       151.321903        4
Normal Nucleoli                416.63061      0.487       0.237       139.118203        6
Mitoses                        191.9682       0.212       0.212        64.122733        9
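The "Average Rank" column above appears to be the plain mean of the three scores for each variable; checking the top-ranked attribute, Uniformity of Cell Size (a verification sketch, not from the paper):

```python
# Scores for Uniformity of Cell Size, read from the table above.
chi_sq, info_gain, gain_ratio = 539.79308, 0.702, 0.3

average = (chi_sq + info_gain + gain_ratio) / 3
print(round(average, 5))  # 180.26503
```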


CONCLUSION
• The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques


Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions


Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.


References


[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, I.A.f.R.o.C. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 53: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

K-Nearest Neighbor Algorithm
To determine the class of a new example E:
• Calculate the distance between E and all examples in the training set.
• Select the K nearest examples to E in the training set.
• Assign E to the most common class among its K nearest neighbors.
[Diagram: points labeled Response / No response; the new point is assigned class Response.]

Distance Between Neighbors
• Each example is represented with a set of numerical attributes.
• "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

  D(X, Y) = sqrt( Σ_{i=1..n} (x_i − y_i)² )

• Example:
  John: Age = 35, Income = 95K, No. of credit cards = 3
  Rachel: Age = 41, Income = 215K, No. of credit cards = 2
  Distance(John, Rachel) = sqrt[ (35 − 41)² + (95K − 215K)² + (3 − 2)² ]
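The distance formula can be written as a small function; a sketch (income expressed in K so the units match the slide):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

john = (35, 95, 3)     # age, income in K, number of credit cards
rachel = (41, 215, 2)

print(round(euclidean(john, rachel), 2))  # 120.15
```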

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.
[Diagram: points labeled Response / No response; the new point is assigned class Response.]

Example: 3-Nearest Neighbors

Customer   Age   Income   No. of credit cards   Response
John        35     35K            3               No
Rachel      22     50K            2               Yes
Hannah      63    200K            1               No
Tom         59    170K            1               No
Nellie      25     40K            4               Yes
David       37     50K            2               ?

Distance from David:
John:   sqrt[(35−37)² + (35−50)² + (3−2)²] = 15.16
Rachel: sqrt[(22−37)² + (50−50)² + (2−2)²] = 15
Hannah: sqrt[(63−37)² + (200−50)² + (1−2)²] = 152.23
Tom:    sqrt[(59−37)² + (170−50)² + (1−2)²] = 122
Nellie: sqrt[(25−37)² + (40−50)² + (4−2)²] = 15.74

The three nearest neighbors are Rachel (Yes), John (No) and Nellie (Yes), so David is classified as Yes.
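The whole 3-NN example can be replayed in a few lines; a sketch, not the presentation's code:

```python
import math
from collections import Counter

# Training customers: (age, income in K, number of credit cards) -> response.
train = {
    "John":   ((35, 35, 3), "No"),
    "Rachel": ((22, 50, 2), "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4), "Yes"),
}
david = (37, 50, 2)

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Sort training examples by distance to David and vote among the 3 nearest.
nearest = sorted(train.items(), key=lambda kv: dist(kv[1][0], david))[:3]
labels = [resp for _, (_, resp) in nearest]
prediction = Counter(labels).most_common(1)[0][0]
print([name for name, _ in nearest], prediction)
# ['Rachel', 'John', 'Nellie'] Yes
```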


Strengths and Weaknesses
Strengths:
• Simple to implement and use
• Comprehensible: easy to explain the prediction
• Robust to noisy data by averaging the k nearest neighbors
Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

Decision Tree


• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small Box Office) = 0.3
  - P(Medium Box Office) = 0.6
  - P(Large Box Office) = 0.1


Jenny Lind - Payoff Table

                                        States of Nature
Decisions                   Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company         $200,000          $1,000,000         $3,000,000
Sign with TV Network            $900,000            $900,000           $900,000
Prior Probabilities                0.3                 0.6                0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
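The same computation in code (a sketch of the expected-value criterion):

```python
# Expected value of each contract under the prior probabilities above.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
tv = {"small": 900_000, "medium": 900_000, "large": 900_000}

ev_movie = sum(probs[s] * movie[s] for s in probs)
ev_tv = sum(probs[s] * tv[s] for s in probs)
print(round(ev_movie), round(ev_tv))  # 960000 900000
```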


Decision Trees
• Three types of "nodes":
  - Decision nodes: represented by squares (□)
  - Chance nodes: represented by circles (Ο)
  - Terminal nodes: represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding the expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree
[Diagram: a square decision node with branches Decision 1 and Decision 2, leading to a circular chance node with branches Event 1, Event 2 and Event 3.]

Jenny Lind Decision Tree
[Diagram: a decision node chooses between "Sign with Movie Co" and "Sign with TV Network"; each choice leads to a chance node with Small, Medium and Large Box Office branches, paying $200,000 / $1,000,000 / $3,000,000 for the movie contract and $900,000 on every branch for the TV contract.]

Jenny Lind Decision Tree
[Same diagram, now with probabilities 0.3, 0.6 and 0.1 attached to the Small, Medium and Large Box Office branches; the expected returns (ER) at the chance nodes are left to be computed.]

Jenny Lind Decision Tree - Solved
[Same diagram, solved: the movie chance node folds back to ER = $960,000 and the TV chance node to ER = $900,000, so the "Sign with Movie Co" branch is kept.]
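The fold-back procedure (expected value at chance nodes, maximum at decision nodes) generalizes to any such tree; a small sketch, not from the presentation:

```python
# A node is either a terminal payoff (number), a chance node
# ("chance", [(prob, node), ...]), or a decision node
# ("decision", {label: node, ...}).

def solve(node):
    """Fold the tree back from right to left, returning its value."""
    if isinstance(node, (int, float)):
        return node
    kind, branches = node
    if kind == "chance":
        return sum(p * solve(child) for p, child in branches)
    # Decision node: keep only the best branch.
    return max(solve(child) for child in branches.values())

jenny = ("decision", {
    "Sign with Movie Co": ("chance", [(0.3, 200_000),
                                      (0.6, 1_000_000),
                                      (0.1, 3_000_000)]),
    "Sign with TV Network": ("chance", [(0.3, 900_000),
                                        (0.6, 900_000),
                                        (0.1, 900_000)]),
})
print(round(solve(jenny)))  # 960000
```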


Results
[Diagram: the performance-evaluation cycle: dataset → data preprocessing → feature selection → classification with the selected data mining tool → performance evaluation.]

Evaluation Metrics

                      Predicted as healthy   Predicted as unhealthy
Actual healthy                tp                       fn
Actual not healthy            fp                       tn

Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
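The fold bookkeeping behind this default can be sketched as follows (150 instances, matching the counts above; a toy illustration, not Weka's implementation):

```python
import random

def k_folds(n_instances, k=10, seed=0):
    """Shuffle instance indices and split them into k (nearly) equal folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_folds(150)
# Each round trains on 9 folds and tests on the held-out one;
# the per-fold accuracies are then averaged.
print(len(folds), [len(f) for f in folds])  # 10 folds of 15 instances each
```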

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102


Page 54: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Distance Between Neighbors

Each example is represented by a set of numerical attributes. "Closeness" is defined in terms of the Euclidean distance between two examples. The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as:

D(X, Y) = sqrt( Σ i=1..n (xi − yi)^2 )

John: Age = 35, Income = 95K, No. of credit cards = 3
Rachel: Age = 41, Income = 215K, No. of credit cards = 2

Distance(John, Rachel) = sqrt[(35 − 41)^2 + (95 − 215)^2 + (3 − 2)^2]  (income in thousands)
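As a quick check of the distance formula, the John–Rachel computation above can be reproduced in a few lines of Python (income expressed in thousands, as on the slide):

```python
import math

# Euclidean distance between John (35, 95, 3) and Rachel (41, 215, 2).
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean((35, 95, 3), (41, 215, 2))
print(round(d, 2))  # sqrt(36 + 14400 + 1) ≈ 120.15
```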

Instance Based Learning
• No model is built: store all training examples.
• Any processing is delayed until a new instance must be classified.

(Figure: training examples labeled "Response" / "No response"; a new instance is assigned the class "Respond" from its nearest neighbours.)

Example: 3-Nearest Neighbors

Customer   Age   Income   No. of credit cards   Response
John       35    35K      3                     No
Rachel     22    50K      2                     Yes
Hannah     63    200K     1                     No
Tom        59    170K     1                     No
Nellie     25    40K      4                     Yes
David      37    50K      2                     ?

Customer   Age   Income (K)   No. of cards   Response   Distance from David
John       35    35           3              No         sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel     22    50           2              Yes        sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah     63    200          1              No         sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom        59    170          1              No         sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie     25    40           4              Yes        sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74

The three nearest neighbours of David are Rachel (Yes), John (No) and Nellie (Yes); the majority vote gives David the predicted response: Yes.
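The 3-NN vote above can be sketched in a few lines of Python. The data comes from the table; `knn_predict` is an illustrative helper, not part of the paper:

```python
import math

# Training data from the slide: (age, income in thousands, no. of credit cards).
train = [
    ("John",   (35, 35, 3),  "No"),
    ("Rachel", (22, 50, 2),  "Yes"),
    ("Hannah", (63, 200, 1), "No"),
    ("Tom",    (59, 170, 1), "No"),
    ("Nellie", (25, 40, 4),  "Yes"),
]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, k=3):
    # Take the k training examples nearest to the query ...
    nearest = sorted(train, key=lambda row: euclidean(row[1], query))[:k]
    labels = [label for _, _, label in nearest]
    # ... and return the majority label among them.
    return max(set(labels), key=labels.count)

david = (37, 50, 2)
print(knn_predict(david))  # nearest are Rachel, John, Nellie -> "Yes"
```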

Strengths and Weaknesses
Strengths:
• Simple to implement and use.
• Comprehensible: the prediction is easy to explain.
• Robust to noisy data, by averaging over the k nearest neighbors.

Weaknesses:
• Needs a lot of space to store all examples.
• Takes more time to classify a new example than with a model (the distance from the new example to every stored example must be calculated and compared).

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  – Small box office: $200,000
  – Medium box office: $1,000,000
  – Large box office: $3,000,000
• TV network payout:
  – Flat rate: $900,000
• Probabilities:
  – P(Small Box Office) = 0.3
  – P(Medium Box Office) = 0.6
  – P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

                              States of Nature
Decisions                     Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company       $200,000           $1,000,000          $3,000,000
Sign with TV Network          $900,000           $900,000            $900,000
Prior Probabilities           0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)     = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
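The expected-return arithmetic can be checked with a few lines of Python, using the values from the payoff table:

```python
# Expected-value comparison for the Jenny Lind example.
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
payoff_movie = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}
payoff_tv = 900_000  # flat rate, independent of box office

ev_movie = sum(probs[s] * payoff_movie[s] for s in probs)
ev_tv = sum(probs[s] * payoff_tv for s in probs)

print(round(ev_movie), round(ev_tv))  # 960000 900000 -> sign with the movie company
```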

Decision Trees
• Three types of "nodes":
  – Decision nodes, represented by squares
  – Chance nodes, represented by circles
  – Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree

(Figure: a decision node branches into Decision 1 and Decision 2; Decision 1 leads to a chance node with Event 1, Event 2 and Event 3.)

Jenny Lind Decision Tree

(Figure: a decision node with two branches. "Sign with Movie Co." leads to a chance node with outcomes Small, Medium and Large Box Office, paying $200,000, $1,000,000 and $3,000,000 with probabilities 0.3, 0.6 and 0.1; "Sign with TV Network" pays $900,000 in every case.)

Jenny Lind Decision Tree - Solved

(Figure: the same tree with expected returns filled in: ER = $960,000 on the movie branch and ER = $900,000 on the TV branch, so the movie contract is chosen, with ER = $960,000.)

Results

Performance evaluation cycle: Dataset → Data preprocessing → Feature selection → Classification (with the selected data mining tool) → Performance evaluation.

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       tp                     fn
Actual not healthy   fp                     tn

Cross-validation
• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Do this for all possibilities and average
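The splitting scheme behind 10-fold cross-validation can be sketched as follows. Weka performs this internally; the sketch only shows the fold bookkeeping, here for the paper's 683-instance dataset:

```python
# Every instance is held out exactly once; the remaining nine folds
# form the training set for that round.
def k_fold_indices(n, k=10):
    folds, start = [], 0
    fold_size, remainder = divmod(n, k)
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(683, k=10)
print([len(f) for f in folds])  # [69, 69, 69, 68, 68, 68, 68, 68, 68, 68]
```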

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and to lifestyle changes such as women having fewer children.
• Benign tumors:
  – Are usually not harmful
  – Rarely invade the tissues around them
  – Don't spread to other parts of the body
  – Can be removed, and usually don't grow back
• Malignant tumors:
  – May be a threat to life
  – Can invade nearby organs and tissues (such as the chest wall)
  – Can spread to other parts of the body
  – Often can be removed, but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: women with denser breast tissue have a higher risk
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachi et al. used Naive Bayes, decision tree and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: one for the patients who survived more than 5 years, and the other for those patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a feature selection method (a backward elimination strategy), to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
• S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiocography1, cardiocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values from the dataset, to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, that percentage refers to the full dataset; for the 683 instances the distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                      Domain
Sample Code Number             ID number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for benign, 4 for malignant

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.


Importance of the input variables

                              Domain value
Attribute                     1    2    3    4    5    6    7    8    9    10   Sum
Clump Thickness               139  50   104  79   128  33   23   44   14   69   683
Uniformity of Cell Size       373  45   52   38   30   25   19   28   6    67   683
Uniformity of Cell Shape      346  58   53   43   32   29   30   27   7    58   683
Marginal Adhesion             393  58   58   33   23   21   13   25   4    55   683
Single Epithelial Cell Size   44   376  71   48   39   40   11   21   2    31   683
Bare Nuclei                   402  30   28   19   30   4    8    21   9    132  683
Bland Chromatin               150  160  161  39   34   9    71   28   11   20   683
Normal Nucleoli               432  36   42   18   19   22   16   23   15   60   683
Mitoses                       563  35   33   12   6    3    9    8    0    14   683
Sum                           2843 850  605  333  346  192  207  233  77   516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Time to build model (in sec)       0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS (cont.)
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.

EXPERIMENTAL RESULTS (cont.)

Classifier   Class       TP Rate   FP Rate   Precision   Recall
BF Tree      Benign      0.971     0.075     0.96        0.971
BF Tree      Malignant   0.925     0.029     0.944       0.925
IBK          Benign      0.98      0.079     0.958       0.98
IBK          Malignant   0.921     0.02      0.961       0.921
SMO          Benign      0.971     0.054     0.971       0.971
SMO          Malignant   0.946     0.029     0.946       0.946

EXPERIMENTAL RESULTS (cont.)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
BF Tree      18                 221                   Malignant
IBK          435                9                     Benign
IBK          19                 220                   Malignant
SMO          431                13                    Benign
SMO          13                 226                   Malignant
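The accuracy column of the earlier comparison table can be recomputed directly from these confusion matrices; a short sketch in Python (treating benign as the positive class):

```python
# Rows give (predicted-benign, predicted-malignant) counts for the
# actual-benign and actual-malignant instances of the 683-instance dataset.
confusion = {
    "BF Tree": ((431, 13), (18, 221)),
    "IBK":     ((435, 9),  (19, 220)),
    "SMO":     ((431, 13), (13, 226)),
}

accuracy = {}
for name, ((tp, fn), (fp, tn)) in confusion.items():
    total = tp + fn + fp + tn          # 683 for every classifier
    accuracy[name] = round((tp + tn) / total * 100, 2)

print(accuracy)  # {'BF Tree': 95.46, 'IBK': 95.9, 'SMO': 96.19}
```

These values match the reported accuracies of 95.46%, 95.90% and 96.19%.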

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance (rank)
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.67332    2
Marginal Adhesion             390.0595      0.464       0.21         130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.32190    4
Normal Nucleoli               416.63061     0.487       0.237        139.11820    6
Mitoses                       191.9682      0.212       0.212        64.122733    9

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is the highest compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka.
• Using another data mining tool.
• Using alternative algorithms and techniques.

Notes on paper
• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculation.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", published in the International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003; 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

AAST-Comp eng

Page 55: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Instance Based Learning No model is built Store all training examples Any processing is delayed until a new instance

must be classified

Response

Response No response

No response

No response

Class Respond04072023 AAST-Comp eng 56

Example 3-Nearest Neighbors

Customer Age

Income

No credit cards

Response

John 35 35K 3 No

Rachel 22 50K 2 Yes

Hannah 63 200K 1 No

Tom 59 170K 1 No

Nellie 25 40K 4 Yes

David 37 50K 2 04072023 AAST-Comp eng 57

Customer Age

Income (K)

No cards

John 35 35 3

Rachel 22 50 2

Hannah 63 200 1

Tom 59 170 1

Nellie 25 40 4

David 37 50 2

ResponseNo

Yes

No

No

Yes

Distance from Davidsqrt [(35-37)2+(35-50)2 +(3-2)2]=1516sqrt [(22-37)2+(50-50)2 +(2-2)2]=15sqrt [(63-37)2+(200-50)2 +(1-2)2]=15223sqrt [(59-37)2+(170-50)2 +(1-2)2]=122sqrt [(25-37)2+(40-50)2 +(4-2)2]=1574Yes

04072023 AAST-Comp eng 58

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and a decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao et al. proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
• Dr. S. Vijayarani et al. analyzed the performance of different classification techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used for analyzing the efficiency of the algorithms are accuracy and error rate. The results show that the logistic classification function is more efficient than the multilayer perceptron and sequential minimal optimization.

BACKGROUND
• Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes; the breast-cancer-Wisconsin dataset has 699 instances.
• We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                      Domain
Sample Code Number             Id number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for benign, 4 for malignant

EVALUATION METHODS
• We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables

Domain                         1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree    IBK     SMO
Timing to build model (in sec)        0.97    0.02    0.33
Correctly classified instances         652     655     657
Incorrectly classified instances        31      28      26
Accuracy (%)                         95.46   95.90   96.19
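As a quick check, the accuracy percentages in the table follow directly from the correctly and incorrectly classified counts (683 instances in every case); a minimal sketch:

```python
# Recompute each classifier's accuracy from its correct / incorrect counts.
counts = {
    "BF Tree": (652, 31),
    "IBK": (655, 28),
    "SMO": (657, 26),
}

for name, (correct, incorrect) in counts.items():
    total = correct + incorrect          # 683 instances for every classifier
    print(f"{name}: {100.0 * correct / total:.2f}%")
# BF Tree: 95.46%
# IBK: 95.90%
# SMO: 96.19%
```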

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
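Taking malignant as the positive class (an assumption; the slide does not say which class is positive), these formulas can be applied to the SMO confusion matrix reported in this section (TP = 226, FN = 13, FP = 13, TN = 431); a minimal sketch:

```python
# SMO confusion matrix counts, with malignant treated as the positive class.
tp, fn = 226, 13   # malignant correctly / wrongly predicted
fp, tn = 13, 431   # benign wrongly flagged / correctly predicted

sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3))  # 0.946
print(round(specificity, 3))  # 0.971
print(round(accuracy, 4))     # 0.9619
```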

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
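The per-class precision and recall figures reported above can be reproduced (to within rounding) from these confusion matrices; a minimal sketch:

```python
# rows = actual class, columns = predicted class; order: (benign, malignant)
matrices = {
    "BF Tree": [[431, 13], [18, 221]],
    "IBK":     [[435, 9],  [19, 220]],
    "SMO":     [[431, 13], [13, 226]],
}

def precision_recall(m, cls):
    """Precision and recall for class index cls (0 = benign, 1 = malignant)."""
    tp = m[cls][cls]
    predicted = m[0][cls] + m[1][cls]   # column sum: everything predicted as cls
    actual = sum(m[cls])                # row sum: everything actually of class cls
    return tp / predicted, tp / actual

for name, m in matrices.items():
    for cls, label in enumerate(("benign", "malignant")):
        p, r = precision_recall(m, cls)
        print(f"{name} {label}: precision={p:.3f} recall={r:.3f}")
```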

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.0595      0.464       0.210        130.2445       7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.9682      0.212       0.212        64.122733      9
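The Info Gain column is measured against the entropy of the overall class distribution (444 benign vs. 239 malignant, about 0.93 bits), so no attribute can score above that baseline. A minimal sketch of the computation, using a hypothetical perfectly-separating split for illustration:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Class distribution of the 683-instance Wisconsin dataset.
base = entropy([444, 239])
print(round(base, 3))  # 0.934

def info_gain(splits):
    """splits: per-attribute-value class counts, e.g. [[400, 2], [44, 237]]."""
    total = sum(sum(s) for s in splits)
    remainder = sum(sum(s) / total * entropy(s) for s in splits)
    return base - remainder

# A hypothetical attribute that separates the classes perfectly would
# achieve the maximum possible gain, equal to the baseline entropy.
print(round(info_gain([[444, 0], [0, 239]]), 3))  # 0.934
```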

CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• SMO shows a higher level of performance than the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.


Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification: A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105

Example: 3-Nearest Neighbors

Customer   Age   Income   No. credit cards   Response
John       35    35K      3                  No
Rachel     22    50K      2                  Yes
Hannah     63    200K     1                  No
Tom        59    170K     1                  No
Nellie     25    40K      4                  Yes
David      37    50K      2                  ?

Customer   Age   Income (K)   No. cards   Response   Distance from David
John       35    35           3           No         sqrt[(35-37)^2 + (35-50)^2 + (3-2)^2] = 15.16
Rachel     22    50           2           Yes        sqrt[(22-37)^2 + (50-50)^2 + (2-2)^2] = 15
Hannah     63    200          1           No         sqrt[(63-37)^2 + (200-50)^2 + (1-2)^2] = 152.23
Tom        59    170          1           No         sqrt[(59-37)^2 + (170-50)^2 + (1-2)^2] = 122
Nellie     25    40           4           Yes        sqrt[(25-37)^2 + (40-50)^2 + (4-2)^2] = 15.74
David      37    50           2           Yes        (3 nearest: Rachel, John, Nellie → majority vote Yes)
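The walk-through above can be sketched as a tiny 3-NN classifier over the same toy data, using plain Euclidean distance on the raw features as the slide does:

```python
import math
from collections import Counter

# Toy customer data from the slide: (age, income in K, number of credit cards).
train = {
    "John":   ((35, 35, 3),  "No"),
    "Rachel": ((22, 50, 2),  "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4),  "Yes"),
}
david = (37, 50, 2)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Rank the training customers by distance from David and vote among the top 3.
ranked = sorted(train, key=lambda name: euclidean(train[name][0], david))
neighbors = ranked[:3]
prediction = Counter(train[name][1] for name in neighbors).most_common(1)[0][0]

print(neighbors)    # ['Rachel', 'John', 'Nellie']
print(prediction)   # Yes
```

Note that the raw features are on very different scales, so income dominates the distance; in practice k-NN is usually preceded by feature scaling.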

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible: easy to explain the prediction
• Robust to noisy data, by averaging the k nearest neighbors

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (need to calculate and compare the distance from the new example to all other examples)

Decision Tree
• Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.
• The decision tree can be thought of as a set of sentences written in propositional logic.
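The recursive break-down described above can be sketched in a few lines. This is a simplified ID3-style splitter (information gain to pick attributes, majority-class leaves), not the exact algorithm used in the paper, and the toy rows are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Recursively split (rows, labels) on the highest-information-gain attribute."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    def gain(a):
        values = Counter(r[a] for r in rows)
        rem = sum(n / len(rows) * entropy([l for r, l in zip(rows, labels) if r[a] == v])
                  for v, n in values.items())
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    return {(best, v): build_tree([r for r in rows if r[best] == v],
                                  [l for r, l in zip(rows, labels) if r[best] == v],
                                  [a for a in attrs if a != best])
            for v in set(r[best] for r in rows)}

def classify(tree, row):
    while isinstance(tree, dict):
        tree = next(sub for (a, v), sub in tree.items() if row[a] == v)
    return tree

# Hypothetical toy training set.
rows = [{"size": "small", "nuclei": "normal"}, {"size": "small", "nuclei": "bare"},
        {"size": "large", "nuclei": "normal"}, {"size": "large", "nuclei": "bare"}]
labels = ["benign", "benign", "benign", "malignant"]
tree = build_tree(rows, labels, ["size", "nuclei"])
print(tree)
```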

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities
• Movie company payouts:
  - Small box office: $200,000
  - Medium box office: $1,000,000
  - Large box office: $3,000,000
• TV network payout:
  - Flat rate: $900,000
• Probabilities:
  - P(Small box office) = 0.3
  - P(Medium box office) = 0.6
  - P(Large box office) = 0.1

Jenny Lind - Payoff Table

                              States of Nature
Decisions                     Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company       $200,000           $1,000,000          $3,000,000
Sign with TV Network          $900,000           $900,000            $900,000
Prior Probabilities           0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
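The expected-value comparison is a one-liner per decision; a minimal sketch:

```python
# Payoffs per state of nature (small, medium, large box office) and their priors.
probs = (0.3, 0.6, 0.1)
payoffs = {
    "movie": (200_000, 1_000_000, 3_000_000),
    "tv": (900_000, 900_000, 900_000),
}

expected = {d: sum(p * x for p, x in zip(probs, pay)) for d, pay in payoffs.items()}
best = max(expected, key=expected.get)

print(expected)  # {'movie': 960000.0, 'tv': 900000.0}
print(best)      # movie
```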

Decision Trees
• Three types of "nodes":
  - Decision nodes, represented by squares
  - Chance nodes, represented by circles
  - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree
[Diagram: a decision node branching into Decision 1 and Decision 2, with chance nodes branching into Event 1, Event 2, and Event 3]

Jenny Lind Decision Tree
[Diagram: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node over Small / Medium / Large box office, with payoffs $200,000 / $1,000,000 / $3,000,000 on the movie branch and $900,000 on every TV branch]

Jenny Lind Decision Tree
[Diagram: the same tree annotated with the branch probabilities 0.3 / 0.6 / 0.1 at each chance node, with the expected return (ER) at each chance node still to be computed]

Jenny Lind Decision Tree - Solved
[Diagram: the solved tree, with ER = $960,000 on the movie branch and ER = $900,000 on the TV branch, so the movie branch is selected]

Results
[Diagram: the performance evaluation cycle - dataset → data preprocessing → feature selection → data mining tool selection → classification → performance evaluation]

Evaluation Metrics

                       Predicted as healthy   Predicted as unhealthy
Actual healthy         tp                     fn
Actual not healthy     fp                     tn

Cross-validation
• Correctly classified instances: 143 (95.33%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all 10 possibilities and average
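The fold-splitting step can be sketched with the standard library alone. This is a simplified version of what Weka does internally (no stratification), and `ten_fold_indices` is an illustrative helper, not a Weka API:

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle instance indices and cut them into 10 near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(150)
for test_fold in folds:
    train_idx = [i for f in folds if f is not test_fold for i in f]
    # ...train on train_idx, evaluate on test_fold, accumulate accuracy...

print([len(f) for f in folds])  # [15, 15, 15, 15, 15, 15, 15, 15, 15, 15]
```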

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The aim is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software, and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and to lifestyle changes such as women having fewer children.
• Benign tumors:
  - Are usually not harmful
  - Rarely invade the tissues around them
  - Don't spread to other parts of the body
  - Can be removed, and usually don't grow back
• Malignant tumors:
  - May be a threat to life
  - Can invade nearby organs and tissues (such as the chest wall)
  - Can spread to other parts of the body
  - Often can be removed, but sometimes grow back

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 57: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Example: 3-Nearest Neighbors

Customer  Age  Income (K)  No. cards  Response  Distance from David
John      35   35          3          No        sqrt[(35-37)^2+(35-50)^2+(3-2)^2] = 15.16
Rachel    22   50          2          Yes       sqrt[(22-37)^2+(50-50)^2+(2-2)^2] = 15
Hannah    63   200         1          No        sqrt[(63-37)^2+(200-50)^2+(1-2)^2] = 152.23
Tom       59   170         1          No        sqrt[(59-37)^2+(170-50)^2+(1-2)^2] = 122
Nellie    25   40          4          Yes       sqrt[(25-37)^2+(40-50)^2+(4-2)^2] = 15.74
David     37   50          2          ?

The three nearest neighbours to David are Rachel, John and Nellie; their responses (Yes, No, Yes) give the majority prediction: Yes.
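The 3-NN vote on this toy customer table can be sketched in plain Python (a minimal sketch; the data is from the slide, the function names are illustrative):

```python
import math

# Toy customer table from the slide: (age, income_k, cards) -> response
training = {
    "John":   ((35, 35, 3),  "No"),
    "Rachel": ((22, 50, 2),  "Yes"),
    "Hannah": ((63, 200, 1), "No"),
    "Tom":    ((59, 170, 1), "No"),
    "Nellie": ((25, 40, 4),  "Yes"),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, data, k=3):
    # Sort stored examples by distance and take a majority vote among the k nearest.
    nearest = sorted(data.values(), key=lambda rec: euclidean(query, rec[0]))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

david = (37, 50, 2)
print(knn_predict(david, training))  # Rachel, John and Nellie are nearest -> Yes
```

Note that every prediction scans the whole table, which is exactly the space and time weakness discussed on the next slide.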

Strengths and Weaknesses

Strengths:
• Simple to implement and use
• Comprehensible – easy to explain a prediction
• Robust to noisy data, by averaging the k nearest neighbours

Weaknesses:
• Needs a lot of space to store all examples
• Takes more time to classify a new example than with a model (the distance from the new example to all stored examples must be calculated and compared)

Decision Tree

– Decision tree induction is a simple but powerful learning paradigm. A set of training examples is broken down into smaller and smaller subsets, while at the same time an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

– The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000

• TV network payout:
– Flat rate: $900,000

• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions               | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000         | $1,000,000        | $3,000,000
Sign with TV Network    | $900,000         | $900,000          | $900,000
Prior probabilities     | 0.3              | 0.6               | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000)
          = $960,000 = EV(UII), or EV(Best)

EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)
          = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
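The expected-return arithmetic above, restated as a quick sketch (values are from the slide; the variable names are illustrative):

```python
# Expected return = sum over states of nature of P(state) * payoff(state).
probs = {"small": 0.3, "medium": 0.6, "large": 0.1}
movie_payoff = {"small": 200_000, "medium": 1_000_000, "large": 3_000_000}

ev_movie = sum(probs[s] * movie_payoff[s] for s in probs)
ev_tv = sum(probs[s] * 900_000 for s in probs)  # flat rate in every state

print(round(ev_movie), round(ev_tv))  # 960000 900000
```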

Decision Trees

• Three types of "nodes":
– Decision nodes, represented by squares
– Chance nodes, represented by circles
– Terminal nodes, represented by triangles (optional)

• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.

• Create the tree from left to right.
• Solve the tree from right to left.
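The right-to-left solving rule can be sketched recursively (a hypothetical encoding of the tree, not from the slides; decision nodes are assumed to appear only at the root):

```python
# Solve a decision tree from the leaves back to the root:
# chance nodes take probability-weighted expected values,
# decision nodes keep only the best branch.
def solve(node):
    kind = node[0]
    if kind == "terminal":
        return node[1]
    if kind == "chance":
        # node = ("chance", [(prob, subtree), ...])
        return sum(p * solve(child) for p, child in node[1])
    if kind == "decision":
        # node = ("decision", {name: subtree, ...}); returns (best branch, its value)
        values = {name: solve(child) for name, child in node[1].items()}
        best = max(values, key=values.get)
        return best, values[best]

tree = ("decision", {
    "movie": ("chance", [(0.3, ("terminal", 200_000)),
                         (0.6, ("terminal", 1_000_000)),
                         (0.1, ("terminal", 3_000_000))]),
    "tv":    ("terminal", 900_000),
})
best, value = solve(tree)
print(best, round(value))  # movie 960000
```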

Example Decision Tree

(figure: a square decision node with branches Decision 1 and Decision 2; Decision 1 leads to a circular chance node with branches Event 1, Event 2 and Event 3)

Jenny Lind Decision Tree

(figure: a decision node with two branches, "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with Small, Medium and Large box-office outcomes, paying $200,000, $1,000,000 and $3,000,000 on the movie branch and $900,000 in every outcome on the network branch)

Jenny Lind Decision Tree

(figure: the same tree annotated with the outcome probabilities 0.3, 0.6 and 0.1 on each chance branch, and expected-return (ER) placeholders at the nodes)

Jenny Lind Decision Tree - Solved

(figure: the solved tree; with probabilities 0.3, 0.6 and 0.1, the movie branch has ER = $960,000 and the network branch ER = $900,000, so the movie contract is chosen at the decision node with ER = $960,000)

Results

(figure: the performance-evaluation cycle – dataset → data preprocessing → feature selection → data mining tool selection → classification → performance evaluation)

Evaluation Metrics

                   | Predicted as healthy | Predicted as unhealthy
Actual healthy     | tp                   | fn
Actual not healthy | fp                   | tn

Cross-validation

• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all possibilities and average
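The three steps above can be sketched without any ML library (a minimal sketch; the stand-in "classifier" simply predicts the majority class of its training folds):

```python
import random

def k_fold_indices(n, k=10, seed=1):
    # Shuffle the indices once, then deal them into k folds of (nearly) equal size.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(labels, k=10):
    folds = k_fold_indices(len(labels), k)
    accs = []
    for i, test_idx in enumerate(folds):
        # Train on the other k-1 folds: here, just find their majority class.
        train_labels = [labels[j] for f, fold in enumerate(folds) if f != i for j in fold]
        majority = max(set(train_labels), key=train_labels.count)
        # Test on the held-out fold, then move on to the next split.
        correct = sum(1 for j in test_idx if labels[j] == majority)
        accs.append(correct / len(test_idx))
    return sum(accs) / len(accs)  # average accuracy over the k splits

print(cross_validate(["benign"] * 9 + ["malignant"]))  # 0.9
```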

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

• Benign tumors:
– Are usually not harmful
– Rarely invade the tissues around them
– Don't spread to other parts of the body
– Can be removed, and usually don't grow back

• Malignant tumors:
– May be a threat to life
– Can invade nearby organs and tissues (such as the chest wall)
– Can spread to other parts of the body
– Often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (2)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict survivability for breast cancer patients.

• Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (2)

• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.

• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict survivability for heart disease patients.

BACKGROUND (3)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict survivability for heart disease patients.

• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the decision tree algorithm C4.5.

• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (4)

• Dr S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms were classification accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (5)

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets, cardiocography1 and cardiocography2, and on other datasets not related to the medical domain.

• B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
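The missing-value cleaning step described above can be sketched as follows (a minimal sketch; the file name and the comma-separated UCI layout with "?" for missing values are assumptions):

```python
# Drop the rows of the UCI breast-cancer-wisconsin file that contain a
# missing value, marked "?" in the Bare Nuclei column of 16 instances.
def load_clean(path="breast-cancer-wisconsin.data"):
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split(",")
            if fields != [""] and "?" not in fields:
                # id, 9 integer attributes, class label (2 = benign, 4 = malignant)
                rows.append([int(x) for x in fields])
    return rows
```

Running this on the original 699-instance file should leave the 683 instances used in the paper.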

Attribute                   | Domain
Sample Code Number          | id number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software, issued under the GNU General Public License.

EXPERIMENTAL RESULTS

(figures on slides 88-89: experimental results, not recoverable from the transcript)

importance of the input variables

Attribute \ Domain          |    1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             |  139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     |  373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    |  346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           |  393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |   44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 |  402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             |  150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             |  432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     |  563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria              | BF Tree | IBK   | SMO
Time to build model (in sec)     | 0.97    | 0.02  | 0.33
Correctly classified instances   | 652     | 655   | 657
Incorrectly classified instances | 31      | 28    | 26
Accuracy (%)                     | 95.46   | 95.90 | 96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
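The definitions above, applied to the SMO confusion matrix reported below (benign: 431 correct, 13 misclassified; malignant: 226 correct, 13 misclassified), reproduce the 96.19% accuracy figure. A minimal sketch:

```python
# Sensitivity, specificity and accuracy from the four confusion-matrix counts.
def metrics(tp, fn, fp, tn):
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Benign treated as the positive class, using SMO's counts.
m = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(m["accuracy"] * 100, 2))  # 96.19
```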

EXPERIMENTAL RESULTS

Classifier | TP    | FP    | Precision | Recall | Class
BF Tree    | 0.971 | 0.075 | 0.96      | 0.971  | Benign
BF Tree    | 0.925 | 0.029 | 0.944     | 0.925  | Malignant
IBK        | 0.98  | 0.079 | 0.958     | 0.98   | Benign
IBK        | 0.921 | 0.02  | 0.961     | 0.921  | Malignant
SMO        | 0.971 | 0.054 | 0.971     | 0.971  | Benign
SMO        | 0.946 | 0.029 | 0.946     | 0.946  | Malignant

EXPERIMENTAL RESULTS

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
BF Tree    | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
IBK        | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
SMO        | 13     | 226       | Malignant

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9
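The Info Gain score used in this ranking can be illustrated with a small sketch (toy data, not the actual dataset; the function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    # H = -sum p * log2(p) over the class proportions.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attribute_values, labels):
    # H(class) minus the weighted entropy of the class within each attribute value.
    n = len(labels)
    remainder = 0.0
    for v in set(attribute_values):
        subset = [l for a, l in zip(attribute_values, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy example: an attribute that perfectly separates the two classes has
# information gain equal to the full class entropy (here 1 bit).
labels = ["benign", "benign", "malignant", "malignant"]
print(info_gain([1, 1, 10, 10], labels))  # 1.0
```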

CONCLUSION

• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• The performance of SMO is high compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, I.A.f.R.o.C. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 58: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Strengths and WeaknessesStrengths Simple to implement and use Comprehensible ndash easy to explain prediction Robust to noisy data by averaging k-nearest

neighbors

Weaknesses Need a lot of space to store all examples Takes more time to classify a new example than

with a model (need to calculate and compare distance from new example to all other examples)

04072023 AAST-Comp eng 59

Decision Tree

04072023 AAST-Comp eng 60

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Sum
Clump Thickness | 139 | 50 | 104 | 79 | 128 | 33 | 23 | 44 | 14 | 69 | 683
Uniformity of Cell Size | 373 | 45 | 52 | 38 | 30 | 25 | 19 | 28 | 6 | 67 | 683
Uniformity of Cell Shape | 346 | 58 | 53 | 43 | 32 | 29 | 30 | 27 | 7 | 58 | 683
Marginal Adhesion | 393 | 58 | 58 | 33 | 23 | 21 | 13 | 25 | 4 | 55 | 683
Single Epithelial Cell Size | 44 | 376 | 71 | 48 | 39 | 40 | 11 | 21 | 2 | 31 | 683
Bare Nuclei | 402 | 30 | 28 | 19 | 30 | 4 | 8 | 21 | 9 | 132 | 683
Bland Chromatin | 150 | 160 | 161 | 39 | 34 | 9 | 71 | 28 | 11 | 20 | 683
Normal Nucleoli | 432 | 36 | 42 | 18 | 19 | 22 | 16 | 23 | 15 | 60 | 683
Mitoses | 563 | 35 | 33 | 12 | 6 | 3 | 9 | 8 | 0 | 14 | 683
Sum | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS


Evaluation Criteria | BF Tree | IBK | SMO
Time to build model (sec) | 0.97 | 0.02 | 0.33
Correctly classified instances | 652 | 655 | 657
Incorrectly classified instances | 31 | 28 | 26
Accuracy (%) | 95.46 | 95.90 | 96.19

EXPERIMENTAL RESULTS

 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
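As a small illustration (not the paper's Weka workflow), the definitions above can be computed directly from raw confusion-matrix counts; the sketch below plugs in the SMO confusion matrix reported later, with benign as the positive class.

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy
    from raw confusion-matrix counts, as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO on the 683-instance dataset (benign = positive class):
# TP = 431, FN = 13, FP = 13, TN = 226.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.971 0.946 0.9619
```

The rounded values match the per-class table (TP rate 0.971 for benign, 0.946 for malignant) and the 96.19% accuracy reported for SMO.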

EXPERIMENTAL RESULTS

Classifier | TP Rate | FP Rate | Precision | Recall | Class
BF Tree | 0.971 | 0.075 | 0.96 | 0.971 | Benign
BF Tree | 0.925 | 0.029 | 0.944 | 0.925 | Malignant
IBK | 0.98 | 0.079 | 0.958 | 0.98 | Benign
IBK | 0.921 | 0.02 | 0.961 | 0.921 | Malignant
SMO | 0.971 | 0.054 | 0.971 | 0.971 | Benign
SMO | 0.946 | 0.029 | 0.946 | 0.946 | Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier | Predicted Benign | Predicted Malignant | Actual Class
BF Tree | 431 | 13 | Benign
BF Tree | 18 | 221 | Malignant
IBK | 435 | 9 | Benign
IBK | 19 | 220 | Malignant
SMO | 431 | 13 | Benign
SMO | 13 | 226 | Malignant
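The accuracies in the earlier comparison table follow from these confusion matrices; the sketch below (an illustration, not the paper's code) recomputes them on the 683 instances.

```python
# Confusion matrices from the table above: for each classifier,
# ((actual-benign predicted benign, predicted malignant),
#  (actual-malignant predicted benign, predicted malignant)).
matrices = {
    "BF Tree": ((431, 13), (18, 221)),
    "IBK":     ((435, 9),  (19, 220)),
    "SMO":     ((431, 13), (13, 226)),
}

for name, ((bb, bm), (mb, mm)) in matrices.items():
    correct = bb + mm          # diagonal of the confusion matrix
    total = bb + bm + mb + mm  # all 683 instances
    print(name, correct, total - correct, round(100 * correct / total, 2))
```

This reproduces 652/655/657 correctly classified instances and the 95.46% / 95.90% / 96.19% accuracies.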

Importance of the input variables

Variable | Chi-squared | Info Gain | Gain Ratio | Average | Importance Rank
Clump Thickness | 378.08158 | 0.464 | 0.152 | 126.232526 | 8
Uniformity of Cell Size | 539.79308 | 0.702 | 0.3 | 180.265026 | 1
Uniformity of Cell Shape | 523.07097 | 0.677 | 0.272 | 174.673323 | 2
Marginal Adhesion | 390.0595 | 0.464 | 0.21 | 130.2445 | 7
Single Epithelial Cell Size | 447.86118 | 0.534 | 0.233 | 149.542726 | 5
Bare Nuclei | 489.00953 | 0.603 | 0.303 | 163.305176 | 3
Bland Chromatin | 453.20971 | 0.555 | 0.201 | 151.321903 | 4
Normal Nucleoli | 416.63061 | 0.487 | 0.237 | 139.118203 | 6
Mitoses | 191.9682 | 0.212 | 0.212 | 64.122733 | 9
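The Info Gain column above is, in principle, an entropy calculation. The sketch below (a generic illustration on a toy split, not the paper's Weka computation) shows how information gain for one attribute is derived.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_gain(rows, labels, attr_index):
    """Information gain from splitting (rows, labels) on one attribute:
    base entropy minus the size-weighted entropy of each partition."""
    base = entropy(labels)
    partitions = {}
    for row, y in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# Toy example: an attribute value that perfectly separates the classes
# yields the maximum gain of 1 bit.
rows = [(1,), (1,), (10,), (10,)]
labels = ["benign", "benign", "malignant", "malignant"]
print(info_gain(rows, labels, 0))  # 1.0
```

On the real dataset the same computation over all 683 instances yields values such as 0.702 for Uniformity of Cell Size, the top-ranked attribute.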

CONCLUSION

 The accuracy of classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
 SMO shows the highest performance compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

 Use an updated version of Weka; use another data mining tool; apply alternative algorithms and techniques.

Notes on paper

 Spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; unclear charts; no contributions.

Comparison

 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers," International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem," Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive & descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Decision Tree

Decision tree induction is a simple but powerful learning paradigm. In this method, a set of training examples is broken down into smaller and smaller subsets while, at the same time, an associated decision tree gets incrementally developed. At the end of the learning process, a decision tree covering the training set is returned.

The decision tree can be thought of as a set of sentences written in propositional logic.

Example

Jenny Lind is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum; but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?

Payouts and Probabilities

• Movie company payouts:
 - Small box office: $200,000
 - Medium box office: $1,000,000
 - Large box office: $3,000,000
• TV network payout:
 - Flat rate: $900,000
• Probabilities:
 - P(Small Box Office) = 0.3
 - P(Medium Box Office) = 0.6
 - P(Large Box Office) = 0.1

Jenny Lind - Payoff Table

Decisions | Small Box Office | Medium Box Office | Large Box Office
Sign with Movie Company | $200,000 | $1,000,000 | $3,000,000
Sign with TV Network | $900,000 | $900,000 | $900,000
Prior Probabilities | 0.3 | 0.6 | 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000 = EV(UII), or EV(Best)

EV(tv) = 0.3(900,000) + 0.6(900,000) + 0.1(900,000) = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
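The expected-value arithmetic above can be sketched in a few lines (an illustration of the criterion, not part of the original slides):

```python
def expected_value(payoffs, probs):
    """Expected monetary value of one decision branch."""
    return sum(x * p for x, p in zip(payoffs, probs))

probs = [0.3, 0.6, 0.1]  # P(small), P(medium), P(large)
ev_movie = expected_value([200_000, 1_000_000, 3_000_000], probs)
ev_tv = expected_value([900_000, 900_000, 900_000], probs)
print(round(ev_movie), round(ev_tv))  # 960000 900000
best = max(("movie", ev_movie), ("tv", ev_tv), key=lambda d: d[1])
print(best[0])  # movie
```

This is exactly the computation performed at each chance node when the tree is solved from right to left.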

Decision Trees

• Three types of "nodes":
 - Decision nodes, represented by squares
 - Chance nodes, represented by circles (O)
 - Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right; solve the tree from right to left.

Example Decision Tree (diagram): a decision node branching into Decision 1 and Decision 2, and a chance node branching into Event 1, Event 2 and Event 3.

Jenny Lind Decision Tree (diagram): the decision node branches into "Sign with Movie Co." and "Sign with TV Network"; each leads to a chance node with branches Small, Medium and Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 on the movie branch and $900,000 on every TV branch.

Jenny Lind Decision Tree (diagram, annotated): the same tree with branch probabilities 0.3, 0.6 and 0.1 on each chance branch and an expected-return (ER) label at each chance node.

Jenny Lind Decision Tree - Solved (diagram): with branch probabilities 0.3, 0.6 and 0.1, the movie chance node has ER = $960,000 and the TV node has ER = $900,000, so the solved tree (ER = $960,000) selects signing with the movie company.

Results (pipeline diagram): dataset → data preprocessing → feature selection → data mining tool selection → classification → performance evaluation, repeated as a cycle.

Evaluation Metrics

 | Predicted as healthy | Predicted as unhealthy
Actual healthy | tp | fn
Actual not healthy | fp | tn

Cross-validation

• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
 - Split the data into 10 equal-sized pieces
 - Train on 9 pieces and test on the remainder
 - Do this for all possibilities and average
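The fold bookkeeping described above can be sketched as follows (a minimal illustration; Weka performs this internally with stratification):

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k roughly equal folds. Each fold serves
    as the test set once, with the remaining k-1 folds as training data."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

# With 150 instances (143 + 7, as in the run above),
# each of the 10 test folds holds 15 instances.
sizes = [len(test) for _, test in kfold_indices(150, 10)]
print(sizes)
```

Averaging the accuracy over the 10 held-out folds gives the cross-validated estimate reported by Weka.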

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract

 The aim of this paper is to investigate the performance of different classification techniques.
 The goal is to develop accurate prediction models for breast cancer using data mining techniques.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction

 Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
 • Are usually not harmful
 • Rarely invade the tissues around them
 • Don't spread to other parts of the body
 • Can be removed and usually don't grow back
 Malignant tumors:
 • May be a threat to life
 • Can invade nearby organs and tissues (such as the chest wall)
 • Can spread to other parts of the body
 • Often can be removed but sometimes grow back

Risk factors

 Gender. Age. Genetic risk factors. Family history. Personal history of breast cancer. Race (white or black). Dense breast tissue (denser breast tissue carries a higher risk). Certain benign (not cancer) breast problems. Lobular carcinoma in situ. Menstrual periods.

Risk factors (2)

 Breast radiation early in life. Treatment with the drug DES (diethylstilbestrol) during pregnancy. Not having children, or having them later in life. Certain kinds of birth control. Using hormone therapy after menopause. Not breastfeeding. Alcohol. Being overweight or obese.

BACKGROUND

 Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict survivability for breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND

 Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years, and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict survivability for heart disease patients.

BACKGROUND

 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3) and decision table (DT) to predict survivability for heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

 Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work, with clustering accuracy and error rate as the performance factors for analysing the efficiency of the algorithms. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND

 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 60: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

ndash Decision tree induction is a simple but powerful learning paradigm In this method a set of training examples is broken down into smaller and smaller subsets while at the same time an associated decision tree get incrementally developed At the end of the learning process a decision tree covering the training set is returned

ndash The decision tree can be thought of as a set sentences written propositional logic

04072023 AAST-Comp eng 61

Example

Jenny Lind is a writer of romance novels A movie company and a TV network both want exclusive rights to one of her more popular works If she signs with the network she will receive a single lump sum but if she signs with the movie company the amount she will receive depends on the market response to her movie What should she do

04072023 AAST-Comp eng 62

Payouts and Probabilitiesbull Movie company Payouts

ndash Small box office - $200000ndash Medium box office - $1000000ndash Large box office - $3000000

bull TV Network Payoutndash Flat rate - $900000

bull Probabilitiesndash P(Small Box Office) = 03ndash P(Medium Box Office) = 06ndash P(Large Box Office) = 01

04072023 AAST-Comp eng 63

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved: folding back with probabilities 0.3, 0.6 and 0.1 gives ER = $960,000 for the movie contract and ER = $900,000 for the TV contract, so the root decision takes the movie branch with ER = $960,000.

Results

Performance evaluation cycle: dataset → data preprocessing → feature selection → classification (with the selected data mining tool) → performance evaluation.

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       TP                     FN
Actual not healthy   FP                     TN


Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default is 10-fold cross validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
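The bullets above can be sketched with the standard library alone (the 150-instance count and the 143-correct result are the slide's own figures; the fold-splitting helper is illustrative):

```python
import random

def ten_fold_indices(n, k=10, seed=1):
    """Shuffle range(n) and split it into k equal-sized folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size = n // k
    return [idx[i * size:(i + 1) * size] for i in range(k)]

folds = ten_fold_indices(150)  # 150 instances, as on the slide

# Each fold serves as the test set once; the other 9 form the training set.
for i, test in enumerate(folds):
    train = [j for f in folds[:i] + folds[i + 1:] for j in f]
    # ...train a classifier on `train`, evaluate on `test`, then average...

# The slide's headline number: 143 of 150 correct is 95.33 % accuracy.
print(round(143 / 150 * 100, 2))  # 95.33
```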


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.


Introduction
• Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  - Are usually not harmful
  - Rarely invade the tissues around them
  - Don't spread to other parts of the body
  - Can be removed and usually don't grow back
• Malignant tumors:
  - May be a threat to life
  - Can invade nearby organs and tissues (such as the chest wall)
  - Can spread to other parts of the body
  - Often can be removed, but sometimes grow back


Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods


Risk factors (cont.)
• Breast radiation early in life
• Treatment with DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese


BACKGROUND
• Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.


BACKGROUND
• Bellaachia et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.


BACKGROUND
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.


BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.


BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.


BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• The breast-cancer-Wisconsin dataset has 699 instances.
• We removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution (699 instances): Benign: 458 (65.5%), Malignant: 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer hold; for the 683 instances the distribution is benign 444 (65%) and malignant 239 (35%).
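The corrected distribution in the note above is one line of arithmetic (counts taken from this summary):

```python
# 699 instances: 458 benign / 241 malignant; the 16 dropped instances
# with missing values were 14 benign and 2 malignant.
benign, malignant = 458 - 14, 241 - 2  # 444, 239
total = benign + malignant             # 683

print(round(benign / total * 100, 1))     # 65.0
print(round(malignant / total * 100, 1))  # 35.0
```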


Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant


EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.


EXPERIMENTAL RESULTS


importance of the input variables


Attribute / value             1     2     3     4     5     6     7     8     9    10    Sum
Clump Thickness               139   50    104   79    128   33    23    44    14   69    683
Uniformity of Cell Size       373   45    52    38    30    25    19    28    6    67    683
Uniformity of Cell Shape      346   58    53    43    32    29    30    27    7    58    683
Marginal Adhesion             393   58    58    33    23    21    13    25    4    55    683
Single Epithelial Cell Size   44    376   71    48    39    40    11    21    2    31    683
Bare Nuclei                   402   30    28    19    30    4     8     21    9    132   683
Bland Chromatin               150   160   161   39    34    9     71    28    11   20    683
Normal Nucleoli               432   36    42    18    19    22    16    23    15   60    683
Mitoses                       563   35    33    12    6     3     9     8     0    14    683
Sum                           2843  850   605   333   346   192   207   233   77   516

(The source listed "Bare Nuclei" twice; the second row is Bland Chromatin, the only attribute otherwise missing.)

EXPERIMENTAL RESULTS


Evaluation Criteria                 BF Tree   IBK     SMO
Time to Build Model (in sec)        0.97      0.02    0.33
Correctly Classified Instances      652       655     657
Incorrectly Classified Instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
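Plugging the SMO confusion matrix from the results (431/13 for benign, 13/226 for malignant; benign is taken here as the positive class) into these definitions reproduces the reported figures:

```python
tp, fn = 431, 13   # benign samples predicted correctly / wrongly
fp, tn = 13, 226   # malignant samples predicted wrongly / correctly

sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate
accuracy = (tp + tn) / (tp + fp + tn + fn)

print(round(sensitivity, 3))     # 0.971
print(round(specificity, 3))     # 0.946
print(round(accuracy * 100, 2))  # 96.19
```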

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant


EXPERIMENTAL RESULTS

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant

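The per-class precision and recall values in the earlier results table follow directly from these matrices; a sketch for SMO:

```python
# SMO confusion matrix: keys are (actual class, predicted class).
cm = {("benign", "benign"): 431, ("benign", "malignant"): 13,
      ("malignant", "benign"): 13, ("malignant", "malignant"): 226}

def precision(cls):
    """Correct predictions of cls over all predictions of cls."""
    predicted = sum(v for (_, p), v in cm.items() if p == cls)
    return cm[(cls, cls)] / predicted

def recall(cls):
    """Correct predictions of cls over all actual members of cls."""
    actual = sum(v for (a, _), v in cm.items() if a == cls)
    return cm[(cls, cls)] / actual

print(round(precision("benign"), 3), round(recall("benign"), 3))        # 0.971 0.971
print(round(precision("malignant"), 3), round(recall("malignant"), 3))  # 0.946 0.946
```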

importance of the input variables


Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.05950     0.464       0.210        130.244500     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.96820     0.212       0.212        64.122733      9
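The "Average Rank" column is simply the mean of the three scores, and sorting by it reproduces the importance order (a sketch; the decimal points are restored from the garbled source table, so the exact figures are a reconstruction):

```python
scores = {  # attribute: (chi-squared, info gain, gain ratio)
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

avg = {k: sum(v) / 3 for k, v in scores.items()}
ranking = sorted(avg, key=avg.get, reverse=True)

print(ranking[0])   # Uniformity of Cell Size
print(ranking[-1])  # Mitoses
```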


CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques


Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions


Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and performs a fusion between classifiers.


References


[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).

[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.


[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of computer vision and pattern recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the international conference on engineering applications of neural networks, pp. 427-430, 1996.


[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.



[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers: Which Hyperplane?
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 62: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Payouts and Probabilities
• Movie company payouts:
– Small box office: $200,000
– Medium box office: $1,000,000
– Large box office: $3,000,000
• TV network payout:
– Flat rate: $900,000
• Probabilities:
– P(Small Box Office) = 0.3
– P(Medium Box Office) = 0.6
– P(Large Box Office) = 0.1

Jenny Lind – Payoff Table

Decisions                    States of Nature
                             Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company      $200,000           $1,000,000          $3,000,000
Sign with TV Network         $900,000           $900,000            $900,000
Prior probabilities          0.3                0.6                 0.1

Using Expected Return Criteria

EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000)
          = $960,000 = EV(Best)

EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)
          = $900,000

Therefore, using this criterion, Jenny should select the movie contract.
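The expected-return computation above can be sketched in a few lines of Python (a minimal illustration of the slide's arithmetic; the dictionary and variable names are ours, not from the slides):

```python
# Expected return for each of Jenny Lind's two contract options.
probs = [0.3, 0.6, 0.1]  # P(small), P(medium), P(large box office)
payoffs = {
    "movie": [200_000, 1_000_000, 3_000_000],
    "tv": [900_000, 900_000, 900_000],  # flat rate regardless of outcome
}

# EV(decision) = sum over states of P(state) * payoff(decision, state)
ev = {d: sum(p * x for p, x in zip(probs, pay)) for d, pay in payoffs.items()}
best = max(ev, key=ev.get)  # EV(movie) ≈ $960,000 > EV(tv) = $900,000
```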

Decision Trees
• Three types of "nodes":
– Decision nodes, represented by squares (□)
– Chance nodes, represented by circles (○)
– Terminal nodes, represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree (figure): a decision node (square) branches into Decision 1 and Decision 2; each decision leads to a chance node (circle) that branches into Event 1, Event 2, and Event 3.

Jenny Lind Decision Tree (figure): a decision node with branches "Sign with Movie Co." and "Sign with TV Network"; each branch leads to a chance node with outcomes Small, Medium, and Large Box Office, paying $200,000 / $1,000,000 / $3,000,000 under the movie contract and $900,000 under the TV contract for every outcome.

Jenny Lind Decision Tree (figure, with probabilities): the same tree annotated with branch probabilities 0.3 (Small), 0.6 (Medium), and 0.1 (Large) at each chance node, with an expected return (ER) to be filled in at each node.

Jenny Lind Decision Tree – Solved (figure): folding the probabilities back through the tree gives ER = $960,000 for the movie contract and ER = $900,000 for the TV network, so the decision node takes the maximum, ER = $960,000.
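The right-to-left solution procedure generalizes: take expected values at chance nodes and maxima at decision nodes. A compact recursive sketch (the node encoding is our own, not from the slides):

```python
# A node is ("decision", {label: node}), ("chance", [(prob, node), ...]),
# or a terminal payoff (a plain number).

def solve(node):
    """Return (expected value, chosen labels) by folding the tree right to left."""
    if isinstance(node, (int, float)):        # terminal node: its payoff
        return node, []
    kind, branches = node
    if kind == "chance":                      # expected value over outcomes
        return sum(p * solve(child)[0] for p, child in branches), []
    # decision node: keep only the best branch ("pruning" the rest)
    values = {label: solve(child)[0] for label, child in branches.items()}
    best = max(values, key=values.get)
    return values[best], [best]

jenny = ("decision", {
    "movie": ("chance", [(0.3, 200_000), (0.6, 1_000_000), (0.1, 3_000_000)]),
    "tv":    ("chance", [(0.3, 900_000), (0.6, 900_000), (0.1, 900_000)]),
})
value, choice = solve(jenny)  # ≈ 960000, ["movie"]
```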

Results – Performance Evaluation Cycle (figure): dataset → data preprocessing → feature selection → selection of a data mining tool → classification → performance evaluation.

Evaluation Metrics (confusion matrix)

                     Predicted as healthy   Predicted as unhealthy
Actual healthy       TP                     FN
Actual unhealthy     FP                     TN

Cross-validation
• Correctly classified instances: 143 (95.3%)
• Incorrectly classified instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
– Split the data into 10 equal-sized pieces
– Train on 9 pieces and test on the remainder
– Do this for all 10 possibilities and average the results
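The three steps above can be sketched in pure Python (a minimal illustration of the splitting logic, not the Weka implementation; the function names and the dummy evaluator are ours):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split n sample indices into k roughly equal folds (shuffled once)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n, k, evaluate):
    """For each fold: train on the other k-1 folds, test on the held-out fold.
    `evaluate(train, test)` must return an accuracy in [0, 1]; we average."""
    folds = kfold_indices(n, k)
    scores = []
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k

# e.g. with 150 instances and a dummy evaluator that ignores the data:
acc = cross_validate(150, 10, lambda train, test: 0.953)
```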


A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the comparison results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.

Introduction
• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
– Are usually not harmful
– Rarely invade the tissues around them
– Don't spread to other parts of the body
– Can be removed and usually don't grow back
• Malignant tumors:
– May be a threat to life
– Can invade nearby organs and tissues (such as the chest wall)
– Can spread to other parts of the body
– Often can be removed but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (Cont)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used RepTree, RBF Network, and Simple Logistic to predict survivability for breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (Cont)
• Bellaachia et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), the results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict survivability for heart disease patients.

BACKGROUND (Cont)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure–activity relationships in chemometrics for the pharmaceutical industry.

BACKGROUND (Cont)
• Dr. S. Vijayarani et al. analyzed the performance of different classification functions in data mining for predicting heart disease from a heart disease dataset. The performance factors used to compare the efficiency of the algorithms are accuracy and error rate. The results show that the Logistic classification function is more efficient than Multilayer Perceptron and Sequential Minimal Optimization.

BACKGROUND (Cont)
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Source: the UC Irvine Machine Learning Repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution (699 instances): benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances are excluded, the percentages above do not hold for the 683-instance dataset; the correct distribution is benign 444 (65%) and malignant 239 (35%).
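The corrected class distribution is a one-line arithmetic check (a quick recomputation of the slide's numbers, nothing more):

```python
# Recompute the class distribution after dropping the 16 instances with
# missing values (2 malignant, 14 benign), as noted on the slide.
benign, malignant = 458 - 14, 241 - 2      # 444 and 239
total = benign + malignant                 # 683
pct_benign = 100 * benign / total          # ≈ 65.0
pct_malignant = 100 * malignant / total    # ≈ 35.0
```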

Attribute                     Domain
Sample Code Number            ID number
Clump Thickness               1–10
Uniformity of Cell Size       1–10
Uniformity of Cell Shape      1–10
Marginal Adhesion             1–10
Single Epithelial Cell Size   1–10
Bare Nuclei                   1–10
Bland Chromatin               1–10
Normal Nucleoli               1–10
Mitoses                       1–10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
• We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• Weka is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• Weka is open-source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS (Weka output screenshots; figures only)

importance of the input variables

Number of instances taking each attribute value (1–10):

Attribute                      1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree   IBK     SMO
Time to build model (sec)             0.97     0.02    0.33
Correctly classified instances         652      655     657
Incorrectly classified instances        31       28      26
Accuracy (%)                         95.46    95.90   96.19

EXPERIMENTAL RESULTS
• Sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• Specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• Accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positives (TP): the number of positive samples correctly predicted.
• False negatives (FN): the number of positive samples wrongly predicted.
• False positives (FP): the number of negative samples wrongly predicted as positive.
• True negatives (TN): the number of negative samples correctly predicted.
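These definitions can be checked directly against SMO's confusion matrix from the results (431 benign correctly predicted, 13 misclassified each way, 226 malignant correctly predicted, taking benign as the positive class); a minimal sketch:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy, as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO's confusion matrix, benign as the positive class:
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
# acc ≈ 0.9619, matching the 96.19% reported for SMO
```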

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075     0.960      0.971    Benign
              0.925     0.029     0.944      0.925    Malignant
IBK           0.980     0.079     0.958      0.980    Benign
              0.921     0.020     0.961      0.921    Malignant
SMO           0.971     0.054     0.971      0.971    Benign
              0.946     0.029     0.946      0.946    Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree            431                  13            Benign
                    18                 221            Malignant
IBK                435                   9            Benign
                    19                 220            Malignant
SMO                431                  13            Benign
                    13                 226            Malignant

importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio    Average   Rank
Clump Thickness                 378.08158     0.464       0.152      126.23253    8
Uniformity of Cell Size         539.79308     0.702       0.300      180.26503    1
Uniformity of Cell Shape        523.07097     0.677       0.272      174.67332    2
Marginal Adhesion               390.05950     0.464       0.210      130.24450    7
Single Epithelial Cell Size     447.86118     0.534       0.233      149.54273    5
Bare Nuclei                     489.00953     0.603       0.303      163.30518    3
Bland Chromatin                 453.20971     0.555       0.201      151.32190    4
Normal Nucleoli                 416.63061     0.487       0.237      139.11820    6
Mitoses                         191.96820     0.212       0.212       64.12273    9
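The Average column is the mean of each attribute's three criterion scores, and the rank orders attributes by that average. A short sketch of that arithmetic (our reconstruction of the table's computation, not code from the paper):

```python
# Average each attribute's chi-squared, info-gain and gain-ratio scores,
# then rank attributes by that average (highest first).
scores = {                      # (chi-squared, info gain, gain ratio)
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}
average = {attr: sum(vals) / len(vals) for attr, vals in scores.items()}
ranking = sorted(average, key=average.get, reverse=True)
# ranking[0] is "Uniformity of Cell Size", the attribute the paper singles
# out as most important; "Mitoses" comes last.
```

Note that the average is dominated by the chi-squared score, since the three criteria are on very different scales.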

CONCLUSION
• The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• SMO performs best among the compared classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka
• Use another data mining tool
• Use alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers," International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[4] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[18] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[19] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

Page 63: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

AAST-Comp eng 6404072023

Jenny Lind - Payoff Table

Decisions

States of Nature

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Company $200000 $1000000 $3000000

Sign with TV Network $900000 $900000 $900000

PriorProbabilities

03 06 01

Using Expected Return Criteria

EVmovie=03(200000)+06(1000000)+01(3000000)

= $960000 = EVUII or EVBest

EVtv =03(900000)+06(900000)+01(900000)

= $900000

Therefore using this criteria Jenny should select the movie contract

04072023 AAST-Comp eng 65

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
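These definitions translate directly into code. The sketch below (my illustration, not from the paper) applies them to the SMO confusion matrix reported later, taking benign as the positive class:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy as defined above."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# SMO confusion matrix from the slides: 431 benign correct, 13 benign missed,
# 13 malignant misclassified as benign, 226 malignant correct.
m = metrics(tp=431, fn=13, fp=13, tn=226)
```

The resulting accuracy, 657/683 ≈ 96.19%, matches the results table.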

EXPERIMENTAL RESULTS
Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS
Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree      431                13                    Benign
             18                 221                   Malignant
IBK          435                9                     Benign
             19                 220                   Malignant
SMO          431                13                    Benign
             13                 226                   Malignant
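The per-class precision and recall reported two slides earlier follow from these confusion matrices. A sketch (mine, not from the paper), reading each matrix with actual classes as rows and predicted classes as columns:

```python
# Confusion matrices from the slide:
# [[benign->benign, benign->malignant], [malignant->benign, malignant->malignant]]
matrices = {
    "BF Tree": [[431, 13], [18, 221]],
    "IBK":     [[435, 9],  [19, 220]],
    "SMO":     [[431, 13], [13, 226]],
}

def per_class(m):
    """Recall = correct / actual in class; precision = correct / predicted as class."""
    (bb, bm), (mb, mm) = m
    return {
        "Benign":    {"recall": bb / (bb + bm), "precision": bb / (bb + mb)},
        "Malignant": {"recall": mm / (mb + mm), "precision": mm / (bm + mm)},
    }

table = {name: per_class(m) for name, m in matrices.items()}
```

The recomputed values round to the figures in the precision/recall table (e.g. IBK benign recall 435/444 ≈ 0.98).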

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance Rank
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.05950     0.464       0.210        130.244500   7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.96820     0.212       0.212        64.122733    9
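The Average column is the mean of the three scores, and the importance rank orders attributes by that mean. A sketch (values as printed on the slide; the recomputed averages agree with the printed ones up to small rounding differences):

```python
# (chi-squared, info gain, gain ratio) per attribute, from the table above.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

average = {k: sum(v) / 3 for k, v in scores.items()}
# Rank attributes from most to least important (rank 1 = highest average).
ranked = sorted(average, key=average.get, reverse=True)
```

The ordering reproduces the slide's ranks, with Uniformity of Cell Size first and Mitoses last.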


CONCLUSION
The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
SMO shows the highest performance compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions


Comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and made a fusion between classifiers.

References
[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon I.A.f.R.o.C. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection". Proceedings of computer vision and pattern recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the international conference on engineering applications of neural networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 64: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Using Expected Return Criteria
EV(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000)
          = $960,000 = EV(UII), or EV(Best)
EV(tv)    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)
          = $900,000
Therefore, using this criterion, Jenny should select the movie contract.

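The expected-value arithmetic above can be sketched in a few lines of Python:

```python
# Probabilities of small / medium / large box office, from the payoff table.
probs = (0.3, 0.6, 0.1)

# Payoffs for each alternative under the three states of nature.
payoffs = {
    "movie": (200_000, 1_000_000, 3_000_000),
    "tv":    (900_000, 900_000, 900_000),
}

# Expected value of each alternative, and the best choice under EV criteria.
ev = {name: sum(p * x for p, x in zip(probs, v)) for name, v in payoffs.items()}
best = max(ev, key=ev.get)
```

This yields $960,000 for the movie contract versus $900,000 for TV, so the movie branch wins.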

Decision Trees
• Three types of "nodes":
  - Decision nodes - represented by squares
  - Chance nodes - represented by circles (Ο)
  - Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions at decision nodes, and finding expected values of all possible states of nature at chance nodes.
• Create the tree from left to right.
• Solve the tree from right to left.

Example Decision Tree
[Diagram: a decision node (square) branches into Decision 1 and Decision 2; a chance node (circle) branches into Event 1, Event 2, and Event 3.]

Jenny Lind Decision Tree
[Diagram: "Sign with Movie Co" leads to a chance node with Small ($200,000), Medium ($1,000,000), and Large ($3,000,000) box-office outcomes; "Sign with TV Network" leads to a chance node paying $900,000 for all three outcomes.]

Jenny Lind Decision Tree
[The same diagram annotated with outcome probabilities 0.3 (small), 0.6 (medium), and 0.1 (large), and an ER (expected return) label at each chance node.]

Jenny Lind Decision Tree - Solved
[Solved diagram: ER(movie) = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = $960,000; ER(TV) = $900,000; the movie branch is selected.]

Results
[Performance evaluation cycle: Dataset → Data preprocessing → Feature selection → Selection of data mining tool → Classification → Performance evaluation]

Evaluation Metrics
                      Predicted as healthy   Predicted as unhealthy
Actual healthy        tp                     fn
Actual not healthy    fp                     tn

Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
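The fold-splitting step can be sketched as follows (a toy illustration of the procedure; Weka performs this internally):

```python
def ten_fold_indices(n, k=10):
    """Split range(n) into k contiguous, near-equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = ten_fold_indices(150)          # e.g. a 150-instance dataset
for test in folds:                     # hold out one fold per round
    train = [j for f in folds if f is not test for j in f]
    # ... train the classifier on `train`, evaluate on `test` ...

# Pooling per-fold predictions: e.g. 143 of 150 correct overall.
overall_accuracy = round(100 * 143 / 150, 2)
```

With 150 instances, each fold holds 15 instances, and 143 correct predictions pooled over the folds gives the 95.33% above.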


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
The aim of this paper is to investigate the performance of different classification techniques.
The goal is to develop accurate prediction models for breast cancer using data mining techniques.
Three classification techniques are compared in the Weka software, and the comparison results are reported.
Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.


Introduction
Breast cancer is on the rise across developing nations, due to the increase in life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods


Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese


BACKGROUND
Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show a good agreement with actual survival.
Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
Liu Ya-Qin experimented on breast cancer data using the C5.0 algorithm with bagging to predict breast cancer survivability.

BACKGROUND
Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict the survivability of breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: one for patients who survived more than 5 years, and the other for patients who died before 5 years.
Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND
Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND
Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND
Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. This algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.
B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
From the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances.
We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
Note: 2 malignant and 14 benign instances are excluded, so those percentages no longer hold for the cleaned set; the right ones are benign 444 (65%) and malignant 239 (35%).
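The corrected percentages in the note are easy to verify (a sketch of mine, not from the paper):

```python
# Class counts before and after removing the 16 instances with missing values.
raw   = {"benign": 458, "malignant": 241}   # 699 instances
clean = {"benign": 444, "malignant": 239}   # 683 instances (2 malignant, 14 benign removed)

def pct(d):
    """Class distribution as percentages, rounded to one decimal."""
    n = sum(d.values())
    return {k: round(100 * v / n, 1) for k, v in d.items()}
```

`pct(raw)` gives 65.5% / 34.5%, and `pct(clean)` gives 65.0% / 35.0%, confirming the slide's correction.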


Attribute                    Domain
Sample Code Number           Id Number
Clump Thickness              1 - 10
Uniformity of Cell Size      1 - 10
Uniformity of Cell Shape     1 - 10
Marginal Adhesion            1 - 10
Single Epithelial Cell Size  1 - 10
Bare Nuclei                  1 - 10
Bland Chromatin              1 - 10
Normal Nucleoli              1 - 10
Mitoses                      1 - 10
Class                        2 for benign, 4 for malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 65: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Decision Treesbull Three types of ldquonodesrdquo

ndash Decision nodes - represented by squares ()ndash Chance nodes - represented by circles (Ο)ndash Terminal nodes - represented by triangles (optional)

bull Solving the tree involves pruning all but the best decisions at decision nodes and finding expected values of all possible states of nature at chance nodes

bull Create the tree from left to right bull Solve the tree from right to left

04072023 AAST-Comp eng 66

Example Decision Tree

Decision node

Chance node

Decision 1

Decision 2

Event 1

Event 2

Event 3

04072023 AAST-Comp eng 67

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND
• Dr. S. Vijayarani et al. analysed the performance of different classification techniques in data mining for predicting heart disease from the heart disease dataset. The classification algorithms were applied and tested, using accuracy and error rate as the performance factors. The results show that the Logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.


BACKGROUND
• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B. S. Harish et al. presented various text representation schemes and compared the classifiers used to assign text documents to predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.


BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository; data from University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: Benign: 458 (65.5%), Malignant: 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, those percentages no longer apply; for the 683-instance dataset the distribution is Benign: 444 (65%), Malignant: 239 (35%).
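The missing-value cleanup described above can be sketched in a few lines of Python. The rows below are illustrative instances in the UCI file format (comma-separated: sample code number, nine integer attributes, class label; '?' marks a missing attribute value, as in the 16 excluded instances):

```python
import csv
from io import StringIO

# Three illustrative rows; the second is a hypothetical incomplete instance.
sample = StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
    "1018561,2,1,2,1,2,1,3,1,1,2\n"
)

rows = list(csv.reader(sample))
complete = [r for r in rows if "?" not in r]  # keep only fully observed instances
print(f"kept {len(complete)} of {len(rows)} instances")
```

Applied to the full 699-instance file, the same filter leaves the 683 complete instances used in the experiments.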


Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant


EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
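One of the Weka classifiers evaluated below, IBk, is a k-nearest-neighbour learner. As a rough stdlib-Python sketch of the underlying idea (toy two-attribute data, not the paper's experiment):

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k nearest neighbours (Euclidean distance).
    `train` is a list of (feature_vector, label) pairs."""
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy instances labelled 2 (benign) or 4 (malignant)
train = [((1, 1), 2), ((2, 1), 2), ((1, 2), 2),
         ((8, 9), 4), ((9, 8), 4), ((9, 9), 4)]
print(knn_predict(train, (2, 2)))  # lies in the benign cluster -> 2
print(knn_predict(train, (8, 8)))  # lies in the malignant cluster -> 4
```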

EXPERIMENTAL RESULTS
[Weka output screenshots omitted]

importance of the input variables


Attribute / Domain              1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness               139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion             393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size    44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                   402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin               150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15   60   683
Mitoses                       563   35   33   12    6    3    9    8    0   14   683
Sum                          2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree    IBK     SMO
Time to Build Model (sec)             0.97     0.02    0.33
Correctly Classified Instances         652      655     657
Incorrectly Classified Instances        31       28      26
Accuracy (%)                         95.46    95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
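As a quick check, these formulas reproduce the reported accuracy figures when applied to the classifiers' confusion matrices. A minimal stdlib-Python sketch, taking malignant as the positive class:

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy, as defined above."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# (TP, FN, FP, TN) counts taken from the confusion matrices,
# with malignant treated as the positive class.
confusion = {
    "BF Tree": (221, 18, 13, 431),
    "IBK":     (220, 19,  9, 435),
    "SMO":     (226, 13, 13, 431),
}
for name, (tp, fn, fp, tn) in confusion.items():
    sens, spec, acc = metrics(tp, fn, fp, tn)
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.2%}")
```

This yields accuracies of 95.46%, 95.90% and 96.19%, matching the table above.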

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075     0.96       0.971    Benign
              0.925     0.029     0.944      0.925    Malignant
IBK           0.98      0.079     0.958      0.98     Benign
              0.921     0.02      0.961      0.921    Malignant
SMO           0.971     0.054     0.971      0.971    Benign
              0.946     0.029     0.946      0.946    Malignant

EXPERIMENTAL RESULTS (confusion matrices)

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree            431                  13              Benign
                    18                 221              Malignant
IBK                435                   9              Benign
                    19                 220              Malignant
SMO                431                  13              Benign
                    13                 226              Malignant

importance of the input variables


Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness                378.08158      0.464       0.152       126.232526        8
Uniformity of Cell Size        539.79308      0.702       0.3         180.265026        1
Uniformity of Cell Shape       523.07097      0.677       0.272       174.67332         2
Marginal Adhesion              390.0595       0.464       0.21        130.2445          7
Single Epithelial Cell Size    447.86118      0.534       0.233       149.542726        5
Bare Nuclei                    489.00953      0.603       0.303       163.305176        3
Bland Chromatin                453.20971      0.555       0.201       151.32190         4
Normal Nucleoli                416.63061      0.487       0.237       139.11820         6
Mitoses                        191.9682       0.212       0.212        64.122733        9
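For intuition, information gain (one of the three ranking criteria in the table) can be computed directly from class counts. A stdlib-Python sketch using a hypothetical binary split of the 444 benign / 239 malignant instances (the counts are illustrative, not the table's values):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(parent, splits):
    """Entropy reduction from splitting `parent` into the groups in `splits`."""
    total = sum(parent)
    remainder = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(parent) - remainder

parent = [444, 239]                 # benign / malignant
splits = [[430, 30], [14, 209]]     # hypothetical attribute test
print(round(info_gain(parent, splits), 3))
```

A split that separates the classes well, like the one above, retains little class entropy in its branches and therefore scores a high gain.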


CONCLUSION
• The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques


Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions


Comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea, making a fusion between classifiers.


References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press; 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N., The Nature of Statistical Learning Theory, 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you



Example Decision Tree
[Diagram: a decision node branches into Decision 1 and Decision 2; each decision leads to a chance node with outcomes Event 1, Event 2 and Event 3.]

Jenny Lind Decision Tree
[Diagram: "Sign with Movie Co." leads to a chance node with payoffs $200,000 (Small Box Office), $1,000,000 (Medium) and $3,000,000 (Large); "Sign with TV Network" leads to a chance node paying $900,000 for every box-office outcome.]

Jenny Lind Decision Tree (with probabilities)
[Diagram: the same tree with probabilities 0.3 (Small), 0.6 (Medium) and 0.1 (Large) on each chance branch, and an expected return (ER) to be computed at each chance node.]

Jenny Lind Decision Tree - Solved
[Diagram: with probabilities 0.3 / 0.6 / 0.1, "Sign with Movie Co." gives ER = 0.3 × $200,000 + 0.6 × $1,000,000 + 0.1 × $3,000,000 = $960,000, while "Sign with TV Network" gives ER = $900,000. The best decision is to sign with the movie company, with ER = $960,000.]
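The expected returns in the solved tree can be verified with a few lines of stdlib Python:

```python
def expected_return(outcomes):
    """Expected value of a chance node given (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)

movie = [(0.3, 200_000), (0.6, 1_000_000), (0.1, 3_000_000)]
tv    = [(0.3, 900_000), (0.6, 900_000), (0.1, 900_000)]

er_movie = expected_return(movie)  # 0.3*200k + 0.6*1M + 0.1*3M = $960,000
er_tv    = expected_return(tv)     # $900,000 regardless of box office
best = max([("Sign with Movie Co.", er_movie), ("Sign with TV Network", er_tv)],
           key=lambda choice: choice[1])
print(best)
```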

Results
[Diagram: performance-evaluation cycle - dataset → data preprocessing → feature selection → classification with the selected data mining tool → performance evaluation.]

Evaluation Metrics

                      Predicted as healthy   Predicted as unhealthy
Actual healthy                TP                      FN
Actual not healthy            FP                      TN

Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default: 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Do this for all possibilities and average
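The three splitting steps above can be sketched in plain Python (stdlib only; `train_and_score` is a hypothetical callback standing in for any classifier's train-then-evaluate step):

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle instance indices 0..n-1 and deal them into 10 near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

def cross_validate(n, train_and_score):
    """10-fold CV: train_and_score(train_idx, test_idx) returns one fold's accuracy."""
    folds = ten_fold_indices(n)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other 9 folds, test on the held-out fold.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)  # average accuracy over the 10 runs

sizes = [len(fold) for fold in ten_fold_indices(683)]
print(sizes)  # each fold holds 68 or 69 of the 683 instances
```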


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the results are reported.
• Sequential Minimal Optimization (SMO) has higher prediction accuracy than the IBK and BF Tree methods.


Introduction
• Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
• Benign tumors:
  - Are usually not harmful
  - Rarely invade the tissues around them
  - Don't spread to other parts of the body
  - Can be removed and usually don't grow back
• Malignant tumors:
  - May be a threat to life
  - Can invade nearby organs and tissues (such as the chest wall)
  - Can spread to other parts of the body
  - Often can be removed, but sometimes grow back


Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 67: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

04072023 AAST-Comp eng 68

Jenny Lind Decision Tree

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER

ER

ER

04072023 AAST-Comp eng 69

Jenny Lind Decision Tree - Solved

Small Box Office

Medium Box Office

Large Box Office

Small Box Office

Medium Box Office

Large Box Office

Sign with Movie Co

Sign with TV Network

$200000

$1000000

$3000000

$900000

$900000

$900000

3

6

1

3

6

1

ER900000

ER960000

ER960000

04072023 AAST-Comp eng 70

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

Introduction

Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.

Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (Cont)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict survivability for breast cancer patients.
• Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (Cont)

• Bellaachia et al. used naive Bayes, a decision tree, and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), the results were not significant because they divided the data set into only two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict survivability for heart disease patients.

BACKGROUND (Cont)

• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and a decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (Cont)

• Dr. S. Vijayarani et al. analyzed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used to analyze the efficiency of the algorithms are classification accuracy and error rate. The results show that the logistic classification function is more efficient than the multilayer perceptron and sequential minimal optimization.

BACKGROUND (Cont)

• Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods were compared and contrasted on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository; data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances.
• We removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages no longer hold; the correct distribution is Benign 444 (65%) and Malignant 239 (35%).
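The missing-value step above can be sketched as a simple row filter (a minimal illustration assuming the UCI file layout, where missing Bare Nuclei values are encoded as "?"; the two sample rows below are for illustration only):

```python
# Drop rows containing a missing field ("?"), as the slides describe:
# applied to the real file this takes 699 instances down to 683.
def drop_missing(rows):
    """Keep only rows with no '?' field."""
    return [r for r in rows if "?" not in r]

sample = [
    ["1000025", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["1057013", "8", "4", "5", "1", "2", "?", "7", "3", "1", "4"],  # missing Bare Nuclei
]
clean = drop_missing(sample)  # only the first row survives
```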

Attribute                      Domain
-------------------------------------------------------------
Sample Code Number             Id Number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for Benign, 4 for Malignant

EVALUATION METHODS

• We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.


importance of the input variables

Value counts per attribute (domain values 1-10):

Attribute                      1    2    3    4    5    6    7    8    9   10  Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69  683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67  683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58  683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55  683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31  683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132  683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20  683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60  683
Mitoses                      563   35   33   12    6    3    9    8    0   14  683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                    BF Tree   IBK     SMO
Time to build model (in sec)           0.97      0.02    0.33
Correctly classified instances         652       655     657
Incorrectly classified instances       31        28      26
Accuracy (%)                           95.46     95.90   96.19

EXPERIMENTAL RESULTS

The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).

• True positive (TP) = number of positive samples correctly predicted
• False negative (FN) = number of positive samples wrongly predicted
• False positive (FP) = number of negative samples wrongly predicted as positive
• True negative (TN) = number of negative samples correctly predicted
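The three formulas can be applied directly to the SMO confusion matrix reported below (431 benign correctly predicted, 13 misclassified each way, 226 malignant correctly predicted), as a quick consistency check against the accuracy table:

```python
# Sensitivity, specificity, and accuracy from raw confusion-matrix counts.
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)                # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO counts from the slides, treating Benign as the positive class.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(acc * 100, 2))  # 96.19, matching the reported SMO accuracy
```

The per-class rates also match the earlier table: sensitivity ≈ 0.971 (Benign TP rate) and specificity ≈ 0.946 (Malignant TP rate).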

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.96        0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.98      0.079     0.958       0.98     Benign
             0.921     0.02      0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows are the actual class):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

importance of the input variables

Variable                     Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness              378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size      539.79308     0.702       0.300        180.265026     1
Uniformity of Cell Shape     523.07097     0.677       0.272        174.673323     2
Marginal Adhesion            390.0595      0.464       0.210        130.2445       7
Single Epithelial Cell Size  447.86118     0.534       0.233        149.542726     5
Bare Nuclei                  489.00953     0.603       0.303        163.305176     3
Bland Chromatin              453.20971     0.555       0.201        151.321903     4
Normal Nucleoli              416.63061     0.487       0.237        139.118203     6
Mitoses                      191.9682      0.212       0.212        64.122733      9
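The "Average Rank" column is simply the mean of the three criteria per attribute, with attributes then ordered by that mean (note the mean is dominated by the chi-squared values, which are two orders of magnitude larger than the other scores). A minimal sketch on three of the nine rows, values copied from the table above:

```python
# Combine chi-squared, info gain, and gain ratio into one average score,
# then rank attributes by it (higher average = more important).
scores = {
    "Uniformity of Cell Size": (539.79308, 0.702, 0.300),
    "Clump Thickness":         (378.08158, 0.464, 0.152),
    "Mitoses":                 (191.96820, 0.212, 0.212),
}
average = {name: sum(vals) / len(vals) for name, vals in scores.items()}
ranking = sorted(average, key=average.get, reverse=True)
print(ranking[0])  # Uniformity of Cell Size, the top-ranked attribute
```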

CONCLUSION

• The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka
• Use another data mining tool
• Use alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers," International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea, making a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD, W. Nick Street, PhD, Dennis M. Heisey, PhD, Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you



  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 69: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Jenny Lind Decision Tree - Solved

Sign with Movie Co.:
  Small box office  (p = 0.3): $200,000
  Medium box office (p = 0.6): $1,000,000
  Large box office  (p = 0.1): $3,000,000
  ER = $960,000

Sign with TV Network:
  Small box office  (p = 0.3): $900,000
  Medium box office (p = 0.6): $900,000
  Large box office  (p = 0.1): $900,000
  ER = $900,000

Best alternative: sign with the movie company (ER = $960,000 > $900,000).
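The expected returns in the solved tree can be verified with a short calculation (a minimal sketch using the payoffs and probabilities shown above):

```python
# Expected return for each decision alternative in the Jenny Lind example.
probabilities = [0.3, 0.6, 0.1]                     # small, medium, large box office
movie_payoffs = [200_000, 1_000_000, 3_000_000]     # sign with Movie Co.
tv_payoffs = [900_000, 900_000, 900_000]            # sign with TV Network

def expected_return(payoffs, probs):
    """Probability-weighted average payoff."""
    return sum(p * x for p, x in zip(probs, payoffs))

er_movie = expected_return(movie_payoffs, probabilities)  # ~ $960,000
er_tv = expected_return(tv_payoffs, probabilities)        # ~ $900,000
best = "Sign with Movie Co." if er_movie > er_tv else "Sign with TV Network"
```

Under the expected-return criterion the movie contract wins, even though it carries the risk of the $200,000 outcome.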

Results

(Figure: the performance-evaluation cycle - dataset → data preprocessing → feature selection → data mining tool selection → classification → performance evaluation.)

Evaluation Metrics

                      Predicted as healthy   Predicted as unhealthy
Actual healthy        TP                     FN
Actual not healthy    FP                     TN

Cross-validation

• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  - Split the data into 10 equal-sized pieces
  - Train on 9 pieces and test on the remainder
  - Repeat for all 10 possibilities and average the results
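The fold-splitting step can be sketched in a few lines of plain Python (an illustration of the procedure, not the Weka implementation; 150 instances matches the 143 + 7 counts on this slide):

```python
def k_fold_indices(n_instances, k=10):
    """Split instance indices into k near-equal contiguous folds."""
    indices = list(range(n_instances))
    fold_size, remainder = divmod(n_instances, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

# 150 instances -> ten folds of 15 instances each.
folds = k_fold_indices(150, k=10)
for test_fold in folds:
    train = [i for f in folds if f is not test_fold for i in f]
    # train the classifier on `train`, evaluate on `test_fold`,
    # then average the ten accuracies
```

In practice the data is shuffled (and usually stratified by class) before splitting, which is what Weka's default 10-fold cross-validation does.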

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
 The aim of this paper is to investigate the performance of different classification techniques and to develop accurate prediction models for breast cancer using data mining.
 Three classification techniques are compared in the Weka software and the comparison results are reported.
 Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.

Introduction
 Breast cancer is on the rise in developing nations due to increased life expectancy and lifestyle changes such as women having fewer children.
 Benign tumors:
 • Are usually not harmful
 • Rarely invade the tissues around them
 • Don't spread to other parts of the body
 • Can be removed and usually don't grow back
 Malignant tumors:
 • May be a threat to life
 • Can invade nearby organs and tissues (such as the chest wall)
 • Can spread to other parts of the body
 • Often can be removed, but sometimes grow back

Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
 Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
 Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict the survivability of breast cancer patients.
 Liu Ya-Qin experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
 Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into two groups: patients who survived more than 5 years and patients who died before 5 years.
 Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict the survivability of heart disease patients.

BACKGROUND (cont.)
 Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict the survivability of heart disease patients.
 Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.
 Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in chemometrics for the pharmaceutical industry.

BACKGROUND (cont.)
 Dr. S. Vijayarani et al. analyzed the performance of different classification techniques in data mining for predicting heart disease from a heart disease dataset. The performance factors used to analyze the efficiency of the algorithms are accuracy and error rate. The results show that the Logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
 Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
 B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

 Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
 2 classes (malignant and benign) and 9 integer-valued attributes.
 The breast-cancer-Wisconsin dataset has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
 Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
 Note: 2 malignant and 14 benign instances were excluded, so the stated percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).
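The removal of the 16 incomplete instances can be sketched with plain Python. The rows below follow the UCI file layout (11 comma-separated fields, with '?' marking a missing value); they are illustrative, not the full file:

```python
# Minimal sketch: keep only complete instances from rows in the
# breast-cancer-wisconsin format (id + 9 attributes + class label).
sample_rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",     # complete instance, class 2 (benign)
    "1057013,8,4,5,1,2,?,7,3,1,4",     # Bare Nuclei missing -> excluded
    "1017122,8,10,10,8,7,10,9,7,1,4",  # complete instance, class 4 (malignant)
]

complete = [row for row in sample_rows if "?" not in row]
print(len(complete))  # 2 of the 3 sample rows are kept
```

Applied to the full file, the same filter drops exactly the 16 rows whose Bare Nuclei value is missing, leaving 683 instances.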

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS
 We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
 WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
 WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
 It is also well suited for developing new machine learning schemes.
 WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables - distribution of attribute values over the domain 1-10:

Domain                        1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness             139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size     373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape    346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion           393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size  44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                 402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin             150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli             432   36   42   18   19   22   16   23   15   60   683
Mitoses                     563   35   33   12    6    3    9    8    0   14   683
Sum                        2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                    BF Tree   IBK     SMO
Time to build model (sec)              0.97      0.02    0.33
Correctly classified instances         652       655     657
Incorrectly classified instances       31        28      26
Accuracy (%)                           95.46     95.90   96.19
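The accuracy figures follow directly from the instance counts over the 683 preprocessed instances; a quick check:

```python
total = 683  # instances remaining after removing the 16 with missing values
correct = {"BF Tree": 652, "IBK": 655, "SMO": 657}

# accuracy (%) = correctly classified / total instances
accuracy = {name: round(100 * c / total, 2) for name, c in correct.items()}
print(accuracy)  # {'BF Tree': 95.46, 'IBK': 95.9, 'SMO': 96.19}
```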

EXPERIMENTAL RESULTS
 The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
 The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
 The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
 True positive (TP) = number of positive samples correctly predicted.
 False negative (FN) = number of positive samples wrongly predicted.
 False positive (FP) = number of negative samples wrongly predicted as positive.
 True negative (TN) = number of negative samples correctly predicted.
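Applying these definitions to the SMO confusion matrix reported in the results (with benign treated as the positive class) reproduces the per-class figures:

```python
# SMO confusion matrix, benign = positive class:
tp, fn = 431, 13   # benign correctly classified / misclassified as malignant
fp, tn = 13, 226   # malignant misclassified as benign / correctly classified

sensitivity = tp / (tp + fn)                # TPR = 431/444
specificity = tn / (tn + fp)                # TNR = 226/239
accuracy = (tp + tn) / (tp + fp + tn + fn)  # 657/683

print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 4))
# 0.971 0.946 0.9619
```

The 0.971 and 0.946 values match the SMO recall rows of the precision/recall table, and 96.19% matches the accuracy comparison.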

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.96        0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.98    0.079   0.958       0.98     Benign
             0.921   0.02    0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average       Importance
Clump Thickness               378.08158     0.464       0.152        126.232526    8
Uniformity of Cell Size       539.79308     0.702       0.3          180.265026    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323    2
Marginal Adhesion             390.0595      0.464       0.21         130.2445      7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726    5
Bare Nuclei                   489.00953     0.603       0.303        163.305176    3
Bland Chromatin               453.20971     0.555       0.201        151.321903    4
Normal Nucleoli               416.63061     0.487       0.237        139.118203    6
Mitoses                       191.9682      0.212       0.212        64.122733     9
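The Info Gain scores in this table measure entropy reduction when splitting on an attribute. The computation can be sketched on a toy attribute split (illustrative counts, not the paper's data):

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent_counts, splits):
    """Parent entropy minus the size-weighted entropy of the split groups."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(parent_counts) - weighted

# Toy example: 8 benign + 8 malignant instances, partitioned by an
# attribute value into two pure groups.
gain = info_gain([8, 8], [[8, 0], [0, 8]])
print(round(gain, 3))  # 1.0 for a perfectly separating attribute
```

An attribute that leaves the class mixture unchanged scores 0; the ranking above says Uniformity of Cell Size comes closest to a clean separation.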

CONCLUSION
 The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
 We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
 The performance of SMO is high compared with the other classifiers.
 The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
 • Using an updated version of Weka
 • Using another data mining tool
 • Using alternative algorithms and techniques

Notes on paper
 • Spelling mistakes
 • No point of contact (e-mail)
 • Wrong percentage calculation
 • Copying from old papers
 • Charts not clear
 • No contributions

Comparison
 "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277 - 0764), Volume 01, Issue 01, September 2012.
 That paper introduced a more advanced idea and fused multiple classifiers.

References

[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S.P. Rajagopalan, and L.V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D., "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street, W.N., Wolberg, W.H., Mangasarian, O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3): 305-313.
[17] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 70: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Results

Data preprocessing

Feature selection

Classification

Selection tool data mining

Performance evaluation Cycle

Dataset

Evaluation Metrics

Predicted as healthy Predicted as unhealthy

Actual healthy tp fn

Actual not healthy fp tn

AAST-Comp eng 7204072023

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

Page 71: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Evaluation Metrics

                     Predicted as healthy   Predicted as unhealthy
Actual healthy               TP                       FN
Actual not healthy           FP                       TN

Cross-validation

• Correctly Classified Instances: 143 (95.3%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default 10-fold cross-validation, i.e.:
  – Split the data into 10 equal-sized pieces
  – Train on 9 pieces and test on the remainder
  – Repeat for all 10 possibilities and average the results
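The three steps above can be sketched in a few lines. The majority-vote "classifier" below is only a stand-in, since the paper's actual models (SMO, IBK, BF Tree) live inside Weka; the fold-splitting and averaging logic is the part being illustrated.

```python
# Minimal sketch of 10-fold cross-validation as described above.
# The majority-vote "model" is a placeholder for a real classifier.

def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k nearly equal-sized folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(labels, k=10):
    """Train on k-1 folds, test on the held-out fold, average the accuracy."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        majority = max(set(train), key=train.count)  # stand-in "model"
        correct = sum(labels[i] == majority for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Class counts from the cleaned Wisconsin dataset: 444 benign, 239 malignant.
labels = ["benign"] * 444 + ["malignant"] * 239
print(round(cross_validate(labels), 3))
```

Note that Weka additionally shuffles (and by default stratifies) the data before splitting, which this sketch omits.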

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Abstract
• The aim of this paper is to investigate the performance of different classification techniques.
• The goal is to develop accurate prediction models for breast cancer using data mining techniques.
• Three classification techniques are compared in the Weka software and the results are reported.
• Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.

Introduction
Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes such as women having fewer children.

Benign tumours:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumours:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back

Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race (white or black)
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (non-cancerous) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (cont.)
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND
• Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
• Vikas Chaurasia et al. used Representative Tree, RBF Network and Simple Logistic to predict survivability for breast cancer patients.
• Liu Ya-Qin et al. experimented on breast cancer data using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (cont.)
• Bellaachi et al. used naive Bayes, decision tree and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant because they divided the data set into just two groups: patients who survived more than 5 years and patients who died before 5 years.
• Vikas Chaurasia et al. used Naive Bayes and the J48 Decision Tree to predict survivability for heart disease patients.

BACKGROUND (cont.)
• Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict survivability for heart disease patients.
• Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
• Dong-Sheng Cao et al. proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND (cont.)
• Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were applied and tested in this work; the performance factors used for analysing the efficiency of the algorithms are accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (cont.)
• Kaewchinporn C. et al. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.
• B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY
• Obtained from the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• The breast-cancer-Wisconsin dataset has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, hence those percentages are wrong; the correct distribution is benign 444 (65%) and malignant 239 (35%).

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant
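The cleaning step described above (dropping the 16 instances with missing values) can be sketched as a simple filter; in the original data file, missing Bare Nuclei values are coded as "?". The three rows below are illustrative stand-ins for the real file contents.

```python
# Sketch of the preprocessing step: drop instances with missing values,
# which the breast-cancer-wisconsin data file codes as "?".
# The sample rows are illustrative (id, 9 attributes, class).

rows = [
    "1000025,5,1,1,1,2,1,3,1,1,2",    # complete instance (class 2 = benign)
    "1057013,8,4,5,1,2,?,7,3,1,4",    # missing Bare Nuclei -> dropped
    "1017122,8,10,10,8,7,10,9,7,1,4", # complete instance (class 4 = malignant)
]

clean = [r.split(",") for r in rows if "?" not in r]
print(len(clean))  # 2 instances remain
```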

EVALUATION METHODS
• We have used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

[Slides 88-89: result charts; no text content extracted]

importance of the input variables

Attribute value counts (number of instances with each value 1-10):

Attribute                      1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree    IBK     SMO
Time to build model (sec)             0.97     0.02    0.33
Correctly classified instances        652      655     657
Incorrectly classified instances       31       28      26
Accuracy (%)                         95.46    95.90   96.19

EXPERIMENTAL RESULTS
• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
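These definitions can be written as a small helper. The counts used in the demonstration are the SMO confusion-matrix values reported in the results (431 benign and 226 malignant correctly predicted, 13 misclassified each way), with benign treated as the positive class.

```python
# Sensitivity, specificity and accuracy from raw confusion-matrix counts,
# following the definitions above.

def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# SMO confusion matrix from the results, benign as the positive class.
sens, spec, acc = metrics(tp=431, fn=13, fp=13, tn=226)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.971 0.946 0.9619
```

The computed accuracy, 657/683 = 96.19%, matches the SMO accuracy in the comparison table above.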

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075      0.960      0.971   Benign
              0.925     0.029      0.944      0.925   Malignant
IBK           0.980     0.079      0.958      0.980   Benign
              0.921     0.020      0.961      0.921   Malignant
SMO           0.971     0.054      0.971      0.971   Benign
              0.946     0.029      0.946      0.946   Malignant

EXPERIMENTAL RESULTS

Confusion matrices:

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree            431                  13              Benign
                    18                 221              Malignant
IBK                435                   9              Benign
                    19                 220              Malignant
SMO                431                  13              Benign
                    13                 226              Malignant

importance of the input variables

Variable                     Chi-squared   Info Gain   Gain Ratio   Average       Importance Rank
Clump Thickness               378.08158      0.464       0.152      126.232526          8
Uniformity of Cell Size       539.79308      0.702       0.300      180.265026          1
Uniformity of Cell Shape      523.07097      0.677       0.272      174.67332           2
Marginal Adhesion             390.0595       0.464       0.210      130.2445            7
Single Epithelial Cell Size   447.86118      0.534       0.233      149.542726          5
Bare Nuclei                   489.00953      0.603       0.303      163.305176          3
Bland Chromatin               453.20971      0.555       0.201      151.32190           4
Normal Nucleoli               416.63061      0.487       0.237      139.11820           6
Mitoses                       191.9682       0.212       0.212       64.122733          9
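The Info Gain column in the ranking above measures IG(A) = H(class) − Σ_v p(v)·H(class | A = v). A minimal sketch of that measure is below; the toy data is illustrative, since the paper's values come from Weka's attribute evaluators on the full dataset.

```python
# Minimal sketch of the information-gain attribute measure:
# IG(A) = H(class) - sum over values v of p(v) * H(class | A = v).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# A perfectly predictive binary attribute recovers the full class entropy:
labels = ["benign", "benign", "malignant", "malignant"]
print(round(info_gain([1, 1, 10, 10], labels), 3))  # 1.0
```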

CONCLUSION
• The accuracy of the classification techniques was evaluated based on the selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO performs at a higher level than the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

comparison
• "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
• That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report. Lyon: International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 72: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Cross-validation

bull Correctly Classified Instances 143 953bull Incorrectly Classified Instances 7 467 bull Default 10-fold cross validation ie

ndash Split data into 10 equal sized piecesndash Train on 9 pieces and test on remainderndash Do for all possibilities and average

04072023 AAST-Comp eng 73

04072023AAST-Comp eng74

A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Abstract The aim of this paper is to investigate the

performance of different classification techniques

Aim is developing accurate prediction models for breast cancer using data mining techniques

Comparing three classification techniques in Weka software and comparison results

Sequential Minimal Optimization (SMO) has higher prediction accuracy than IBK and BF Tree methods

75

0407202376

Introduction Breast cancer is on the rise across developing nations due to the increase in life expectancy and lifestyle changes

such as women having fewer children Benign tumorsbull Are usually not harmfulbull Rarely invade the tissues around thembull Donlsquot spread to other parts of the bodybull Can be removed and usually donlsquot grow back Malignant tumorsbull May be a threat to lifebull Can invade nearby organs and tissues (such as the chest

wall)bull Can spread to other parts of the bodybull Often can be removed but sometimes grow back

AAST-Comp eng

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

Page 73: A Novel Approach for Breast Cancer Detection using Data Mining Techniques


A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Abstract
The aim of this paper is to investigate the performance of different classification techniques.
The goal is to develop accurate prediction models for breast cancer using data mining techniques.
Three classification techniques are compared using the Weka software, and the comparison results are reported.
Sequential Minimal Optimization (SMO) achieves higher prediction accuracy than the IBK and BF Tree methods.


Introduction
Breast cancer is on the rise across developing nations, due to increased life expectancy and lifestyle changes such as women having fewer children.
Benign tumors:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back
Malignant tumors:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back


Risk factors
• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue (denser breast tissue carries a higher risk)
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods



Risk factors
• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese


BACKGROUND
Bittern et al. used an artificial neural network to predict survivability for breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.
Vikas Chaurasia et al. used Representative Tree, RBF Network, and Simple Logistic to predict survivability for breast cancer patients.
Liu Ya-Qin experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.


BACKGROUND
Bellaachi et al. used naive Bayes, decision tree, and back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they split the data set into only two groups: patients who survived more than 5 years and patients who died before 5 years.
Vikas Chaurasia et al. used Naive Bayes and J48 Decision Tree to predict survivability for heart disease patients.


BACKGROUND
Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and decision table (DT) to predict survivability for heart disease patients.
Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the C4.5 decision tree algorithm.
Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in chemometrics for the pharmaceutical industry.


BACKGROUND
Dr. S. Vijayarani et al. analyzed the performance of different classification techniques in data mining for predicting heart disease from the heart disease dataset. The performance factors used to evaluate the efficiency of the algorithms are accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.


BACKGROUND
Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets, cardiotocography1 and cardiotocography2, and on other datasets not related to the medical domain.
B.S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.


BREAST-CANCER-WISCONSIN DATA SET SUMMARY
Obtained from the UC Irvine machine learning repository; the data come from the University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg.
2 classes (malignant and benign) and 9 integer-valued attributes.
The breast-cancer-Wisconsin dataset has 699 instances. We removed the 16 instances with missing values to construct a new dataset with 683 instances.
Class distribution as published: Benign: 458 (65.5%), Malignant: 241 (34.5%).
Note: 2 malignant and 14 benign instances were excluded, so those percentages are wrong; the correct distribution is Benign: 444 (65%) and Malignant: 239 (35%).
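The corrected class distribution can be checked with a short calculation (a minimal sketch; the counts are the ones quoted above):

```python
# Class counts after removing the 16 instances with missing values
benign, malignant = 444, 239
total = benign + malignant

assert total == 683  # matches the cleaned dataset size

# Percentages quoted on the slide (65% benign, 35% malignant)
benign_pct = round(100 * benign / total)
malignant_pct = round(100 * malignant / total)
print(benign_pct, malignant_pct)  # 65 35
```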


Attribute                      Domain
Sample Code Number             Id Number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for Benign, 4 for Malignant


EVALUATION METHODS
We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
It is also well suited for developing new machine learning schemes.
WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

importance of the input variables

Domain                         1     2     3    4     5    6    7    8    9    10   Sum
Clump Thickness              139    50   104   79   128   33   23   44   14    69   683
Uniformity of Cell Size      373    45    52   38    30   25   19   28    6    67   683
Uniformity of Cell Shape     346    58    53   43    32   29   30   27    7    58   683
Marginal Adhesion            393    58    58   33    23   21   13   25    4    55   683
Single Epithelial Cell Size   44   376    71   48    39   40   11   21    2    31   683
Bare Nuclei                  402    30    28   19    30    4    8   21    9   132   683
Bland Chromatin              150   160   161   39    34    9   71   28   11    20   683
Normal Nucleoli              432    36    42   18    19   22   16   23   15    60   683
Mitoses                      563    35    33   12     6    3    9    8    0    14   683
Sum                         2843   850   605  333   346  192  207  233   77   516

(The original table listed "Bare Nuclei" twice; the second row is relabeled Bland Chromatin, the attribute otherwise missing from the table.)
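Each attribute's value counts should sum to the 683 instances in the cleaned dataset; a quick check over two rows copied from the table above:

```python
# Per-value counts (domain values 1..10) for two attributes from the table
clump_thickness = [139, 50, 104, 79, 128, 33, 23, 44, 14, 69]
mitoses = [563, 35, 33, 12, 6, 3, 9, 8, 0, 14]

# Both rows account for all 683 instances
assert sum(clump_thickness) == 683
assert sum(mitoses) == 683
print("row sums ok")
```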

EXPERIMENTAL RESULTS

Evaluation Criteria                  BF Tree   IBK     SMO
Time to build model (in sec)         0.97      0.02    0.33
Correctly classified instances       652       655     657
Incorrectly classified instances     31        28      26
Accuracy (%)                         95.46     95.90   96.19

EXPERIMENTAL RESULTS
The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
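The definitions above translate directly into code. A minimal sketch (the helper names are mine, not from the paper):

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp, fp, tn, fn):
    """(TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

# Taking benign as the positive class, the SMO confusion matrix reported
# in the paper gives TP=431, FN=13, FP=13, TN=226:
print(round(accuracy(431, 13, 226, 13) * 100, 2))  # 96.19
```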

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows are the actual class; columns are predicted Benign / Malignant):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
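As a consistency check, the accuracies reported earlier follow directly from these confusion matrices (a minimal sketch; variable names are mine):

```python
# (TP, FN, FP, TN) with benign as the positive class, per classifier,
# read off the confusion matrices above
matrices = {
    "BF Tree": (431, 13, 18, 221),
    "IBK":     (435, 9, 19, 220),
    "SMO":     (431, 13, 13, 226),
}

for name, (tp, fn, fp, tn) in matrices.items():
    acc = 100 * (tp + tn) / (tp + fn + fp + tn)
    print(f"{name}: {acc:.2f}%")
# BF Tree: 95.46%
# IBK: 95.90%
# SMO: 96.19%
```

These match the accuracy row of the results table (95.46 / 95.90 / 96.19).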

importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.2325266    8
Uniformity of Cell Size       539.79308     0.702       0.300        180.2650266    1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.6733233    2
Marginal Adhesion             390.05950     0.464       0.210        130.2445000    7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.5427260    5
Bare Nuclei                   489.00953     0.603       0.303        163.3051760    3
Bland Chromatin               453.20971     0.555       0.201        151.3219033    4
Normal Nucleoli               416.63061     0.487       0.237        139.1182033    6
Mitoses                       191.96820     0.212       0.212        64.1227330     9
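The "Average Rank" column is the mean of the three criterion scores, and the importance order is the descending order of that mean. A minimal sketch reproducing the ranking (scores copied from the table; names are mine):

```python
# (chi-squared, info gain, gain ratio) per attribute, from the table above
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.05950, 0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.96820, 0.212, 0.212),
}

avg = {name: sum(vals) / 3 for name, vals in scores.items()}
ranking = sorted(avg, key=avg.get, reverse=True)
print(ranking[0])  # Uniformity of Cell Size (importance 1)
```

Note that the mean of an unnormalized chi-squared score with two values in [0, 1] is dominated by the chi-squared term, so this "average" is effectively a chi-squared ranking.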


CONCLUSION
The accuracy of the classification techniques is evaluated based on the selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
SMO performs at a higher level than the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work
• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques


Notes on paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions


comparison
"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea, fusing multiple classifiers.


References


[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report", Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon: IARC, "World Cancer Report", International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer".
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods", International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets", 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, "Transductive inference for text classification using support vector machines", Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. and Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian, "Computerized breast cancer diagnosis and prognosis from fine needle aspirates", Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L., "Nuclear feature extraction for breast tumor diagnosis", Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree", Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory", 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning", Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


Page 74: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Page 75: A Novel Approach for Breast Cancer Detection using Data Mining Techniques


Introduction

Breast cancer is on the rise across developing nations, due to the increase in life expectancy and to lifestyle changes such as women having fewer children.

Benign tumours:
• Are usually not harmful
• Rarely invade the tissues around them
• Don't spread to other parts of the body
• Can be removed and usually don't grow back

Malignant tumours:
• May be a threat to life
• Can invade nearby organs and tissues (such as the chest wall)
• Can spread to other parts of the body
• Often can be removed, but sometimes grow back


Risk factors

• Gender
• Age
• Genetic risk factors
• Family history
• Personal history of breast cancer
• Race: white or black
• Dense breast tissue: denser breast tissue carries a higher risk
• Certain benign (not cancer) breast problems
• Lobular carcinoma in situ
• Menstrual periods

Risk factors (continued)

• Breast radiation early in life
• Treatment with the drug DES (diethylstilbestrol) during pregnancy
• Not having children, or having them later in life
• Certain kinds of birth control
• Using hormone therapy after menopause
• Not breastfeeding
• Alcohol
• Being overweight or obese

BACKGROUND

Bittern et al. used an artificial neural network to predict the survivability of breast cancer patients. They tested their approach on a limited data set, but their results show good agreement with actual survival.

Vikas Chaurasia et al. used RepTree, RBF Network and Simple Logistic to predict the survivability of breast cancer patients.

Liu Ya-Qin et al. experimented on breast cancer data, using the C5 algorithm with bagging to predict breast cancer survivability.

BACKGROUND (continued)

Bellaachia et al. used naive Bayes, a decision tree and a back-propagation neural network to predict survivability in breast cancer patients. Although they reached good results (about 90% accuracy), their results were not significant, because they divided the data set into only two groups: patients who survived more than 5 years, and patients who died before 5 years.

Vikas Chaurasia et al. used Naive Bayes and the J48 decision tree to predict the survivability of heart disease patients.

BACKGROUND (continued)

Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and a decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs, using the C4.5 decision tree algorithm.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in chemometrics for the pharmaceutical industry.

BACKGROUND (continued)

Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from a heart disease dataset. Several classification function algorithms were used and tested. The performance factors used to analyse the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND (continued)

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was evaluated on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.

B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospital, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances; we removed the 16 instances with missing values to construct a new dataset with 683 instances.
• Class distribution as published: benign 458 (65.5%), malignant 241 (34.5%).
• Note: since 2 malignant and 14 benign instances were excluded, that percentage is wrong for the 683-instance dataset; the correct distribution is benign 444 (65%) and malignant 239 (35%).
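The percentage correction in the note above is one line of arithmetic; a quick sanity check in plain Python:

```python
# Published distribution was computed on the full 699 instances:
assert round(458 / 699 * 100, 1) == 65.5

# After dropping 16 instances (14 benign, 2 malignant), 683 remain:
benign, malignant = 458 - 14, 241 - 2        # 444 and 239
assert benign + malignant == 683
assert round(benign / 683 * 100) == 65       # benign is about 65%
assert round(malignant / 683 * 100) == 35    # malignant is about 35%
```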

Attribute                      Domain
Sample Code Number             Id number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for benign, 4 for malignant

EVALUATION METHODS

We used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.

• WEKA is a collection of machine learning algorithms for data mining tasks.
• The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.
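The agenda lists cross-validation among the evaluation topics, and Weka performs it internally (stratified 10-fold by default in the Explorer). As a minimal, tool-independent sketch of the k-fold idea in plain Python (the function name and round-robin fold construction are illustrative, not Weka's API):

```python
def kfold_indices(n_instances, k=10):
    """Split instance indices into k folds; each fold serves once as the test set."""
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    splits = []
    for test_fold in folds:
        test = set(test_fold)
        train = [i for i in range(n_instances) if i not in test]
        splits.append((train, test_fold))
    return splits
```

Every instance is tested exactly once, so the reported accuracy is an average over models trained on 90% of the data.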

EXPERIMENTAL RESULTS

Importance of the input variables: the table below gives the frequency of each attribute value (1 - 10) over the 683 instances.

Attribute                      1     2     3    4    5    6    7    8    9   10   Sum
Clump Thickness              139    50   104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373    45    52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346    58    53   43   32   29   30   27    7   58   683
Marginal Adhesion            393    58    58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44   376    71   48   39   40   11   21    2   31   683
Bare Nuclei                  402    30    28   19   30    4    8   21    9  132   683
Bland Chromatin              150   160   161   39   34    9   71   28   11   20   683
Normal Nucleoli              432    36    42   18   19   22   16   23   15   60   683
Mitoses                      563    35    33   12    6    3    9    8    0   14   683
Sum                         2843   850   605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree   IBK     SMO
Time to build model (sec)           0.97      0.02    0.33
Correctly classified instances      652       655     657
Incorrectly classified instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19

EXPERIMENTAL RESULTS

• The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
• The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
• The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).

where:
• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
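Expressed directly in code, the three definitions become (a minimal sketch; taking "malignant" as the positive class is an assumption for illustration):

```python
def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from the four confusion counts."""
    sensitivity = tp / (tp + fn)                 # TPR = TP / (TP + FN)
    specificity = tn / (tn + fp)                 # TNR = TN / (TN + FP)
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # (TP + TN) / total
    return sensitivity, specificity, accuracy
```

For example, the SMO counts reported later (TP = 226, FN = 13, TN = 431, FP = 13) give an accuracy of 657/683, i.e. the 96.19% in the table above.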

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.960       0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.980   0.079   0.958       0.980    Benign
             0.921   0.020   0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS - confusion matrices (rows: actual class; columns: predicted benign / malignant)

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
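The accuracies and per-class precision/recall reported above can be re-derived from these confusion matrices; a quick check in plain Python (assuming, as the numbers confirm, that rows are actual classes and columns are predictions):

```python
# {classifier: ((benign row), (malignant row))} from the confusion matrices above.
MATRICES = {
    "BF Tree": ((431, 13), (18, 221)),
    "IBK":     ((435, 9),  (19, 220)),
    "SMO":     ((431, 13), (13, 226)),
}

def accuracy(m):
    """Fraction of the 683 instances on the diagonal (correctly classified)."""
    (bb, bm), (mb, mm) = m
    return (bb + mm) / (bb + bm + mb + mm)

def benign_precision_recall(m):
    """Precision and recall for the benign class (first column / first row)."""
    (bb, bm), (mb, mm) = m
    return bb / (bb + mb), bb / (bb + bm)
```

For SMO this reproduces both the 96.19% accuracy and the 0.971 benign precision/recall from the earlier tables.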

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average score   Importance rank
Clump Thickness               378.08158     0.464       0.152        126.232526      8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026      1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.67332       2
Marginal Adhesion             390.0595      0.464       0.210        130.2445        7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726      5
Bare Nuclei                   489.00953     0.603       0.303        163.305176      3
Bland Chromatin               453.20971     0.555       0.201        151.3219        4
Normal Nucleoli               416.63061     0.487       0.237        139.1182        6
Mitoses                       191.9682      0.212       0.212        64.122733       9
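The average column is simply the mean of the three criterion scores, and the importance rank orders the attributes by that mean; a sketch reproducing the ranking (scores copied from the table, names illustrative):

```python
# (chi-squared, info gain, gain ratio) per attribute, from the importance table.
SCORES = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.0595,  0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.9682,  0.212, 0.212),
}

def importance_ranking(scores):
    """Rank attributes by the mean of their three selection scores, best first."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)
```

Because the raw chi-squared values dwarf the 0-1 info gain and gain ratio, this ranking is dominated by chi-squared; it still places Uniformity of Cell Size first, matching the conclusion.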

CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO performs at the highest level compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Use an updated version of Weka.
• Use another data mining tool.
• Use alternative algorithms and techniques.

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (ISSN 2277-0764), Volume 01, Issue 01, September 2012.

That paper introduced a more advanced idea and fused multiple classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets." Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection." Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data." International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods." Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers." Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques." Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M. "Neural Networks for Pattern Recognition." Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 76: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023

Risk factors

Gender Age Genetic risk factors Family history Personal history of breast cancer Race white or black Dense breast tissue denser breast tissue have

a higher risk Certain benign (not cancer) breast problems Lobular carcinoma in situ Menstrual periods

77 AAST-Comp eng

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 77: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng

Risk factors

Breast radiation early in life Treatment with DES the drug DES

(diethylstilbestrol) during pregnancy Not having children or having them later in

life Certain kinds of birth control Using hormone therapy after menopause Not breastfeeding Alcohol Being overweight or obese

78

0407202379

BACKGROUND Bittern et al used artificial neural network to

predict the survivability for breast cancer patients They tested their approach on a limited data set but their results show a good agreement with actual survival Traditional segmentation

Vikas Chaurasia et al used Representive Tree RBF Network and Simple Logistic to predict the survivability for breast cancer patients

Liu Ya-Qinlsquos experimented on breast cancer data using C5 algorithm with bagging to predict breast cancer survivability

AAST-Comp eng

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND

Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3) and decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.

Dong-Sheng Cao proposed a new decision-tree-based ensemble method, combined with a backward-elimination feature selection strategy, to find structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms were used and tested in this work; the performance factors used for analysing their efficiency were clustering accuracy and error rate. The results show that the logistic classification function is more efficient than multilayer perceptron and sequential minimal optimization.

BACKGROUND

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of decision tree with bagging and clustering. The algorithm was tested on two medical datasets (cardiotocography1, cardiotocography2) and on other datasets not related to the medical domain.

B. S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

• Source: the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
• 2 classes (malignant and benign) and 9 integer-valued attributes.
• breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
• Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
• Note: 2 malignant and 14 benign instances were excluded, so those percentages no longer hold; for the 683-instance dataset the distribution is benign 444 (65.0%) and malignant 239 (35.0%).
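The missing-value cleaning step described above can be sketched in a few lines of Python, assuming the UCI file convention of marking a missing value with "?" (the sample rows below are illustrative, not the full 699-instance file):

```python
# Sketch of the missing-value filtering used to go from 699 to 683 instances,
# assuming the UCI convention that a missing attribute is written as "?".

def drop_missing(rows):
    """Keep only rows in which every attribute is present (no '?')."""
    return [r for r in rows if "?" not in r]

# Tiny illustrative sample: the middle row has a missing Bare Nuclei value.
sample = [
    ["1000025", "5", "1", "1", "1", "2", "1", "3", "1", "1", "2"],
    ["1057013", "8", "4", "5", "1", "2", "?", "7", "3", "1", "4"],
    ["1015425", "3", "1", "1", "1", "2", "2", "3", "1", "1", "2"],
]
clean = drop_missing(sample)
print(len(clean))  # 2 of the 3 sample rows survive
```

Applied to the full file, the same filter removes the 16 incomplete instances and leaves 683.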

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for benign, 4 for malignant

EVALUATION METHODS

• We have used Weka (Waikato Environment for Knowledge Analysis), version 3.6.9.
• WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
• WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization and feature selection.
• It is also well suited for developing new machine learning schemes.
• WEKA is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS

Importance of the input variables

Value distribution of each attribute (number of instances with each value 1-10):

Attribute                      1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree   IBK     SMO
Time to Build Model (in sec)        0.97      0.02    0.33
Correctly Classified Instances      652       655     657
Incorrectly Classified Instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19
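The accuracy figures in the table follow directly from the correctly-classified counts over the 683 instances; a quick Python check:

```python
# Accuracy = correctly classified instances / total instances (683),
# using the counts reported in the table for each classifier.
total = 683
for name, correct in [("BF Tree", 652), ("IBK", 655), ("SMO", 657)]:
    print(f"{name}: {100 * correct / total:.2f}%")
# BF Tree: 95.46%
# IBK: 95.90%
# SMO: 96.19%
```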

EXPERIMENTAL RESULTS

The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN).

• True positive (TP) = number of positive samples correctly predicted
• False negative (FN) = number of positive samples wrongly predicted
• False positive (FP) = number of negative samples wrongly predicted as positive
• True negative (TN) = number of negative samples correctly predicted
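As a sanity check on these definitions, the sketch below applies them to the SMO confusion matrix reported later in this deck, treating benign as the positive class:

```python
# Sensitivity, specificity and accuracy computed from the SMO confusion
# matrix in this deck, with "benign" treated as the positive class.
tp, fn = 431, 13   # benign instances correctly / wrongly classified
fp, tn = 13, 226   # malignant misclassified as benign / correctly classified

sensitivity = tp / (tp + fn)                # TPR = TP / (TP + FN)
specificity = tn / (tn + fp)                # TNR = TN / (TN + FP)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # (TP + TN) / all samples

print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.4f}")
# 0.971 0.946 0.9619
```

The accuracy reproduces the 96.19% reported for SMO in the summary table.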

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree      0.971     0.075     0.960       0.971    Benign
             0.925     0.029     0.944       0.925    Malignant
IBK          0.980     0.079     0.958       0.980    Benign
             0.921     0.020     0.961       0.921    Malignant
SMO          0.971     0.054     0.971       0.971    Benign
             0.946     0.029     0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows = actual class, columns = predicted class):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
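The per-class precision and recall reported earlier can be recomputed from these confusion matrices; the sketch below does so for the BF Tree matrix (rows = actual class, columns = predicted class):

```python
# Per-class precision and recall from the BF Tree confusion matrix above.
matrix = {
    ("Benign", "Benign"): 431, ("Benign", "Malignant"): 13,
    ("Malignant", "Benign"): 18, ("Malignant", "Malignant"): 221,
}
classes = ["Benign", "Malignant"]

for c in classes:
    tp = matrix[(c, c)]
    predicted_c = sum(matrix[(a, c)] for a in classes)  # column sum
    actual_c = sum(matrix[(c, p)] for p in classes)     # row sum
    print(f"{c}: precision={tp / predicted_c:.3f} recall={tp / actual_c:.3f}")
# Benign: precision=0.960 recall=0.971
# Malignant: precision=0.944 recall=0.925
```

These match the BF Tree rows of the precision/recall table.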

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               378.08158     0.464       0.152        126.232526     8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026     1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323     2
Marginal Adhesion             390.0595      0.464       0.210        130.2445       7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726     5
Bare Nuclei                   489.00953     0.603       0.303        163.305176     3
Bland Chromatin               453.20971     0.555       0.201        151.321903     4
Normal Nucleoli               416.63061     0.487       0.237        139.118203     6
Mitoses                       191.9682      0.212       0.212        64.122733      9
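The Average Rank column appears to be the arithmetic mean of the three scores for each attribute (it reproduces the table values for most rows; the Mitoses row differs slightly, presumably from rounding in the source). A minimal check in Python:

```python
# Checking that Average Rank = mean(Chi-squared, Info Gain, Gain Ratio)
# for two rows of the importance table above.
scores = {
    "Uniformity of Cell Size": (539.79308, 0.702, 0.300),
    "Bare Nuclei": (489.00953, 0.603, 0.303),
}
for name, (chi2, info_gain, gain_ratio) in scores.items():
    print(f"{name}: {(chi2 + info_gain + gain_ratio) / 3:.4f}")
# Uniformity of Cell Size: 180.2650
# Bare Nuclei: 163.3052
```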

CONCLUSION

• The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK and BF Tree.
• SMO shows the highest performance compared with the other classifiers.
• The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work

• Using an updated version of Weka
• Using another data mining tool
• Using alternative algorithms and techniques

Notes on paper

• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculation
• Copying from old papers
• Charts not clear
• No contributions

Comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.

That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[4] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[5] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[6] D. Lavanya, Dr. K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[7] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[8] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[9] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[10] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[11] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[12] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[13] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[14] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[15] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[16] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[17] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[18] Bishop, C.M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[19] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[20] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


Page 79: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023

BACKGROUND Bellaachi et al used naive bayes decision tree

and back-propagation neural network to predict the survivability in breast cancer patients Although they reached good results (about 90 accuracy) their results were not significant due to the fact that they divided the data set to two groups one for the patients who survived more than 5 years and the other for those patients who died before 5 years

Vikas Chaurasia et al used Naive Bayes J48 Decision Tree to predict the survivability for Heart Diseases patients

80 AAST-Comp eng

04072023

BACKGROUND Vikas Chaurasia et al used CART (Classification and

Regression Tree) ID3 (Iterative Dichotomized 3) and decision table (DT) to predict the survivability for Heart Diseases patients

Pan wen conducted experiments on ECG data to identify abnormal high frequency electrocardiograph using decision tree algorithm C45

Dong-Sheng Caolsquos proposed a new decision tree based ensemble method combined with feature selection method backward elimination strategy to find the structure activity relationships in the area of chemo metrics related to pharmaceutical industry

81 AAST-Comp eng

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS

The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN), where:

True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
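These definitions can be written as small helpers. The example counts below are SMO's, taking the benign class as positive (TP = 431, FN = 13, FP = 13, TN = 226, from the confusion matrices in the results):

```python
def sensitivity(tp, fn):
    # True positive rate: TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # True negative rate: TN / (TN + FP)
    return tn / (tn + fp)

def accuracy(tp, fp, tn, fn):
    # (TP + TN) / (TP + FP + TN + FN)
    return (tp + tn) / (tp + fp + tn + fn)

# SMO on the breast-cancer-wisconsin dataset, benign taken as positive:
tp, fn, fp, tn = 431, 13, 13, 226
print(round(sensitivity(tp, fn), 3))       # 0.971
print(round(specificity(tn, fp), 3))       # 0.946
print(round(accuracy(tp, fp, tn, fn), 4))  # 0.9619
```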

EXPERIMENTAL RESULTS

Classifier   TP Rate   FP Rate   Precision   Recall   Class
BF Tree       0.971     0.075     0.96       0.971    Benign
              0.925     0.029     0.944      0.925    Malignant
IBK           0.98      0.079     0.958      0.98     Benign
              0.921     0.02      0.961      0.921    Malignant
SMO           0.971     0.054     0.971      0.971    Benign
              0.946     0.029     0.946      0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows = actual class, columns = predicted class):

Classifier   Predicted Benign   Predicted Malignant   Actual Class
BF Tree            431                  13             Benign
                    18                 221             Malignant
IBK                435                   9             Benign
                    19                 220             Malignant
SMO                431                  13             Benign
                    13                 226             Malignant
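The per-class precision and recall figures follow from these confusion matrices. For example, BF Tree on the benign class (431 benign instances predicted benign, 13 predicted malignant, and 18 malignant instances misclassified as benign):

```python
# BF Tree confusion-matrix entries, from the table above.
tp = 431  # benign classified as benign
fn = 13   # benign classified as malignant
fp = 18   # malignant classified as benign

precision = tp / (tp + fp)  # of everything predicted benign, how much was benign
recall = tp / (tp + fn)     # of the actual benign instances, how many were found

print(round(precision, 2))  # 0.96
print(round(recall, 3))     # 0.971
```

Both values match the BF Tree / Benign row of the precision-recall table.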

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness                 378.08158     0.464       0.152       126.232526        8
Uniformity of Cell Size         539.79308     0.702       0.3         180.265026        1
Uniformity of Cell Shape        523.07097     0.677       0.272       174.673323        2
Marginal Adhesion               390.0595      0.464       0.21        130.2445          7
Single Epithelial Cell Size     447.86118     0.534       0.233       149.542726        5
Bare Nuclei                     489.00953     0.603       0.303       163.305176        3
Bland Chromatin                 453.20971     0.555       0.201       151.321903        4
Normal Nucleoli                 416.63061     0.487       0.237       139.118203        6
Mitoses                         191.9682      0.212       0.212        64.122733        9
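The "Average Rank" column appears to be the plain mean of the three selection scores; checking this for the top-ranked attribute, Uniformity of Cell Size:

```python
# Scores for Uniformity of Cell Size, from the table above.
chi_squared = 539.79308
info_gain = 0.702
gain_ratio = 0.3

# The Average Rank column matches the mean of the three scores.
average = (chi_squared + info_gain + gain_ratio) / 3
print(round(average, 6))  # 180.265027
```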


CONCLUSION

The accuracy of the classification techniques is evaluated based on the selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree. SMO shows the highest performance compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.


Future work: using an updated version of Weka; using another data mining tool; using alternative algorithms and techniques.

Notes on paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.

Comparison: "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea and makes a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IAfRoC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of computer vision and pattern recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the international conference on engineering applications of neural networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of international conference on machine learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancer's Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive & descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • Classification - A Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM - Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 80: A Novel Approach for Breast Cancer Detection using Data Mining Techniques


BACKGROUND

Vikas Chaurasia et al. used CART (Classification and Regression Tree), ID3 (Iterative Dichotomized 3), and decision table (DT) to predict the survivability of heart disease patients.

Pan Wen conducted experiments on ECG data to identify abnormal high-frequency electrocardiographs using the decision tree algorithm C4.5.

Dong-Sheng Cao proposed a new decision tree based ensemble method, combined with a backward-elimination feature selection strategy, to find the structure-activity relationships in the area of chemometrics related to the pharmaceutical industry.

BACKGROUND

Dr. S. Vijayarani et al. analysed the performance of different classification function techniques in data mining for predicting heart disease from the heart disease dataset. The classification function algorithms are used and tested in this work. The performance factors used for analysing the efficiency of the algorithms are clustering accuracy and error rate. The results show that the logistics classification function's efficiency is better than multilayer perceptron and sequential minimal optimization.

BACKGROUND

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. This algorithm was experimented on two medical datasets (cardiotocography1, cardiotocography2) and other datasets not related to the medical domain.

B.S. Harish et al. presented various text representation schemes and compared different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

From the UC Irvine machine learning repository; data from the University of Wisconsin Hospitals, Madison, collected by Dr. W.H. Wolberg. 2 classes (malignant and benign) and 9 integer-valued attributes; breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances. Class distribution: Benign 458 (65.5%), Malignant 241 (34.5%). Note: 2 malignant and 14 benign instances were excluded, hence the percentage is wrong and the right one is benign 444 (65%) and malignant 239 (35%).
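The corrected class distribution is easy to verify from the cleaned instance counts:

```python
benign, malignant = 444, 239  # instances after removing rows with missing values
total = benign + malignant
print(total)                           # 683
print(round(100 * benign / total))     # 65
print(round(100 * malignant / total))  # 35
```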


Attribute                      Domain
Sample Code Number             Id Number
Clump Thickness                1 - 10
Uniformity of Cell Size        1 - 10
Uniformity of Cell Shape       1 - 10
Marginal Adhesion              1 - 10
Single Epithelial Cell Size    1 - 10
Bare Nuclei                    1 - 10
Bland Chromatin                1 - 10
Normal Nucleoli                1 - 10
Mitoses                        1 - 10
Class                          2 for Benign, 4 for Malignant
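In the raw data the class attribute is coded numerically (2 = benign, 4 = malignant). A small sketch of decoding a record; the record shown is illustrative, not asserted to be a row from the dataset:

```python
CLASS_LABELS = {2: "Benign", 4: "Malignant"}

# Illustrative record: 9 attribute values (each 1-10) followed by the class code.
record = [5, 1, 1, 1, 2, 1, 3, 1, 1, 2]
*features, class_code = record
print(CLASS_LABELS[class_code])  # Benign
```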


EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 81: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023

BACKGROUND Dr SVijayarani et al analyses the

performance of different classification function techniques in data mining for predicting the heart disease from the heart disease dataset The classification function algorithms is used and tested in this work The performance factors used for analyzing the efficiency of algorithms are clustering accuracy and error rate The result illustrates shows logistics classification function efficiency is better than multilayer perception and sequential minimal optimization

82 AAST-Comp eng

04072023AAST-Comp eng

BACKGROUND Kaewchinporn Clsquos presented a new classification

algorithm TBWC combination of decision tree with bagging and clustering This algorithm is experimented on two medical datasets cardiocography1 cardiocography2 and other datasets not related to medical domain

BS Harish et al presented various text representation schemes and compared different classifiers used to classify text documents to the predefined classes The existing methods are compared and contrasted based on various parameters namely criteria used for classification

83

04072023AAST-Comp eng84

BREAST-CANCER-WISCONSIN DATA SET

SUMMARY

BREAST-CANCER-WISCONSIN DATA SET SUMMARY the UC Irvine machine learning repository Data from University of Wisconsin Hospital Madison

collected by dr WH Wolberg 2 classes (malignant and benign) and 9 integer-valued

attributes breast-cancer-Wisconsin having 699 instances We removed the 16 instances with missing values from

the dataset to construct a new dataset with 683 instances Class distribution Benign 458 (655) Malignant 241

(345) Note 2 malignant and 14 benign excluded hence

percentage is wrong and the right one is benign 444 (65) and malignant 239 (35)

04072023AAST-Comp eng85

04072023 AAST-Comp eng 86

Attribute DomainSample Code Number Id Number

Clump Thickness 1 - 10

Uniformity Of Cell Size 1 - 10

Uniformity Of Cell Shape 1 - 10

Marginal Adhesion 1 - 10

Single Epithelial Cell Size 1 - 10

Bare Nuclei 1 - 10

Bland Chromatin 1 - 10

Normal Nucleoli 1 - 10

Mitoses 1 - 10

Class 2 For Benign 4 For Malignant

0407202387

EVALUATION METHODS We have used the Weka (Waikato Environment for

Knowledge Analysis) version 369 WEKA is a collection of machine learning algorithms

for data mining tasks The algorithms can either be applied directly to a

dataset or called from your own Java code WEKA contains tools for data preprocessing

classification regression clustering association rules visualization and feature selection

It is also well suited for developing new machine learning schemes

WEKA is open source software issued under the GNU General Public License

AAST-Comp eng

EXPERIMENTAL RESULTS

88 04072023AAST-Comp eng

EXPERIMENTAL RESULTS

89 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng


BACKGROUND

Kaewchinporn C. presented a new classification algorithm, TBWC, a combination of a decision tree with bagging and clustering. The algorithm was experimented on two medical datasets (cardiotocography1 and cardiotocography2) and on other datasets not related to the medical domain.

B. S. Harish et al. presented various text representation schemes and compared the different classifiers used to classify text documents into predefined classes. The existing methods are compared and contrasted based on various parameters, namely the criteria used for classification.

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

Obtained from the UC Irvine Machine Learning Repository; the data come from the University of Wisconsin Hospitals, Madison, collected by Dr. W. H. Wolberg.
2 classes (malignant and benign) and 9 integer-valued attributes.
breast-cancer-wisconsin has 699 instances. We removed the 16 instances with missing values from the dataset to construct a new dataset with 683 instances.
Class distribution: benign 458 (65.5%), malignant 241 (34.5%).
Note: since 2 malignant and 14 benign instances were excluded, those percentages are wrong for the reduced dataset; the correct distribution is benign 444 (65%) and malignant 239 (35%).
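The corrected class distribution above can be re-derived from the slide's own counts; this minimal sketch uses only numbers stated in the summary:

```python
# Re-derive the corrected class distribution after removing the
# 16 instances with missing values (all counts come from the slide).
benign_full, malignant_full = 458, 241           # 699-instance dataset
benign_excluded, malignant_excluded = 14, 2      # removed instances

benign = benign_full - benign_excluded           # remaining benign
malignant = malignant_full - malignant_excluded  # remaining malignant
total = benign + malignant

benign_pct = round(100 * benign / total)
malignant_pct = round(100 * malignant / total)
print(benign, malignant, total, benign_pct, malignant_pct)  # 444 239 683 65 35
```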

Attribute                   | Domain
Sample Code Number          | ID number
Clump Thickness             | 1 - 10
Uniformity of Cell Size     | 1 - 10
Uniformity of Cell Shape    | 1 - 10
Marginal Adhesion           | 1 - 10
Single Epithelial Cell Size | 1 - 10
Bare Nuclei                 | 1 - 10
Bland Chromatin             | 1 - 10
Normal Nucleoli             | 1 - 10
Mitoses                     | 1 - 10
Class                       | 2 for benign, 4 for malignant

EVALUATION METHODS

We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.
Weka contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection.
It is also well suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License.

EXPERIMENTAL RESULTS (results charts)

importance of the input variables

Count of instances taking each attribute value (1-10):

Attribute                   |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |  10 | Sum
Clump Thickness             | 139 |  50 | 104 |  79 | 128 |  33 |  23 |  44 | 14 |  69 | 683
Uniformity of Cell Size     | 373 |  45 |  52 |  38 |  30 |  25 |  19 |  28 |  6 |  67 | 683
Uniformity of Cell Shape    | 346 |  58 |  53 |  43 |  32 |  29 |  30 |  27 |  7 |  58 | 683
Marginal Adhesion           | 393 |  58 |  58 |  33 |  23 |  21 |  13 |  25 |  4 |  55 | 683
Single Epithelial Cell Size |  44 | 376 |  71 |  48 |  39 |  40 |  11 |  21 |  2 |  31 | 683
Bare Nuclei                 | 402 |  30 |  28 |  19 |  30 |   4 |   8 |  21 |  9 | 132 | 683
Bland Chromatin             | 150 | 160 | 161 |  39 |  34 |   9 |  71 |  28 | 11 |  20 | 683
Normal Nucleoli             | 432 |  36 |  42 |  18 |  19 |  22 |  16 |  23 | 15 |  60 | 683
Mitoses                     | 563 |  35 |  33 |  12 |   6 |   3 |   9 |   8 |  0 |  14 | 683
Sum                         | 2843 | 850 | 605 | 333 | 346 | 192 | 207 | 233 | 77 | 516 |

EXPERIMENTAL RESULTS

Evaluation Criteria               | BF Tree | IBK   | SMO
Time to build model (in sec)      | 0.97    | 0.02  | 0.33
Correctly classified instances    | 652     | 655   | 657
Incorrectly classified instances  | 31      | 28    | 26
Accuracy (%)                      | 95.46   | 95.90 | 96.19
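The Accuracy (%) row follows directly from the instance counts in the table; a minimal sketch:

```python
# Accuracy (%) = correctly classified / total instances, per classifier.
correct = {"BF Tree": 652, "IBK": 655, "SMO": 657}
incorrect = {"BF Tree": 31, "IBK": 28, "SMO": 26}

accuracy = {name: round(100 * correct[name] / (correct[name] + incorrect[name]), 2)
            for name in correct}
print(accuracy)  # {'BF Tree': 95.46, 'IBK': 95.9, 'SMO': 96.19}
```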

EXPERIMENTAL RESULTS

The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN).
The specificity, or true negative rate (TNR), is defined by TN / (TN + FP).
The accuracy is defined by (TP + TN) / (TP + FP + TN + FN).
True positive (TP) = number of positive samples correctly predicted.
False negative (FN) = number of positive samples wrongly predicted.
False positive (FP) = number of negative samples wrongly predicted as positive.
True negative (TN) = number of negative samples correctly predicted.
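The definitions above can be applied directly to the SMO confusion matrix reported below; this sketch treats malignant as the positive class (an assumption, since the slides do not state which class is positive):

```python
def classifier_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

# SMO confusion matrix from the results, malignant as the positive class
sens, spec, acc = classifier_metrics(tp=226, fn=13, fp=13, tn=431)
print(round(sens, 3), round(spec, 3), round(100 * acc, 2))  # 0.946 0.971 96.19
```

Note that the recovered values match the per-class SMO rates and the 96.19% accuracy reported in the tables.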

EXPERIMENTAL RESULTS

Classifier | Class     | TP Rate | FP Rate | Precision | Recall
BF Tree    | Benign    | 0.971   | 0.075   | 0.96      | 0.971
BF Tree    | Malignant | 0.925   | 0.029   | 0.944     | 0.925
IBK        | Benign    | 0.98    | 0.079   | 0.958     | 0.98
IBK        | Malignant | 0.921   | 0.02    | 0.961     | 0.921
SMO        | Benign    | 0.971   | 0.054   | 0.971     | 0.971
SMO        | Malignant | 0.946   | 0.029   | 0.946     | 0.946

EXPERIMENTAL RESULTS (confusion matrices; rows are actual classes)

Classifier | Benign | Malignant | Class
BF Tree    | 431    | 13        | Benign
BF Tree    | 18     | 221       | Malignant
IBK        | 435    | 9         | Benign
IBK        | 19     | 220       | Malignant
SMO        | 431    | 13        | Benign
SMO        | 13     | 226       | Malignant

importance of the input variables

Variable                    | Chi-squared | Info Gain | Gain Ratio | Average Rank | Importance
Clump Thickness             | 378.08158   | 0.464     | 0.152      | 126.232526   | 8
Uniformity of Cell Size     | 539.79308   | 0.702     | 0.3        | 180.265026   | 1
Uniformity of Cell Shape    | 523.07097   | 0.677     | 0.272      | 174.673323   | 2
Marginal Adhesion           | 390.0595    | 0.464     | 0.21       | 130.2445     | 7
Single Epithelial Cell Size | 447.86118   | 0.534     | 0.233      | 149.542726   | 5
Bare Nuclei                 | 489.00953   | 0.603     | 0.303      | 163.305176   | 3
Bland Chromatin             | 453.20971   | 0.555     | 0.201      | 151.321903   | 4
Normal Nucleoli             | 416.63061   | 0.487     | 0.237      | 139.118203   | 6
Mitoses                     | 191.9682    | 0.212     | 0.212      | 64.122733    | 9
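The "Average Rank" column appears to be the arithmetic mean of the three scores (chi-squared, info gain, gain ratio); a minimal sketch checking this against two rows of the table:

```python
# Check: Average Rank = mean(chi-squared, info gain, gain ratio),
# using the scores reported for two attributes in the table.
scores = {
    "Uniformity of Cell Size": (539.79308, 0.702, 0.3),
    "Bare Nuclei": (489.00953, 0.603, 0.303),
}
average_rank = {name: sum(vals) / len(vals) for name, vals in scores.items()}
print(average_rank)
```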


CONCLUSION

The accuracy of the classification techniques is evaluated for each selected classifier algorithm.
We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree.
SMO shows the best performance compared with the other classifiers.
The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work: use an updated version of Weka; use another data mining tool; use alternative algorithms and techniques.

Notes on paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.

comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012.
That paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, Dr. S. P. Rajagopalan and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi, "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W. N., Wolberg W. H., Mangasarian O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M., "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you

  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 85: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

BREAST-CANCER-WISCONSIN DATA SET SUMMARY

Attribute                     Domain
Sample Code Number            Id Number
Clump Thickness               1 - 10
Uniformity of Cell Size       1 - 10
Uniformity of Cell Shape      1 - 10
Marginal Adhesion             1 - 10
Single Epithelial Cell Size   1 - 10
Bare Nuclei                   1 - 10
Bland Chromatin               1 - 10
Normal Nucleoli               1 - 10
Mitoses                       1 - 10
Class                         2 for Benign, 4 for Malignant
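For readers reproducing the setup outside Weka: the UCI file behind this summary is plain CSV in the column order above, with the class coded 2/4. A minimal, illustrative loading sketch (the two sample records follow the documented format and are shown only for demonstration):

```python
import csv
import io

# Column order follows the data set summary above; in the UCI file,
# '?' marks missing Bare Nuclei values.
COLUMNS = ["id", "clump_thickness", "cell_size", "cell_shape",
           "marginal_adhesion", "epithelial_size", "bare_nuclei",
           "bland_chromatin", "normal_nucleoli", "mitoses", "class"]

sample = "1000025,5,1,1,1,2,1,3,1,1,2\n1002945,5,4,4,5,7,10,3,2,1,2\n"

def load(text):
    """Parse CSV records into dicts and decode the 2/4 class labels."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        row = dict(zip(COLUMNS, rec))
        row["label"] = "malignant" if row["class"] == "4" else "benign"
        rows.append(row)
    return rows

rows = load(sample)
print(rows[0]["label"])  # benign
```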

EVALUATION METHODS

We used Weka (Waikato Environment for Knowledge Analysis) version 3.6.9. WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. WEKA contains tools for data preprocessing, classification, regression, clustering, association rules, visualization, and feature selection, and it is also well suited for developing new machine learning schemes. WEKA is open source software issued under the GNU General Public License.


importance of the input variables


Domain                         1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness              139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size      373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape     346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion            393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size   44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                  402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin              150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli              432   36   42   18   19   22   16   23   15   60   683
Mitoses                      563   35   33   12    6    3    9    8    0   14   683
Sum                         2843  850  605  333  346  192  207  233   77  516

EXPERIMENTAL RESULTS

Evaluation Criteria                BF Tree   IBK     SMO
Timing to build model (in sec)     0.97      0.02    0.33
Correctly classified instances     652       655     657
Incorrectly classified instances   31        28      26
Accuracy (%)                       95.46     95.90   96.19

EXPERIMENTAL RESULTS

The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); the accuracy is defined by (TP + TN) / (TP + FP + TN + FN), where:

True positive (TP) = number of positive samples correctly predicted
False negative (FN) = number of positive samples wrongly predicted
False positive (FP) = number of negative samples wrongly predicted as positive
True negative (TN) = number of negative samples correctly predicted
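Written as a small, illustrative Python helper (not taken from the paper), these formulas reproduce the SMO figures in this section when the malignant class is taken as positive, using the confusion-matrix counts reported below:

```python
def rates(tp, fn, fp, tn):
    """Sensitivity (TPR), specificity (TNR) and accuracy from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# SMO on the Wisconsin data, malignant taken as the positive class:
# 226 malignant correct, 13 malignant missed, 13 benign flagged, 431 benign correct.
sens, spec, acc = rates(tp=226, fn=13, fp=13, tn=431)
print(round(sens, 3), round(spec, 3), round(acc, 4))  # 0.946 0.971 0.9619
```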

EXPERIMENTAL RESULTS

Classifier   TP      FP      Precision   Recall   Class
BF Tree      0.971   0.075   0.96        0.971    Benign
             0.925   0.029   0.944       0.925    Malignant
IBK          0.98    0.079   0.958       0.98     Benign
             0.921   0.02    0.961       0.921    Malignant
SMO          0.971   0.054   0.971       0.971    Benign
             0.946   0.029   0.946       0.946    Malignant

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class; columns: predicted class):

Classifier   Benign   Malignant   Class
BF Tree      431      13          Benign
             18       221         Malignant
IBK          435      9           Benign
             19       220         Malignant
SMO          431      13          Benign
             13       226         Malignant
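The accuracy column of the comparison table can be re-derived from these confusion matrices. A short consistency check (illustrative code; all counts are transcribed from the tables in this section):

```python
# (correct benign, benign predicted malignant,
#  malignant predicted benign, correct malignant)
matrices = {
    "BF Tree": (431, 13, 18, 221),
    "IBK":     (435, 9, 19, 220),
    "SMO":     (431, 13, 13, 226),
}

for name, (bb, bm, mb, mm) in matrices.items():
    total = bb + bm + mb + mm          # 683 instances in every case
    accuracy = (bb + mm) / total       # correctly classified / total
    print(f"{name}: {100 * accuracy:.2f}%")
# BF Tree: 95.46%
# IBK: 95.90%
# SMO: 96.19%
```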

importance of the input variables


Variable                      Chi-squared   Info Gain   Gain Ratio   Average Rank   Importance
Clump Thickness               37.808158     0.464       0.152        12.6232526     8
Uniformity of Cell Size       53.979308     0.702       0.3          18.0265026     1
Uniformity of Cell Shape      52.307097     0.677       0.272        17.4673323     2
Marginal Adhesion             39.00595      0.464       0.21         13.02445       7
Single Epithelial Cell Size   44.786118     0.534       0.233        14.9542726     5
Bare Nuclei                   48.900953     0.603       0.303        16.3305176     3
Bland Chromatin               45.320971     0.555       0.201        15.1321903     4
Normal Nucleoli               41.663061     0.487       0.237        13.9118203     6
Mitoses                       19.19682      0.212       0.212        6.4122733      9
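The slides do not spell out how the three scores are combined into the "Average Rank" column. One common, purely illustrative scheme is to rank the attributes under each criterion and average the ranks; with the scores transcribed from the table above, it likewise puts Uniformity of Cell Size first:

```python
# Scores transcribed from the importance table: (chi-squared, info gain, gain ratio).
scores = {
    "Clump Thickness":             (37.808158, 0.464, 0.152),
    "Uniformity of Cell Size":     (53.979308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (52.307097, 0.677, 0.272),
    "Marginal Adhesion":           (39.00595,  0.464, 0.210),
    "Single Epithelial Cell Size": (44.786118, 0.534, 0.233),
    "Bare Nuclei":                 (48.900953, 0.603, 0.303),
    "Bland Chromatin":             (45.320971, 0.555, 0.201),
    "Normal Nucleoli":             (41.663061, 0.487, 0.237),
    "Mitoses":                     (19.19682,  0.212, 0.212),
}

def combined_ranking(scores):
    """Rank attributes under each criterion (1 = best), then sort by average rank."""
    names = list(scores)
    avg_rank = {n: 0.0 for n in names}
    for i in range(3):  # one pass per scoring criterion
        ordered = sorted(names, key=lambda n: scores[n][i], reverse=True)
        for rank, n in enumerate(ordered, start=1):
            avg_rank[n] += rank / 3
    return sorted(names, key=lambda n: avg_rank[n])

print(combined_ranking(scores)[0])  # Uniformity of Cell Size
```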

CONCLUSION

The accuracy of classification techniques is evaluated based on the selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree. SMO shows the highest performance compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work: use an updated version of Weka; use another data mining tool; use alternative algorithms and techniques.

Notes on paper: spelling mistakes; no point of contact (e-mail); wrong percentage calculation; copying from old papers; charts not clear; no contributions.

comparison

Compared with "Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012: that paper introduced a more advanced idea and made a fusion between classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem". Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets". Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund and F. Girosi. "Training support vector machines: Application to face detection". Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data". International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods". Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers". Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging 1993; 1905:861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3), 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques". Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M. "Neural Networks for Pattern Recognition". Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


EXPERIMENTAL RESULTS

89 — 04/07/2023 — AAST-Comp eng

Importance of the input variables

Distribution of each attribute's values over its domain (1–10); every attribute covers all 683 instances:

Attribute                       1    2    3    4    5    6    7    8    9   10   Sum
Clump Thickness               139   50  104   79  128   33   23   44   14   69   683
Uniformity of Cell Size       373   45   52   38   30   25   19   28    6   67   683
Uniformity of Cell Shape      346   58   53   43   32   29   30   27    7   58   683
Marginal Adhesion             393   58   58   33   23   21   13   25    4   55   683
Single Epithelial Cell Size    44  376   71   48   39   40   11   21    2   31   683
Bare Nuclei                   402   30   28   19   30    4    8   21    9  132   683
Bland Chromatin               150  160  161   39   34    9   71   28   11   20   683
Normal Nucleoli               432   36   42   18   19   22   16   23   15   60   683
Mitoses                       563   35   33   12    6    3    9    8    0   14   683
Sum                          2843  850  605  333  346  192  207  233   77  516
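As a quick consistency check, each row of the table above should sum to the 683 instances of the breast-cancer-wisconsin data set. A minimal sketch (counts transcribed from the table; not part of the original paper):

```python
# Each attribute's counts over the domain values 1..10 should sum
# to the 683 instances of the data set.
counts = {
    "Clump Thickness":             [139, 50, 104, 79, 128, 33, 23, 44, 14, 69],
    "Uniformity of Cell Size":     [373, 45, 52, 38, 30, 25, 19, 28, 6, 67],
    "Uniformity of Cell Shape":    [346, 58, 53, 43, 32, 29, 30, 27, 7, 58],
    "Marginal Adhesion":           [393, 58, 58, 33, 23, 21, 13, 25, 4, 55],
    "Single Epithelial Cell Size": [44, 376, 71, 48, 39, 40, 11, 21, 2, 31],
    "Bare Nuclei":                 [402, 30, 28, 19, 30, 4, 8, 21, 9, 132],
    "Bland Chromatin":             [150, 160, 161, 39, 34, 9, 71, 28, 11, 20],
    "Normal Nucleoli":             [432, 36, 42, 18, 19, 22, 16, 23, 15, 60],
    "Mitoses":                     [563, 35, 33, 12, 6, 3, 9, 8, 0, 14],
}
for name, row in counts.items():
    assert sum(row) == 683, name   # every row checks out
print("all attribute counts sum to 683")
```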

EXPERIMENTAL RESULTS

Evaluation Criteria                 BF Tree   IBK     SMO
Time to build model (in sec)        0.97      0.02    0.33
Correctly classified instances      652       655     657
Incorrectly classified instances    31        28      26
Accuracy (%)                        95.46     95.90   96.19
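The accuracy row follows directly from the instance counts: accuracy = correctly classified / total. A short sketch (classifier names and counts taken from the table above):

```python
# Recomputing the Accuracy (%) row from the correctly / incorrectly
# classified instance counts (683 instances in every case).
results = {"BF Tree": (652, 31), "IBK": (655, 28), "SMO": (657, 26)}
for clf, (correct, wrong) in results.items():
    acc = 100.0 * correct / (correct + wrong)
    print(f"{clf}: {acc:.2f}%")   # 95.46%, 95.90%, 96.19%
```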

EXPERIMENTAL RESULTS

The sensitivity, or true positive rate (TPR), is defined by TP / (TP + FN); the specificity, or true negative rate (TNR), is defined by TN / (TN + FP); and the accuracy is defined by (TP + TN) / (TP + FP + TN + FN), where:

• True positive (TP) = number of positive samples correctly predicted.
• False negative (FN) = number of positive samples wrongly predicted.
• False positive (FP) = number of negative samples wrongly predicted as positive.
• True negative (TN) = number of negative samples correctly predicted.
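These definitions translate directly into code. The sketch below applies them to the SMO confusion matrix reported in these slides, taking the malignant class as "positive" (an assumption made here for illustration):

```python
# Sensitivity, specificity and accuracy as defined above, applied to
# the SMO result with malignant as the positive class:
# TP = 226, FN = 13, FP = 13, TN = 431.
def sensitivity(tp, fn):
    return tp / (tp + fn)          # TPR = TP / (TP + FN)

def specificity(tn, fp):
    return tn / (tn + fp)          # TNR = TN / (TN + FP)

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

tp, fn, fp, tn = 226, 13, 13, 431
print(round(sensitivity(tp, fn), 3))       # 0.946
print(round(specificity(tn, fp), 3))       # 0.971
print(round(accuracy(tp, fp, tn, fn), 4))  # 0.9619
```

The three values agree with the SMO rows of the tables that follow (0.946 recall for Malignant, 0.971 for Benign, 96.19% accuracy).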

EXPERIMENTAL RESULTS

Per-class detailed accuracy:

Classifier   Class       TP Rate   FP Rate   Precision   Recall
BF Tree      Benign      0.971     0.075     0.96        0.971
             Malignant   0.925     0.029     0.944       0.925
IBK          Benign      0.98      0.079     0.958       0.98
             Malignant   0.921     0.02      0.961       0.921
SMO          Benign      0.971     0.054     0.971       0.971
             Malignant   0.946     0.029     0.946       0.946

EXPERIMENTAL RESULTS

Confusion matrices (rows: actual class; columns: predicted class):

Classifier   Actual      Predicted Benign   Predicted Malignant
BF Tree      Benign      431                13
             Malignant   18                 221
IBK          Benign      435                9
             Malignant   19                 220
SMO          Benign      431                13
             Malignant   13                 226
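The precision and recall figures in the previous table can be re-derived from these matrices. A sketch, assuming rows are the actual class and columns the predicted class:

```python
# Deriving per-class precision and recall from the confusion
# matrices: precision uses the predicted column, recall the actual row.
matrices = {
    "BF Tree": ((431, 13), (18, 221)),
    "IBK":     ((435, 9),  (19, 220)),
    "SMO":     ((431, 13), (13, 226)),
}
for name, ((bb, bm), (mb, mm)) in matrices.items():
    # bb: benign predicted benign, bm: benign predicted malignant,
    # mb: malignant predicted benign, mm: malignant predicted malignant
    print(name,
          round(bb / (bb + mb), 3),   # precision, Benign
          round(bb / (bb + bm), 3),   # recall, Benign
          round(mm / (mm + bm), 3),   # precision, Malignant
          round(mm / (mm + mb), 3))   # recall, Malignant
    # e.g. SMO -> 0.971 0.971 0.946 0.946
```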

Importance of the input variables

Variable                      Chi-squared   Info Gain   Gain Ratio   Average      Importance
Clump Thickness               378.08158     0.464       0.152        126.232526   8
Uniformity of Cell Size       539.79308     0.702       0.300        180.265026   1
Uniformity of Cell Shape      523.07097     0.677       0.272        174.673323   2
Marginal Adhesion             390.0595      0.464       0.210        130.2445     7
Single Epithelial Cell Size   447.86118     0.534       0.233        149.542726   5
Bare Nuclei                   489.00953     0.603       0.303        163.305176   3
Bland Chromatin               453.20971     0.555       0.201        151.321903   4
Normal Nucleoli               416.63061     0.487       0.237        139.118203   6
Mitoses                       191.9682      0.212       0.212        64.122733    9
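The "Average" column is the mean of the three scores, and sorting by it reproduces the importance ranking. A sketch (scores transcribed from the table, with decimal points restored; the recomputed Mitoses average differs from the table's 64.122733 by a small rounding amount):

```python
# Mean of the chi-squared, info-gain and gain-ratio scores per
# attribute; sorting descending by the mean gives the importance rank.
scores = {
    "Clump Thickness":             (378.08158, 0.464, 0.152),
    "Uniformity of Cell Size":     (539.79308, 0.702, 0.300),
    "Uniformity of Cell Shape":    (523.07097, 0.677, 0.272),
    "Marginal Adhesion":           (390.0595,  0.464, 0.210),
    "Single Epithelial Cell Size": (447.86118, 0.534, 0.233),
    "Bare Nuclei":                 (489.00953, 0.603, 0.303),
    "Bland Chromatin":             (453.20971, 0.555, 0.201),
    "Normal Nucleoli":             (416.63061, 0.487, 0.237),
    "Mitoses":                     (191.9682,  0.212, 0.212),
}
ranked = sorted(scores, key=lambda a: sum(scores[a]) / 3, reverse=True)
print(ranked[0])    # Uniformity of Cell Size
print(ranked[-1])   # Mitoses
```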

CONCLUSION

• The accuracy of the classification techniques was evaluated for each selected classifier algorithm.
• We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBk, and BF Tree.
• SMO showed the highest performance compared with the other classifiers.
• The most important attribute for breast cancer diagnosis is Uniformity of Cell Size.

Future work

• Use an updated version of Weka.
• Try other data mining tools.
• Apply alternative algorithms and techniques.

Notes on the paper

• Spelling mistakes.
• No point of contact (e-mail).
• Wrong percentage calculations.
• Copying from old papers.
• Charts not clear.
• No contributions.

Comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea and fused multiple classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188–193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem." Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March – 1 April 2005).
[2] S. Aruna, Dr. S.P. Rajagopalan and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.
[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, M.D., W. Nick Street, Ph.D., Dennis M. Heisey, Ph.D., Olvi L. Mangasarian, Ph.D. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street W.N., Wolberg W.H., Mangasarian O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905:861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA, 185.

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 89: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

importance of the input variables

04072023AAST-Comp eng90

Domain 1 2 3 4 5 6 7 8 9 10 Sum

Clump Thickness 139 50 104 79 128 33 23 44 14 69 683

Uniformity of Cell Size 373 45 52 38 30 25 19 28 6 67 683

Uniformity of Cell Shape 346 58 53 43 32 29 30 27 7 58 683

Marginal Adhesion 393 58 58 33 23 21 13 25 4 55 683

Single Epithelial Cell Size 44 376 71 48 39 40 11 21 2 31 683

Bare Nuclei 402 30 28 19 30 4 8 21 9 132 683

Bare Nuclei 150 160 161 39 34 9 71 28 11 20 683

Normal Nucleoli 432 36 42 18 19 22 16 23 15 60 683

Mitoses 563 35 33 12 6 3 9 8 0 14 683

Sum 2843 850 605 333 346 192 207 233 77 516

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 90: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

EXPERIMENTAL RESULTS

91 04072023AAST-Comp eng

Evaluation Criteria

Classifiers

BF TREE IBK SMO

Timing To Build Model (In Sec)

097 002 033

Correctly Classified Instances

652 655 657

Incorrectly Classified Instances

31 28 26

Accuracy () 9546 9590 9619

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 91: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

EXPERIMENTAL RESULTS The sensitivity or the true positive rate (TPR) is defined

by TP (TP + FN) the specificity or the true negative rate (TNR) is

defined by TN (TN + FP) the accuracy is defined by (TP + TN) (TP + FP + TN +

FN) True positive (TP) = number of positive samples

correctly predicted False negative (FN) = number of positive samples

wrongly predicted False positive (FP) = number of negative samples

wrongly predicted as positive True negative (TN) = number of negative samples

correctly predicted92 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

CONCLUSION

The accuracy of the classification techniques was evaluated for each selected classifier algorithm. We used three popular data mining methods: Sequential Minimal Optimization (SMO), IBK, and BF Tree. SMO showed the highest performance compared with the other classifiers. The most important attribute for breast cancer survival is Uniformity of Cell Size.

Future work
• Use an updated version of Weka
• Try another data mining tool
• Apply alternative algorithms and techniques

Notes on the paper
• Spelling mistakes
• No point of contact (e-mail)
• Wrong percentage calculations
• Copying from old papers
• Unclear charts
• No contributions

Comparison

"Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers", International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012. That paper introduced a more advanced idea, fusing multiple classifiers.

References

[1] U.S. Cancer Statistics Working Group. United States Cancer Statistics: 1999-2008 Incidence and Mortality Web-based Report. Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] Lyon, IARC. World Cancer Report. International Agency for Research on Cancer Press, 2003: 188-193.
[3] Elattar, Inas. "Breast Cancer: Magnitude of the Problem", Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt (30 March - 1 April 2005).
[2] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, K. Usha Rani. "Analysis of feature selection with classification: Breast cancer datasets", Indian Journal of Computer Science and Engineering (IJCSE), October 2011.
[5] E. Osuna, R. Freund, and F. Girosi. "Training support vector machines: Application to face detection", Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130-136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya. "Ensemble Decision Tree Classifier for Breast Cancer Data", International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17-24, Feb. 2012.
[8] B. Ster and A. Dobnikar. "Neural networks in medical diagnosis: Comparison with other methods", Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427-430, 1996.
[9] T. Joachims. Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert. "Supervised fuzzy clustering for the identification of fuzzy classifiers", Pattern Recognition Letters, vol. 14(24), 2195-2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, Olvi L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.
[13] Street, W. N., Wolberg, W. H., Mangasarian, O. L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861-70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing, 70(1-3): 305-313.
[15] J. Han and M. Kamber. "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2000.
[16] Bishop, C. M. "Neural Networks for Pattern Recognition", Oxford University Press, New York (1999).
[17] Vapnik, V. N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 92: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

EXPERIMENTAL RESULTSClassifi

erTP FP Precisio

nRecall Class

BF Tree

0971 0075 096 0971 Benign

0925 0029 0944 0925 Malignant

IBK

098 0079 0958 098 Benign

0921 002 0961 0921 Malignant

SMO

0971 0054 0971 0971 Benign

0946 0029 0946 0946 Malignant

93 04072023AAST-Comp eng

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 93: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

EXPERIMENTAL RESULTSClassifier Benign Malignant Class

BF Tree

431 13 Benign

18 221 Malignant

IBK

435 9 Benign

19 220 Malignant

SMO

431 13 Benign

13 226 Malignant

94 04072023AAST-Comp eng

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 94: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

importance of the input variables

04072023AAST-Comp eng95

variable Chi-squared

Info Gain

Gain Ratio

Average Rank IMPORTANCE

Clump Thickness 37808158 0464 0152 12623252

6 8Uniformity of

Cell Size

53979308 0702 03 180265026 1

Uniformity of Cell Shape 52307097 0677 0272

17467332

32

Marginal Adhesion 3900595 0464 021 1302445 7

Single Epithelial Cell Size

44786118 0534 0233

149542726

5

Bare Nuclei 48900953 0603 0303

163305176

3

Bland Chromatin 45320971 0555 0201

15132190

34

Normal Nucleoli 41663061 0487 0237

13911820

36

Mitoses 1919682 0212 0212 64122733 9

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 95: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

04072023AAST-Comp eng96

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S. Aruna, Dr. S.P. Rajagopalan, and L.V. Nandakishore (2011). Knowledge based analysis of various statistical tools in detecting breast cancer.
[3] Angeline Christobel Y., Dr. Sivaprakasam (2011). An Empirical Comparison of Data Mining Classification Methods. International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya, Dr. K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.


[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar, Hemant P. Ambulgekar (2009). Approach of Neural Network to Diagnose Breast Cancer on three different Data Sets. 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.


[9] T. Joachims, Transductive inference for text classification using support vector machines. Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), 2195–2207, 2003.
[11] Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, MD; W. Nick Street, PhD; Dennis M. Heisey, PhD; Olvi L. Mangasarian, PhD. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Western Surgical Association meeting, Palm Desert, California, November 14, 1994.


[13] Street, W.N., Wolberg, W.H., Mangasarian, O.L. Nuclear feature extraction for breast tumor diagnosis. Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993; 1905: 861–70.
[14] Chen, Y., Abraham, A., Yang, B. (2006). Feature Selection and Classification using Flexible Neural Tree. Journal of Neurocomputing 70(1-3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York (1999).
[17] Vapnik, V.N. The Nature of Statistical Learning Theory, 1st ed. Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.


Thank you


  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 96: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

0407202397

CONCLUSION the accuracy of classification techniques is

evaluated based on the selected classifier algorithm

we used three popular data mining methods Sequential Minimal Optimization (SMO) IBK BF Tree

The performance of SMO shows the high level compare with other classifiers

most important attributes for breast cancer survivals are Uniformity of Cell Size

AAST-Comp eng

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 97: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

0407202398

Future work using updated version of weka Using another data mining tool Using alternative algorithms and techniques

AAST-Comp eng

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 98: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Notes on paper Spelling mistakes No point of contact (e - mail) Wrong percentage calculation Copying from old papers Charts not clear No contributions

04072023AAST-Comp eng99

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 99: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

comparison Breast Cancer Diagnosis on Three Different

Datasets Using Multi-Classifiers written International Journal of Computer and

Information Technology (2277 ndash 0764) Volume 01ndash Issue 01 September 2012

Paper introduced more advanced idea and make a fusion between classifiers

04072023AAST-Comp eng100

References

101AAST-Comp eng

[1] US Cancer Statistics Working Group United States CancerStatistics 1999ndash2008 Incidence and Mortality Web-based ReportAtlanta (GA) Department of Health and Human Services Centers forDisease Control[2] Lyon IAfRoC World Cancer Report International Agency for Research on Cancer Press 2003188-193 [3] Elattar Inas ldquoBreast Cancer Magnitude of the ProblemrdquoEgyptian Society of Surgical Oncology Conference TabaSinai in Egypt (30 March ndash 1April 2005)

[2] S Aruna Dr SP Rajagopalan and LV Nandakishore (2011)Knowledge based analysis of various statistical tools in detectingbreast cancer[3] Angeline Christobel Y Dr Sivaprakasam (2011) An EmpiricalComparison of Data Mining Classification Methods InternationalJournal of Computer Information SystemsVol 3 No 2 2011[4] DLavanya DrKUsha Ranirdquo Analysis of feature selection withclassification Breast cancer datasetsrdquoIndian Journal of ComputerScience and Engineering (IJCSE)October 201104072023

AAST-Comp eng 102

[5] EOsuna RFreund and F Girosi ldquoTraining support vector machinesApplication to face detectionrdquo Proceedings of computer vision and pattern recognition Puerto Rico pp 130ndash1361997[6] Vaibhav Narayan Chunekar Hemant P Ambulgekar (2009)Approach of Neural Network to Diagnose Breast Cancer on three different Data Set 2009 International Conference on Advances in Recent Technologies in Communication and Computing[7] D Lavanya ldquoEnsemble Decision Tree Classifier for Breast Cancer Datardquo International Journal of Information Technology Convergence and Services vol 2 no 1 pp 17-24 Feb 2012[8] BSter and ADobnikar ldquoNeural networks in medical diagnosis Comparison with other methodsrdquo Proceedings of the international conference on engineering applications of neural networks pp 427ndash430 1996

04072023

[9] TJoachims Transductive inference for text classification using support vector machines Proceedings of international conference machine learning Slovenia 1999[10] JAbonyi and F Szeifert ldquoSupervised fuzzy clustering for the identification of fuzzy classifiersrdquo Pattern Recognition Letters vol14(24) 2195ndash22072003[11] Frank A amp Asuncion A (2010) UCI Machine Learning Repository [httparchiveicsucieduml] Irvine CA University of CaliforniaSchool of Information and Computer Science[12] William H Wolberg MD W Nick Street PhD Dennis M Heisey PhD Olvi L Mangasarian PhD computerized breast cancer diagnosis and prognosis from fine needle aspirates Western Surgical Association meeting in Palm Desert California November 14 1994

AAST-Comp eng 10304072023

AAST-Comp eng 104

[13] Street WN Wolberg WH Mangasarian OL Nuclear feature extraction for breast tumor diagnosis Proceedings ISampT SPIE International Symposium on Electronic Imaging 1993 1905861ndash70[14] Chen Y Abraham A Yang B(2006) Feature Selection and Classification using Flexible Neural Tree Journal of Neurocomputing 70(1-3) 305ndash313[15] J Han and M KamberrdquoData Mining Concepts and TechniquesrdquoMorgan Kauffman Publishers 2000[16] Bishop CM ldquoNeural Networks for Pattern Recognitionrdquo Oxford University PressNew York (1999)[17] Vapnik VN The Nature of Statistical Learning Theory 1st edSpringer-VerlagNew York 1995[18] Ross Quinlan (1993) C45 Programs for Machine Learning Morgan Kaufmann Publishers San Mateo CA185

04072023

04072023105

Thank you

AAST-Comp eng

  • A Novel Approach for Breast Cancer Detection using Data Mining
  • AGENDA
  • AGENDA (Cont)
  • What Is Cancer
  • Slide 5
  • Breast Cancer
  • History and Background
  • Breast Cancer Classification
  • Breast cancerrsquos Features
  • Diagnosis or prognosis
  • Computer-Aided Diagnosis
  • Computational Intelligence
  • What do these methods do
  • Pattern recognition system decomposition
  • Slide 15
  • data sets
  • Slide 17
  • Data Mining
  • Predictive amp descriptive data mining
  • Data Mining Models and Tasks
  • Data mining Tools
  • weka
  • Slide 23
  • Data Preprocessing
  • Preprocessing techniques
  • Slide 26
  • Slide 27
  • Feature Selection
  • Slide 29
  • Supervised Learning
  • Classification
  • Classification vs Prediction
  • ClassificationmdashA Two-Step Process
  • Classification Process (1) Model Construction
  • Classification Process (2) Use the Model in Prediction
  • Classification (2)
  • Classification
  • Quality of a classifier
  • Classification Techniques
  • Classification Techniques (2)
  • Classification Model
  • Support Vector Machine (SVM)
  • Support Vector Machine (SVM) (2)
  • Tennis example
  • Linear classifiers Which Hyperplane
  • Selection of a Good Hyper-Plane
  • SVM ndash Support Vector Machines
  • Support Vector Machine (SVM) (3)
  • Non-Separable Case
  • SVM
  • Classification Model (2)
  • Slide 53
  • K-Nearest Neighbor Algorithm
  • Slide 55
  • Instance Based Learning
  • Example 3-Nearest Neighbors
  • Slide 58
  • Slide 59
  • Decision Tree
  • Slide 61
  • Example
  • Payouts and Probabilities
  • Jenny Lind - Payoff Table
  • Using Expected Return Criteria
  • Decision Trees
  • Example Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree
  • Jenny Lind Decision Tree - Solved
  • Slide 71
  • Evaluation Metrics
  • Cross-validation
  • A Novel Approach for Breast Cancer Detection using Data Mining
  • Abstract
  • Introduction
  • Risk factors
  • Risk factors (2)
  • BACKGROUND
  • BACKGROUND (2)
  • BACKGROUND (3)
  • BACKGROUND (4)
  • BACKGROUND (5)
  • Slide 84
  • BREAST-CANCER-WISCONSIN DATA SET SUMMARY
  • Slide 86
  • EVALUATION METHODS
  • EXPERIMENTAL RESULTS
  • EXPERIMENTAL RESULTS (2)
  • importance of the input variables
  • EXPERIMENTAL RESULTS (3)
  • EXPERIMENTAL RESULTS (4)
  • EXPERIMENTAL RESULTS (5)
  • EXPERIMENTAL RESULTS (6)
  • importance of the input variables (2)
  • Slide 96
  • CONCLUSION
  • Future work
  • Notes on paper
  • comparison
  • References
  • Slide 102
  • Slide 103
  • Slide 104
  • Slide 105
Page 100: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

References

[1] U.S. Cancer Statistics Working Group, "United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report," Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.
[2] World Cancer Report, Lyon: International Agency for Research on Cancer (IARC) Press, 2003, pp. 188–193.
[3] Elattar, Inas, "Breast Cancer: Magnitude of the Problem," Egyptian Society of Surgical Oncology Conference, Taba, Sinai, Egypt, 30 March – 1 April 2005.
[2] S. Aruna, S. P. Rajagopalan, and L. V. Nandakishore (2011), "Knowledge based analysis of various statistical tools in detecting breast cancer."
[3] Angeline Christobel, Y., and Sivaprakasam (2011), "An Empirical Comparison of Data Mining Classification Methods," International Journal of Computer Information Systems, Vol. 3, No. 2, 2011.
[4] D. Lavanya and K. Usha Rani, "Analysis of feature selection with classification: Breast cancer datasets," Indian Journal of Computer Science and Engineering (IJCSE), October 2011.

Page 101: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

[5] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: Application to face detection," Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, pp. 130–136, 1997.
[6] Vaibhav Narayan Chunekar and Hemant P. Ambulgekar (2009), "Approach of Neural Network to Diagnose Breast Cancer on Three Different Data Sets," 2009 International Conference on Advances in Recent Technologies in Communication and Computing.
[7] D. Lavanya, "Ensemble Decision Tree Classifier for Breast Cancer Data," International Journal of Information Technology Convergence and Services, vol. 2, no. 1, pp. 17–24, Feb. 2012.
[8] B. Ster and A. Dobnikar, "Neural networks in medical diagnosis: Comparison with other methods," Proceedings of the International Conference on Engineering Applications of Neural Networks, pp. 427–430, 1996.

Page 102: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

[9] T. Joachims, "Transductive inference for text classification using support vector machines," Proceedings of the International Conference on Machine Learning, Slovenia, 1999.
[10] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for the identification of fuzzy classifiers," Pattern Recognition Letters, vol. 14(24), pp. 2195–2207, 2003.
[11] Frank, A., and Asuncion, A. (2010), UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], Irvine, CA: University of California, School of Information and Computer Science.
[12] William H. Wolberg, W. Nick Street, Dennis M. Heisey, and Olvi L. Mangasarian, "Computerized breast cancer diagnosis and prognosis from fine needle aspirates," Western Surgical Association meeting, Palm Desert, California, November 14, 1994.

Page 103: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

[13] Street, W.N., Wolberg, W.H., and Mangasarian, O.L., "Nuclear feature extraction for breast tumor diagnosis," Proceedings IS&T/SPIE International Symposium on Electronic Imaging, 1993, vol. 1905, pp. 861–70.
[14] Chen, Y., Abraham, A., and Yang, B. (2006), "Feature Selection and Classification using Flexible Neural Tree," Journal of Neurocomputing, 70(1–3): 305–313.
[15] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann Publishers, 2000.
[16] Bishop, C.M., "Neural Networks for Pattern Recognition," Oxford University Press, New York, 1999.
[17] Vapnik, V.N., "The Nature of Statistical Learning Theory," 1st ed., Springer-Verlag, New York, 1995.
[18] Ross Quinlan (1993), "C4.5: Programs for Machine Learning," Morgan Kaufmann Publishers, San Mateo, CA.

Page 104: A Novel Approach for Breast Cancer Detection using Data Mining Techniques

Thank you
