How to Use the Weka Tool


The University of Poonch, Data Mining, BS(CS) 6th semester

DATASET: Contact Lenses

WHAT IS WEKA?

Weka stands for Waikato Environment for Knowledge Analysis. It is a collection of machine learning algorithms for data mining tasks, and it contains tools for data pre-processing, classification, regression, and clustering.

HOW TO START WEKA

From the Windows desktop: click Start, choose All Programs, then choose Weka 3.7 to start Weka. The first interface window then appears.
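Alternatively, Weka can be started from a terminal, assuming Java is installed and weka.jar is in the current directory (the exact path depends on where Weka was installed):

    java -jar weka.jar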

EXPLORER

The Explorer is used for pre-processing, attribute selection, learning, and visualization.

When we select Explorer, the environment that opens is:

Now I click Open file... to open a data file from the folder where the data files are stored.

Then I select my dataset “CONTACT LENSES”

Every instance consists of a number of attributes.
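The same loading step can also be done through Weka's Java API. Below is a minimal sketch, assuming the dataset file ships with Weka as data/contact-lenses.arff (the class name LoadDataset is just a placeholder):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadDataset {
        public static void main(String[] args) throws Exception {
            // Load the ARFF file (the path is an assumption; adjust to your install)
            DataSource source = new DataSource("data/contact-lenses.arff");
            Instances data = source.getDataSet();
            // By convention the last attribute is the class (soft, hard, none)
            data.setClassIndex(data.numAttributes() - 1);
            System.out.println("Instances:  " + data.numInstances());   // 24
            System.out.println("Attributes: " + data.numAttributes());  // 5
        }
    }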

WHEN I CHOOSE CONTACT LENSES THIS ENVIRONMENT WILL OPEN

CHOOSE FILTER

First we choose a filter. There are two filter categories: supervised and unsupervised. We selected an unsupervised filter. Under unsupervised filters there are two options, instance and attribute; we selected attribute. There are many attribute filters, but we chose the one called NominalToBinary.

THEN THAT WINDOW OPENS
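The same filtering step in the Java API might look like the following sketch (again assuming the dataset path from above):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NominalToBinary;

    public class ApplyNominalToBinary {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/contact-lenses.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Unsupervised attribute filter: expands each nominal attribute
            // into binary 0/1 attributes
            NominalToBinary filter = new NominalToBinary();
            filter.setInputFormat(data);   // must be called before filtering
            Instances binary = Filter.useFilter(data, filter);

            System.out.println("Attributes after filtering: " + binary.numAttributes());
        }
    }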

NOW I CHOOSE CLASSIFY

First there is a simple classifier, ZeroR. It determines the most common class (or the median, in the case of numeric values) and tests how well the class can be predicted without considering the other attributes.
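A minimal API sketch of this baseline (the 10-fold cross-validation and the file path are assumptions):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ZeroRBaseline {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/contact-lenses.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // ZeroR always predicts the most common class ("none" in this dataset),
            // so its accuracy is a floor that real classifiers should beat
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }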

NOW I CHOOSE THE CLASSIFIER NAIVEBAYES

THERE ARE FOUR TEST OPTIONS

Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.

Supplied test set: The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.

Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data, which is held out for testing. The amount of data held out depends on the value entered in the % field.

Cross-validation (CV): The classifier is evaluated by cross-validation, using the number of folds entered in the Folds text field. With 10 folds, 90% of the full data is used for training (and 10% for testing) in each fold. Cross-validation produces a fair estimate of test performance. (A code sketch of all four test options follows this list.)
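As a rough sketch of how the four test options map onto Weka's Java API (the file names and the 66% split value are assumptions; 66% mirrors the GUI's usual default):

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TestOptions {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/contact-lenses.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // 1. Use training set: evaluate on the same instances we trained on
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(data);
            Evaluation trainEval = new Evaluation(data);
            trainEval.evaluateModel(nb, data);

            // 2. Supplied test set: load a separate file and evaluate on it
            //    ("my-test.arff" is a placeholder name)
            // Instances test2 = new DataSource("my-test.arff").getDataSet();
            // test2.setClassIndex(test2.numAttributes() - 1);
            // trainEval.evaluateModel(nb, test2);

            // 3. Percentage split: 66% for training, 34% held out for testing
            Instances shuffled = new Instances(data);
            shuffled.randomize(new Random(1));
            int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
            Instances train = new Instances(shuffled, 0, trainSize);
            Instances test = new Instances(shuffled, trainSize,
                    shuffled.numInstances() - trainSize);
            NaiveBayes nb2 = new NaiveBayes();
            nb2.buildClassifier(train);
            Evaluation splitEval = new Evaluation(train);
            splitEval.evaluateModel(nb2, test);

            // 4. Cross-validation: 10 folds -> each fold trains on 90%, tests on 10%
            Evaluation cvEval = new Evaluation(data);
            cvEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
            System.out.println(cvEval.toSummaryString());
        }
    }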

WHEN I APPLY THE NAÏVE BAYES CLASSIFIER WITH CROSS-VALIDATION, THE RESULTS ARE IN THIS FORM:

WHEN WE CHOOSE THE TRAINING SET, THE RESULT IS:

SUPPLIED TEST SET

When we choose the supplied test set option and supply the training data file itself, it gives the same result as the Use training set option: the results of the two are identical.

WHEN WE APPLY PERCENTAGE SPLIT, THE RESULT IS:
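The result screenshots are not reproduced in this transcript, but the per-class calculations on the following slides are all consistent with the confusion matrix below (rows are the actual classes, columns the predicted classes; this reconstruction is an assumption):

                     classified as
                     soft  hard  none
       actual soft     4     0     1
       actual hard     0     1     3
       actual none     1     2    12

On that assumption, 4 + 1 + 12 = 17 of the 24 instances (70.8%) lie on the diagonal, i.e. are correctly classified.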

TRUE POSITIVE (TP) RATE

The True Positive (TP) rate is the proportion of examples classified as class x among all examples that truly have class x, i.e. how much of the class was captured. It is equivalent to recall. In the confusion matrix, it is the diagonal element divided by the sum over the relevant row: 4/(4+0+1) = 0.8 for class soft, 1/(0+1+3) = 0.25 for class hard, and 12/(1+2+12) = 0.8 for class none in our example.

FALSE POSITIVE (FP) RATE

The False Positive (FP) rate is the proportion of examples classified as class x that actually belong to a different class, among all examples that are not of class x. In the matrix, it is the column sum of class x minus the diagonal element, divided by the row sums of all the other classes: (0+1)/(4+15) = 1/19 ≈ 0.053 for class soft, (0+2)/(5+15) = 2/20 = 0.1 for class hard, and (1+3)/(5+4) = 4/9 ≈ 0.444 for class none.

PRECISION

Precision is the proportion of examples that truly have class x among all those classified as class x. In the matrix, it is the diagonal element divided by the sum over the relevant column: 4/(4+0+1) = 0.8 for class soft, 1/(0+1+2) = 0.333 for class hard, and 12/(1+3+12) = 0.75 for class none.

F-MEASURE

F-measure = 2 * Precision * Recall / (Precision + Recall), a combined measure of precision and recall: (2*0.8*0.8)/(0.8+0.8) = 0.8 for class soft, (2*0.333*0.25)/(0.333+0.25) = 0.286 for class hard, and (2*0.75*0.8)/(0.75+0.8) = 0.774 for class none.
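Rather than computing these figures by hand, they can be read off Weka's Evaluation object. A minimal sketch, reusing the 10-fold cross-validation run assumed earlier:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PerClassMetrics {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("data/contact-lenses.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

            // One line per class: the same metrics derived by hand above
            for (int i = 0; i < data.numClasses(); i++) {
                System.out.printf(
                        "%-5s TP=%.3f FP=%.3f precision=%.3f recall=%.3f F=%.3f AUC=%.3f%n",
                        data.classAttribute().value(i),
                        eval.truePositiveRate(i), eval.falsePositiveRate(i),
                        eval.precision(i), eval.recall(i),
                        eval.fMeasure(i), eval.areaUnderROC(i));
            }
            System.out.println(eval.toMatrixString()); // prints the confusion matrix
        }
    }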

ROC (RECEIVER OPERATING CHARACTERISTIC) AND RECALL

Accuracy can also be measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test. A rough guide for classifying the accuracy of a diagnostic test is the traditional academic point system, e.g. 0.90-1 = excellent (A).

Recall: the proportion of the relevant documents that are actually retrieved by a query. It is equivalent to the TP rate.

I can change the number of folds in cross-validation.

If I change the folds from 10 to 5, then each fold uses 80% of the data for training (and 20% for testing).