
Classification using Weka (Brain, Computation, and Neural Learning)

Jung-Woo Ha


Agenda

Classification

General Concept

Terminology

Introduction to Weka

Classification practice with Weka

Problems: Pima Indians diabetes, handwritten digit recognition

Algorithms: Neural Networks, Decision Trees, Support Vector Machines

Evaluation criteria

Using Experimenter for batch experiments

Building committee machine

Mini-project

(C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Machine Classification

Sorting fish on a conveyor belt: salmon vs. sea bass

Set up a camera, take images, and use physical differences (length, lightness, width, fin shape, mouth position, etc.) as features to explore.


Concept of Classification

<Notations>

n = # training examples

x = “input” variables (features or attributes)

y = “output” variable / “target” variable

(x, y) – training example

The i-th training example = (x(i), y(i))

Training Set → Learning Algorithm → h (hypothesis)

Input features → h → Output / prediction

e.g. pixels in a picture of a handwritten digit → ‘3’ or ‘8’

$f(\mathbf{x}) = w_0 + w_1 x_1 + \cdots + w_n x_n$
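A hypothesis of the linear form f(x) = w_0 + w_1 x_1 + ... + w_n x_n can be evaluated directly. The following is a minimal sketch, assuming the weight list stores the bias w_0 first; the function names and the decision threshold of 0 are illustrative assumptions, not part of the slides.

```python
def linear_hypothesis(weights, x):
    """Return w_0 + w_1*x_1 + ... + w_n*x_n, where weights[0] is the bias w_0."""
    if len(weights) != len(x) + 1:
        raise ValueError("need one weight per feature plus a bias term")
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))


def classify(weights, x, threshold=0.0):
    """Map the real-valued score to a class label (threshold is an assumption)."""
    return 1 if linear_hypothesis(weights, x) > threshold else 0


# Two features: score = 0.5 + 2.0 * 1.0 + (-1.0) * 3.0
score = linear_hypothesis([0.5, 2.0, -1.0], [1.0, 3.0])
label = classify([0.5, 2.0, -1.0], [1.0, 3.0])
print(score, label)
```

A learning algorithm's job is then to choose the weights from the training set; the sketch only shows how a fixed hypothesis produces a prediction.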


Terminology

Features or Attributes

Features are the individual measurable properties of the phenomena being observed

Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification

Training set / Test set

Training set: a set of examples used for learning, that is, to fit the parameters (i.e., weights) of the classifier

Test set: A set of examples used only to assess the performance [generalization] of a fully-specified classifier
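The training/test distinction can be illustrated with a simple random hold-out split. This is a minimal sketch: the helper name `train_test_split`, the 80/20 ratio, and the fixed seed are assumptions chosen for illustration, not something the slides prescribe.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The test part is never used for fitting, only for the final assessment.
    return shuffled[n_test:], shuffled[:n_test]


train, test = train_test_split(range(1000), test_fraction=0.2)
print(len(train), len(test))
```

Keeping the two sets disjoint is the point: performance measured on examples the classifier has already seen says little about generalization.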


Introduction to Weka

Weka: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining tasks

What can you do with Weka?

data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization

Weka is open-source software released under the GNU General Public License

How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just search for ‘Weka’ on Google.


Dataset #1: Pima Indians Diabetes

Description

Pima Indians have the highest prevalence of diabetes in the world

We will build classification models that diagnose if the patient shows signs of diabetes

http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Configuration of the data set

768 instances

8 attributes

age, number of times pregnant, results of medical tests/analysis

all numeric (integer or real-valued)

Also, a discretized set will be provided

Class value = 1 (positive example)

Interpreted as "tested positive for diabetes"

268 instances

Class value = 0 (negative example)

500 instances


Dataset #2: Handwritten Digits (MNIST)

Description

The MNIST database of handwritten digits contains digits written by office workers and students

We will build a recognition model based on classifiers with the reduced set of MNIST

http://yann.lecun.com/exdb/mnist/

Configuration of the data set

Attributes

pixel values in gray level in a 28x28 image

784 attributes (all 0~255 integer)

Full MNIST set

Training set: 60,000 examples

Test set: 10,000 examples

For our practice, a reduced set is used (800 training examples and 200 test examples)

Class value: 0~9, which represent digits from 0 to 9


Artificial Neural Networks

MLP (Multilayer Perceptron)

In Weka, Classifiers-functions-MultilayerPerceptron

Artificial Neural Networks

Review of the backpropagation (BP) algorithm: four main parameters for learning MLPs

The number of iterations

Learning rate

Momentum

The number of hidden layers and hidden nodes
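To make the roles of the iteration count, learning rate, and momentum concrete, here is a minimal sketch of a gradient-descent weight update with a momentum term, applied to a single weight on a toy error surface. It is not a full backpropagation implementation and does not model hidden layers; the function name, the toy error function, and the parameter values are illustrative assumptions.

```python
def gradient_descent_with_momentum(gradient, w_start, learning_rate, momentum, iterations):
    """Run `iterations` momentum updates on a single weight and return it."""
    w = w_start
    step = 0.0  # previous weight change, reused by the momentum term
    for _ in range(iterations):
        # New change = a fraction of the previous change minus the scaled gradient.
        step = momentum * step - learning_rate * gradient(w)
        w += step
    return w


# Toy error surface E(w) = (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
w_final = gradient_descent_with_momentum(
    gradient=lambda w: 2.0 * (w - 3.0),
    w_start=0.0,
    learning_rate=0.3,
    momentum=0.2,
    iterations=500,
)
print(w_final)
```

In a real MLP the same update is applied to every weight, with the gradient supplied by backpropagation. A learning rate that is too large makes the updates overshoot, while too few iterations stop training before the error has settled.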

Review of MLPs

Expressive power of MLPs

Decision Trees

J48 (Java implementation of C4.5)

In Weka, classifiers-trees-J48


Support Vector Machines

SMO (sequential minimal optimization) for training SVM

In Weka, classifiers-functions-SMO


Practice

Basic

Comparing the performances of algorithms

MultilayerPerceptron vs. J48 vs. SVM

Checking the trained model (structure & parameter)

Tuning parameters to get better models

Understanding ‘Test options’ & ‘Classifier output’ in Weka

Advanced

Building committee machines using ‘meta’ algorithms for classification

Preprocessing / data manipulation – applying ‘Filter’

Batch experiment with ‘Experimenter’

Design & run a batch process with ‘KnowledgeFlow’

Dataset for Practice with Weka

Pima Indians diabetes

Original data: pima_diabetes.arff

Discretized data: pima_diabetes_supervised_discretized.arff

Handwritten Digit (MNIST)

Training/test pair

mnist_reduced_training.arff, mnist_reduced_test.arff

800 & 200 instances, respectively

Total set (1,000 instances)

mnist_reduced_total.arff

Can be used for cross-validation


Data format for Weka (.ARFF)

% Header
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

% Data (CSV format)
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present

Note: You can easily generate an ‘arff’ file by adding a header to an ordinary CSV text file
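Adding an ARFF header to CSV text can also be scripted. The sketch below assumes that the first CSV row holds the column names, that no value needs quoting, and that the caller supplies the type of each attribute; the helper name `csv_to_arff` and the shortened sample data are illustrative assumptions.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attribute_types):
    """Build ARFF text by putting a header in front of plain CSV rows."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    names, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for name in names:
        # attribute_types maps a column name to "numeric" or a "{a,b,...}" value set
        lines.append("@attribute {} {}".format(name, attribute_types[name]))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


csv_text = "age,sex,cholesterol,class\n63,male,233,not_present\n67,male,286,present\n"
types = {
    "age": "numeric",
    "sex": "{female,male}",
    "cholesterol": "numeric",
    "class": "{present,not_present}",
}
arff = csv_to_arff(csv_text, "heart-disease-simplified", types)
print(arff)
```

Missing values can stay as `?` in the data rows, as in the example above, since the rows are copied through unchanged.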


Neural Networks in Weka

• Load a file that contains the training data by clicking the ‘Open file’ button

• ‘ARFF’ or ‘CSV’ formats are readable

• Click the ‘Classify’ tab

• Click the ‘Choose’ button

• Select ‘weka – classifiers – functions – MultilayerPerceptron’

• Click ‘MultilayerPerceptron’

• Set parameters for MLP

• Set parameters for Test

• Click ‘Start’ for learning

Some Notes on the Parameter Setting

Parameter Setting = Car Tuning

It takes experience or many trials

You may get worse results if you are unlucky

Multilayer Perceptron (MLP)

Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed

J48

Main parameters: unpruned, numFolds, minNumObj

Many parameters control the size of the resulting tree, e.g. confidenceFactor and the pruning options

SMO (SVM)

Main parameters: c (complexity parameter), kernel, kernel parameters

Test Options and Classifier Output

There are various metrics for evaluation

Setting the data set used for evaluation

How to Evaluate the Performance? (1/2)

Usually, build a confusion matrix from the given data

Evaluation metrics

Accuracy (percent correct)

Precision

Recall

Many other metrics: F-measure, Kappa score, etc.

For fair evaluation, the ‘cross-validation’ scheme is used


How to Evaluate the Performance? (2/2)

Confusion Matrix

                       Real Positive       Real Negative
Prediction Positive    TP                  FP                    All with positive test
Prediction Negative    FN                  TN                    All with negative test
                       All with disease    All without disease   Everyone

$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$

$\text{Recall} = \frac{TP}{TP + FN}$

$\text{Precision} = \frac{TP}{TP + FP}$

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
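Accuracy, recall, and precision follow directly from the four confusion-matrix counts. The sketch below tallies TP, FP, FN, and TN from two label lists and applies the standard definitions; the function names and the tiny example labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN by comparing predicted labels with the real ones."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, really negative
        elif a == positive:
            fn += 1  # predicted negative, really positive
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
print(accuracy(tp, fp, fn, tn), recall(tp, fn), precision(tp, fp))
```

Returning 0.0 when a denominator is zero is a convention chosen here for simplicity; other tools may report such cases differently.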


Evaluation Method - Cross Validation

K-fold Cross Validation

The data set is randomly divided into k subsets.

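One of the k subsets is used as the ‘test set’ and the other k-1 subsets are put together to form a ‘training set’. This is repeated k times so that each subset serves as the test set exactly once, and the k error estimates are averaged.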

6-fold cross validation (each subset D1 to D6 holds 128 instances)

Training set (5 × 128)    Test set (128)
D1 D2 D3 D4 D5            D6
D1 D2 D3 D4 D6            D5
...
D2 D3 D4 D5 D6            D1

$\text{Error} = \frac{1}{k} \sum_{i=1}^{k} \text{Error}_i$
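The k-fold procedure can be sketched in a few lines: shuffle the example indices, cut them into k folds, use each fold once as the test set, and average the k error values. The names `k_fold_splits` and `cross_validation_error`, and the `error_on_fold` callback that would train and evaluate a classifier, are hypothetical and introduced only for this illustration.

```python
import random


def k_fold_splits(n_examples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # random division into k subsets
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The other k-1 folds are merged to form the training set.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_validation_error(n_examples, k, error_on_fold, seed=0):
    """Average the per-fold errors returned by the error_on_fold callback."""
    errors = [error_on_fold(train, test)
              for train, test in k_fold_splits(n_examples, k, seed)]
    return sum(errors) / k


# 768 examples and k = 6 give six folds of 128 test examples each.
fold_sizes = [(len(train), len(test)) for train, test in k_fold_splits(768, 6)]
print(fold_sizes)
```

In practice `error_on_fold` would fit a classifier on the training indices and return its error rate on the test indices; the sketch leaves that part to the caller.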


Committee Machine in Weka

Using committee machines / ensemble learning in Weka

Boosting: AdaBoostM1

Voting committee: Vote

Bagging
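The idea behind a voting committee can be shown with plain majority voting over the predictions of several member classifiers. This is only a sketch of one possible combination rule, not a description of how Weka's Vote, Bagging, or AdaBoostM1 are implemented; the member classifiers here are hypothetical threshold rules.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


def committee_predict(classifiers, x):
    """Ask every committee member for a label and combine them by majority."""
    return majority_vote([classify(x) for classify in classifiers])


# Three hypothetical members, each a simple threshold rule on one feature.
committee = [
    lambda x: "positive" if x > 1.0 else "negative",
    lambda x: "positive" if x > 2.0 else "negative",
    lambda x: "positive" if x > 3.0 else "negative",
]
print(committee_predict(committee, 2.5))
```

Using an odd number of members avoids ties in two-class problems. Bagging and boosting differ mainly in how the members are trained, not in the fact that their outputs are combined.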


Data Manipulation with Filter in Weka

Attribute: selection, discretization

Instance: re-sampling, selecting specified folds

Using Experimenter in Weka

Tool for ‘Batch’ experiments

• Click ‘New’

• Set experiment type / iteration control

• Set datasets / algorithms

• Select the ‘Run’ tab and click ‘Start’

• If it has finished successfully, click the ‘Analyse’ tab and see the summary

KnowledgeFlow for Analysis Process Design

(‘Process Flow Diagram’ of SAS® Enterprise Miner)

References

Weka Wiki: http://weka.wikispaces.com/

Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

Textbooks

Tom Mitchell (1997) Machine Learning, McGraw Hill

Christopher M. Bishop (2006) Pattern Recognition and Machine Learning, Springer

Richard O. Duda, Peter E. Hart, David G. Stork (2001) Pattern classification (2nd edition), Wiley, New York


Mini-project

Make an arff file

Make a csv file with MS Excel.

Open the csv file with Weka

Save the csv file as an arff file

Change the type of the ‘class’ attribute to a discrete (nominal) value set with any text editor

Save the arff file

Reload the arff file with Weka
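The manual edit of the ‘class’ attribute can also be scripted. The sketch below rewrites the declaration of one attribute in ARFF text so that it lists a discrete value set; the helper name `make_class_nominal` and the sample file content are assumptions for illustration, and the sketch assumes the attribute name is written without quotes.

```python
import re


def make_class_nominal(arff_text, labels, attribute="class"):
    """Replace the declared type of `attribute` with a nominal value set."""
    pattern = re.compile(
        r"^(@attribute\s+" + re.escape(attribute) + r")\s+\S.*$",
        re.IGNORECASE | re.MULTILINE,
    )
    nominal = "{" + ",".join(labels) + "}"
    # Only the first matching declaration is rewritten; data rows are untouched.
    return pattern.sub(lambda m: m.group(1) + " " + nominal, arff_text, count=1)


original = (
    "@relation demo\n"
    "@attribute x numeric\n"
    "@attribute class numeric\n"
    "@data\n"
    "1.5,0\n"
    "2.5,1\n"
)
converted = make_class_nominal(original, ["0", "1"])
print(converted)
```

The labels passed in must cover every class value that occurs in the data rows, otherwise the file will not load correctly.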



Mini-project

Parameter setting of MLPs

More explanations on the parameters


Mini-project

Build an MLP yourself with the GUI option

You can construct the hidden layers yourself.

Click the ‘More’ button to see a detailed explanation of the GUI option.

Mini-project

J48


Mini-project

Experiments

Convenient comparisons on data and methods


Mini-project

Classification problem with Weka

Data set

3 different data sets

You should include at least one set from the UCI ML repository (http://archive.ics.uci.edu/ml/) and the MNIST set

Classification methods

MLP: iterations, learning rate, momentum, # of hidden nodes

SVM: will be addressed next time

J48: Default options only


Mini term-project

Contents in the report

You should

compare the results of various parameter settings for MLPs

find the optimal parameter setting for the MLP and report the classification performance of that setting on all data sets

compare the best MLP result with the J48 result on the three data sets (classification performance and time)

include discussions

At most four A4 pages

Due date: 24th Nov. 2011 (302-314-1)