
1

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

Transparencies prepared by Ho Tu Bao [JAIST]

2

Lecture 5: Automatic Cluster Detection

•One of the most widely used KDD techniques; it discovers groups in unsupervised (unlabeled) data.

•Content of the lecture

1. Introduction
2. Partitioning Clustering
3. Hierarchical Clustering
4. Software and case-studies

•Prerequisite: Nothing special

3

Partitioning Clustering

A partition of a set of n objects X = {x1, ..., xn} is a collection of K disjoint non-empty subsets P1, ..., PK of X (K ≤ n), often called clusters, satisfying the following conditions:

(1) they are disjoint: Pi ∩ Pj = ∅ for all Pi and Pj, i ≠ j

(2) their union is X: P1 ∪ P2 ∪ ... ∪ PK = X

• Each cluster must contain at least one object
• Each object must belong to exactly one group

Denote the partition P = {P1, P2, ..., PK}; P1, ..., PK are called the components of P.
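The two conditions can be verified directly. A minimal Python sketch (the function name and the example clusters are illustrative, not from the lecture):

```python
def is_partition(subsets, universe):
    """Check that `subsets` forms a partition of `universe`:
    non-empty, pairwise disjoint, and covering all of `universe`."""
    seen = set()
    for s in subsets:
        s = set(s)
        if not s:         # each cluster must contain at least one object
            return False
        if seen & s:      # condition (1): clusters must be pairwise disjoint
            return False
        seen |= s
    return seen == set(universe)  # condition (2): the union equals X

# Illustrative clusters over ten objects (here represented by their indices):
X = set(range(1, 11))
P = [{1, 4, 7, 9, 10}, {2, 3, 8}, {5, 6}]
# is_partition(P, X) -> True
```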

4

Partitioning Clustering

What is a “good” partitioning clustering?

Key ideas: Objects in each group are similar and objects between different groups are dissimilar.

Minimize the within-group distance and maximize the between-group distance.

Notice: there are many ways to define the “within-group distance” (e.g., the average distance to the group’s center, or the average distance between all pairs of objects) and the “between-group distance”. In general, finding the globally optimal clustering is computationally infeasible.
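Both notions of within-group distance mentioned above can be sketched in a few lines of Python, assuming Euclidean distance between points (function names are illustrative):

```python
from math import dist  # Euclidean distance between two points

def within_center(group):
    """Average distance of the group's points to its centroid."""
    d = len(group[0])
    center = [sum(p[i] for p in group) / len(group) for i in range(d)]
    return sum(dist(p, center) for p in group) / len(group)

def within_pairs(group):
    """Average distance over all pairs of points in the group."""
    pairs = [(p, q) for i, p in enumerate(group) for q in group[i + 1:]]
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# For the two points (0,0) and (2,0): centroid distance averages to 1.0,
# while the average pairwise distance is 2.0.
```

The two definitions generally give different values, which is one reason the “optimal” clustering is not uniquely defined.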

Example: P = {P1, P2, P3} with P1 = {x1, x4, x7, x9, x10}, P2 = {x2, x3, x8}, P3 = {x5, x6}

5

Hierarchical Clustering

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.

Partition Q is nested into partition P if every component of Q is a subset of a component of P.

(This definition is for bottom-up hierarchical clustering. In case of top-down hierarchical clustering, “next” becomes “previous”).

Example: P = {x1, x4, x7, x9, x10}, {x2, x3, x8}, {x5, x6} and Q = {x1, x4, x9}, {x7, x10}, {x2, x8}, {x3}, {x5, x6} (Q is nested into P).
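The nesting test is a direct translation of this definition. A minimal Python sketch (the example components mirror the slide's P and Q, with objects represented by their indices):

```python
def is_nested(Q, P):
    """Partition Q is nested into partition P if every component of Q
    is a subset of some component of P."""
    return all(any(set(q) <= set(p) for p in P) for q in Q)

# Components of the two example partitions:
P = [{1, 4, 7, 9, 10}, {2, 3, 8}, {5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {3}, {5, 6}]
# is_nested(Q, P) -> True; is_nested(P, Q) -> False
```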

6

Bottom-up Hierarchical Clustering

{x1, x2, x3, x4, x5, x6}

{x1, x2, x3, x4}, {x5, x6}

{x1, x2, x3}, {x4}, {x5, x6}

{x1}, {x2, x3}, {x4}, {x5}, {x6}

{x1}, {x2}, {x3}, {x4}, {x5}, {x6}

[Dendrogram over the objects x1 x2 x3 x4 x5 x6: bottom-up clustering starts from the singletons and merges clusters upward to the single cluster]
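The bottom-up process sketched above can be implemented generically. A minimal Python sketch using single linkage as the merge criterion (single linkage is one common choice; the slide does not fix one):

```python
from math import dist

def single_link(a, b, points):
    """Single-linkage distance: the smallest distance between a point in
    cluster a and a point in cluster b (clusters hold point indices)."""
    return min(dist(points[i], points[j]) for i in a for j in b)

def bottom_up(points):
    """Agglomerative clustering: start from singleton clusters and repeatedly
    merge the two closest clusters, recording the partition at every level."""
    clusters = [frozenset([i]) for i in range(len(points))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # pick the pair of clusters with the smallest single-linkage distance
        a, b = min(((c, d) for i, c in enumerate(clusters)
                    for d in clusters[i + 1:]),
                   key=lambda pair: single_link(pair[0], pair[1], points))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        levels.append(list(clusters))
    return levels
```

Each entry of the returned list is one level of the dendrogram, and each level is nested into the next, matching the definition of hierarchical clustering above.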

7

Top-Down Hierarchical Clustering

{x1, x2, x3, x4, x5, x6}

{x1, x2, x3, x4}, {x5, x6}

{x1, x2, x3}, {x4}, {x5, x6}

{x1}, {x2, x3}, {x4}, {x5}, {x6}

{x1}, {x2}, {x3}, {x4}, {x5}, {x6}

[The same dendrogram over x1 x2 x3 x4 x5 x6, read top-down: clusters are split from the single cluster downward to the singletons]

8

OSHAM: Hybrid Model

[Screenshot: OSHAM applied to the Wisconsin Breast Cancer Data, showing the attributes, the discovered concept hierarchy, concepts with multiple inheritance, and a brief description of the discovered concepts]

9

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

10

Lecture 6: Neural networks

•One of the most widely used KDD classification techniques.

•Content of the lecture

1. Neural network representation
2. Feed-forward neural networks
3. Using back-propagation algorithm
4. Case-studies

•Prerequisite: Nothing special

11

Lecture 1: Overview of KDD

Lecture 2: Preparing data

Lecture 3: Decision tree induction

Lecture 4: Mining association rules

Lecture 5: Automatic cluster detection

Lecture 6: Artificial neural networks

Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

12

Lecture 7: Evaluation of discovered knowledge

•Techniques for estimating how well discovered knowledge will perform on unseen data.

•Content of the lecture

1. Cross validation
2. Bootstrapping
3. Case-studies

•Prerequisite: Nothing special

13

Out-of-sample testing

[Diagram: Historical data (warehouse) → sampling method → sample data. A second sampling step splits the sample into training data (2/3) and testing data (1/3). The induction method builds a model from the training data; error estimation on the testing data yields the error.]

The quality of the test-sample estimate depends on the number of test cases and on the validity of the independence assumption.
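The out-of-sample flow above reduces to a random 2/3 : 1/3 split followed by error estimation. A minimal Python sketch (function names are illustrative, not from the lecture):

```python
import random

def holdout_split(data, train_frac=2 / 3, seed=0):
    """Randomly split `data` into training (2/3) and testing (1/3) sets,
    mirroring the sampling step in the diagram."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def error_rate(model, test_set):
    """Error estimation: the fraction of test cases (x, y) that the
    model misclassifies."""
    return sum(model(x) != y for x, y in test_set) / len(test_set)
```

The independence assumption in the note above corresponds to the test cases being drawn independently of the training cases, which the random split approximates.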

14

Cross Validation

[Diagram: Historical data (warehouse) → sampling method → sample data, which is split into n mutually exclusive samples of equal size (Sample 1, Sample 2, ..., Sample n). On each of n iterations, the induction method builds a model from n − 1 samples, and error estimation on the held-out sample gives that run’s error; the run errors are combined into the overall error estimate.]

10-fold cross validation appears adequate (n = 10)

15

Evaluation: k-fold cross validation (k = 3)

Given a data set and a method to be evaluated:

- randomly split the data set into 3 subsets of equal size
- run the method on each pair of 2 subsets as training data to find knowledge
- test on the remaining subset as testing data to evaluate the accuracy
- average the accuracies as the final evaluation

Run 1: train on subsets 1 and 2, test on subset 3
Run 2: train on subsets 1 and 3, test on subset 2
Run 3: train on subsets 2 and 3, test on subset 1
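The three runs above generalize to any k. A minimal Python sketch of k-fold cross validation (the `fit` and `accuracy` callables stand in for the induction method and error estimation; all names are illustrative):

```python
import random

def k_fold_accuracy(data, fit, accuracy, k=3, seed=0):
    """k-fold cross validation: split the data into k equal-size,
    mutually exclusive folds, train on k-1 folds, test on the held-out
    fold, and average the k accuracies."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]     # k disjoint folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(train)
        scores.append(accuracy(model, held_out))
    return sum(scores) / k
```

With k = 3 this performs exactly the three runs shown above; with k = 10 it is the 10-fold scheme recommended on the previous slide.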

16

Outline of the presentation

• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

This presentation summarizes the content and organization of lectures in the module “Knowledge Discovery and Data Mining”.