1 lecture 5: automatic cluster detection lecture 6: artificial neural networks lecture 7: evaluation...
Post on 19-Dec-2015
1
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Brief introduction to lectures
Transparencies prepared by Ho Tu Bao [JAIST]
2
Lecture 5: Automatic Cluster Detection
•One of the most widely used KDD classification techniques for unsupervised data.
•Content of the lecture
1. Introduction
2. Partitioning Clustering
3. Hierarchical Clustering
4. Software and case-studies
•Prerequisite: Nothing special
3
Partitioning Clustering
A partition of a set of n objects $X = \{x_1, x_2, \ldots, x_n\}$ is a collection of K disjoint non-empty subsets $P_1, P_2, \ldots, P_K$ of X ($K \le n$), often called clusters, satisfying the following conditions:
(1) they are disjoint: $P_i \cap P_j = \emptyset$ for all $P_i$ and $P_j$, $i \ne j$;
(2) their union is X: $P_1 \cup P_2 \cup \ldots \cup P_K = X$.
Denote the partition $P = \{P_1, P_2, \ldots, P_K\}$; the $P_i$ are called components of P.
• Each cluster must contain at least one object
• Each object must belong to exactly one group
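The two conditions above can be checked mechanically. The following is a minimal sketch, assuming objects are represented by their indices 1..10; the function name `is_partition` is hypothetical and not part of the lecture.

```python
def is_partition(X, P):
    """Return True if the collection of sets P is a partition of the set X."""
    X = set(X)
    blocks = [set(block) for block in P]
    # Every component must be a non-empty subset of X.
    if any(not b or not b <= X for b in blocks):
        return False
    # Condition (2): the union of the components is X.
    union = set().union(*blocks) if blocks else set()
    # Condition (1): if the components are pairwise disjoint,
    # their sizes add up to exactly |X|.
    return union == X and sum(len(b) for b in blocks) == len(X)


X = range(1, 11)
print(is_partition(X, [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]))  # True
print(is_partition(X, [{1, 4, 7, 9, 10}, {2, 7}, {3, 5, 6}]))  # False
```

The second call fails because object 7 appears in two components and object 8 in none.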
4
Partitioning Clustering
What is a “good” partitioning clustering?
Key ideas: Objects in each group are similar, and objects in different groups are dissimilar.
Minimize the within-group distance and maximize the between-group distance.
Notice: There are many ways to define the “within-group distance” (the average distance to the group’s center, the average distance between all pairs of objects, etc.) and the “between-group distance”. In general, it is not feasible to find the optimal clustering.
$P = \{P_1, P_2, P_3\} = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$
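To make the two criteria concrete, here is a minimal sketch that uses one possible pair of definitions: within-group distance as the average distance from each object to its group's center, and between-group distance as the average distance between group centers. Both definitions, the function names, and the four sample points are assumptions chosen for illustration; the lecture leaves the choice open.

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def within_group_distance(clusters):
    """Average distance from each object to the center of its own group."""
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(dist(p, c) for p in pts)
        count += len(pts)
    return total / count


def between_group_distance(clusters):
    """Average distance between all pairs of group centers (needs >= 2 groups)."""
    centers = [centroid(pts) for pts in clusters]
    pairs = [(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


# Two ways of grouping the same four made-up points.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
print(within_group_distance(good), between_group_distance(good))  # 1.0 10.0
print(within_group_distance(bad), between_group_distance(bad))    # 5.0 2.0
```

Under these definitions the first grouping is better on both criteria: smaller within-group distance and larger between-group distance.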
5
Hierarchical Clustering
A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence.
Partition Q is nested into partition P if every component of Q is a subset of a component of P.
(This definition is for bottom-up hierarchical clustering. In the case of top-down hierarchical clustering, “next” becomes “previous”.)
$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$
$Q = \{\{x_1, x_4, x_9\}, \{x_7, x_{10}\}, \{x_2, x_8\}, \{x_5\}, \{x_3, x_6\}\}$
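The nesting relation translates directly into a subset test. A minimal sketch, assuming objects are represented by their indices and using the hypothetical function name `is_nested`:

```python
def is_nested(Q, P):
    """True if every component of Q is a subset of some component of P."""
    return all(any(set(q) <= set(p) for p in P) for q in Q)


P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]
print(is_nested(Q, P))  # True
print(is_nested(P, Q))  # False
```

Q is nested into P, but not the other way round, since the component {1, 4, 7, 9, 10} of P fits inside no component of Q.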
6
Bottom-up Hierarchical Clustering
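$x_1, x_2, x_3, x_4, x_5, x_6$
$\{x_1, x_2, x_3, x_4\}, \{x_5, x_6\}$
$\{x_1, x_2, x_3\}, \{x_4\}, \{x_5, x_6\}$
$\{x_1\}, \{x_2, x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
$\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
[Figure: dendrogram over the leaves x1 x2 x3 x4 x5 x6; clusters are merged from the bottom row upward]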
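A bottom-up procedure of this kind can be sketched as follows. The slide does not say how the distance between clusters is measured, so this sketch assumes single linkage (distance between the closest pair of objects), and the 1-D coordinates in `pos` are made up purely so that the merges pass through the partitions listed above.

```python
from itertools import combinations


def single_link(a, b, pos):
    """Single-linkage distance: the closest pair between two clusters."""
    return min(abs(pos[i] - pos[j]) for i in a for j in b)


def bottom_up(pos):
    """Return the sequence of partitions, from singletons to one cluster."""
    partition = [frozenset([name]) for name in pos]
    history = [list(partition)]
    while len(partition) > 1:
        # Find the two closest clusters and merge them.
        a, b = min(combinations(partition, 2),
                   key=lambda pair: single_link(pair[0], pair[1], pos))
        partition = [c for c in partition if c not in (a, b)] + [a | b]
        history.append(list(partition))
    return history


# Hypothetical 1-D positions for the six objects (not from the lecture).
pos = {"x1": 1.0, "x2": 2.0, "x3": 2.5, "x4": 4.5, "x5": 8.0, "x6": 9.2}
for level in bottom_up(pos):
    print(sorted(sorted(c) for c in level))
```

Each printed level is nested into the next one, which is exactly the definition of a hierarchical clustering given earlier. Because the loop merges one pair at a time, it also prints one intermediate partition that the slide skips.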
7
Top-Down Hierarchical Clustering
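$x_1, x_2, x_3, x_4, x_5, x_6$
$\{x_1, x_2, x_3, x_4\}, \{x_5, x_6\}$
$\{x_1, x_2, x_3\}, \{x_4\}, \{x_5, x_6\}$
$\{x_1\}, \{x_2, x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
$\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
[Figure: dendrogram over the leaves x1 x2 x3 x4 x5 x6; clusters are split from the top row downward]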
8
OSHAM: Hybrid Model
[Figure: OSHAM screenshot on the Wisconsin Breast Cancer data, with panels: Attributes, Brief Description of Concepts, Concept Hierarchy, Multiple Inheritance Concepts, Discovered Concepts]
9
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Brief introduction to lectures
10
Lecture 6: Neural networks
•One of the most widely used KDD classification techniques.
•Content of the lecture
1. Neural network representation
2. Feed-forward neural networks
3. Using back-propagation algorithm
4. Case-studies
•Prerequisite: Nothing special
12
Lecture 7: Evaluation of discovered knowledge
•One of the most widely used KDD classification techniques.
•Content of the lecture
1. Cross validation
2. Bootstrapping
3. Case-studies
•Prerequisite: Nothing special
13
Out-of-sample testing
[Figure: historical data (warehouse) → sampling method → sample data → sampling method → training data (2/3) and testing data (1/3); training data → induction method → model; model and testing data → error estimation → error]
The quality of the test-sample estimate depends on the number of test cases and on the validity of the independence assumption.
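The out-of-sample scheme can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function names, the toy data set, and the majority-class "induction method" are all assumptions. The standard error uses the usual binomial formula, which holds only if the test cases are independent, and it shrinks as the number of test cases grows.

```python
import math
import random


def holdout_estimate(data, induce, train_fraction=2 / 3, seed=0):
    """Split once into training/testing data and estimate the error rate."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    model = induce(train)                      # induction method -> model
    n_test = len(test)
    error = sum(model(x) != y for x, y in test) / n_test
    # Binomial standard error of the estimate (assumes independent test cases).
    std_err = math.sqrt(error * (1 - error) / n_test)
    return error, std_err


def induce_majority(train):
    """Hypothetical induction method: always predict the majority label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# Toy labelled data (made up): 30 cases, 2/3 for training, 1/3 for testing.
data = [(i, "pos" if i % 3 else "neg") for i in range(30)]
error, std_err = holdout_estimate(data, induce_majority)
print(f"error = {error:.2f} +/- {std_err:.2f}")
```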
14
Cross Validation
[Figure: historical data (warehouse) → sampling method → sample data → sampling method → Sample 1, Sample 2, ..., Sample n; iterate: induction method → model → error estimation → run's error; the runs' errors feed the final error estimation]
- The samples are mutually exclusive
- The samples are of equal size
10-fold cross validation appears adequate (n = 10)
15
Evaluation: k-fold cross validation (k=3)
Inputs: a data set and a method to be evaluated.
1. Randomly split the data set into 3 subsets of equal size.
2. Run the method on each pair of subsets as training data to find knowledge.
3. Test on the remaining subset as testing data to evaluate the accuracy.
4. Average the accuracies as the final evaluation.
[Figure: subsets 1, 2 and 3 rotated through the training and testing roles]
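The procedure on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy data set, and the majority-class method being evaluated are hypothetical and only stand in for "a data set" and "a method to be evaluated".

```python
import random


def k_fold_accuracy(data, induce, k=3, seed=0):
    """Average test accuracy over k train/test rounds."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    # k mutually exclusive subsets of (nearly) equal size.
    folds = [shuffled[i::k] for i in range(k)]
    accuracies = []
    for i, test in enumerate(folds):
        # Train on the other k-1 subsets, test on the remaining one.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = induce(train)
        correct = sum(model(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


def induce_majority(train):
    """Hypothetical method to be evaluated: predict the majority label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


# Toy labelled data (made up for illustration).
data = [(i, "pos" if i % 3 else "neg") for i in range(30)]
print(k_fold_accuracy(data, induce_majority, k=3))
```

Setting `k=10` gives the 10-fold variant; every object is used for testing exactly once, whatever the value of k.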