Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 6
Neural Network Learning
OUTLINE
Objectives
- introduce biologically inspired NN learning methods for clustering, regression and classification
- explain similarities and differences between statistical and NN methods
- show examples using synthetic and real-life data
Brief history and motivation for artificial neural networks
Sequential estimation of model parameters
Methods for supervised learning
Methods for unsupervised learning
Summary and discussion
Brief history and motivation for ANN
Huge interest in understanding the nature and mechanism of biological/human learning
Biologists + psychologists do not adopt classical parametric statistical learning, because:
- parametric modeling is not biologically plausible
- biological info processing is clearly different from algorithmic models of computation
Mid 1980s: growing interest in applying biologically inspired computational models to:
- developing computer models (of the human brain)
- various engineering applications
New field: Artificial Neural Networks (~1986-1987)
ANNs represent nonlinear estimators implementing the ERM approach (usually with squared-loss function)
History and motivation (cont'd)
Relationship to the problem of inductive learning:
The same learning problem setting
Neural-style learning algorithm:
- on-line (flow through)
- simple processing
Biological terminology
[Figure: the standard inductive-learning diagram - Generator of samples, System, Learning Machine, with input x and output y]

Hebbian rule (at a synapse): \Delta w \sim x \cdot y
Neural vs Algorithmic computation
Biological systems do not use principles of digital circuits:

Property             Digital        Biological
Connectivity         1~10           ~10,000
Signal               digital        analog
Timing               synchronous    asynchronous
Signal propagation   feedforward    feedback
Redundancy           no             yes
Parallel processing  no             yes
Learning             no             yes
Noise tolerance      no             yes
Neural vs Algorithmic computation
Computers excel at algorithmic tasks (well-posed mathematical problems)
Biological systems are superior to digital systems for ill-posed problems with noisy data
Example: object recognition [Hopfield, 1987]
PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons
-> 2 x 10^13 bit operations per sec
OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity = 2
-> 10 x 10^14 bit operations per sec
Both have similar raw processing capability, but pigeons are better at recognition tasks
Neural terminology and artificial neurons
Some general descriptions of ANNs:
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
http://en.wikipedia.org/wiki/Neural_network
McCulloch-Pitts neuron (1943)
Threshold (indicator) function of weighted sum of inputs
Goals of ANNs
Develop models of computation inspired by biological systems
Study computational capabilities of networks of interconnected neurons
Apply these models to real-life applications
Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs
Historical highlights of ANN
1943    McCulloch-Pitts neuron
1949    Hebbian learning
1960s   Rosenblatt (perceptron), Widrow
60s-70s dominance of hard AI
1980s   resurgence of interest (PDP group, MLP etc.)
1990s   connection to statistics/VC-theory
2000s   mature field/unnecessary fragmentation
OUTLINE
Objectives
Brief history and motivation for artificial neural networks
Sequential estimation of model parameters
Methods for supervised learning
Methods for unsupervised learning
Summary and discussion
Sequential estimation of model parameters
Batch vs on-line (iterative) learning
- Algorithmic (statistical) approaches ~ batch
- Neural-network inspired methods ~ on-line
BUT the difference is only on the implementation level (so both types of learning methods should yield similar generalization performance)
Recall ERM inductive principle (for regression):
Assume dictionary parameterization with fixed basis fcts
R_{emp}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i, \mathbf{w})) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i, \mathbf{w}))^2

f(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{m} w_j g_j(\mathbf{x})
Sequential (on-line) least squares minimization
Training pairs (x(k), y(k)) presented sequentially

On-line update equations for minimizing empirical risk (MSE) with respect to parameters w
(gradient descent learning):

w_j(k+1) = w_j(k) - \gamma_k \frac{\partial L(\mathbf{x}(k), y(k), \mathbf{w})}{\partial w_j}

where the gradient is computed via the chain rule:

\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w_j} = 2 (\hat{y} - y) g_j(\mathbf{x})

The learning rate \gamma_k is a small positive value (decreasing with k).
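To make the update concrete, here is a minimal Python sketch of this on-line (delta-rule) gradient descent for a linear-in-parameters model; the polynomial basis functions, the synthetic data, and the 1/k learning-rate schedule are illustrative assumptions, not settings from the lecture.

```python
import numpy as np

# Assumed basis functions g_j(x): simple polynomial terms 1, x, x^2
def basis(x):
    return np.array([1.0, x, x**2])

rng = np.random.default_rng(0)
x_data = rng.uniform(0, 1, 200)
y_data = 0.5 + 2.0 * x_data - 1.5 * x_data**2 + rng.normal(0, 0.1, 200)

w = np.zeros(3)                       # parameters w_j, start at zero
for k, (x, y) in enumerate(zip(x_data, y_data), start=1):
    g = basis(x)
    y_hat = w @ g                     # current prediction f(x, w)
    gamma = 1.0 / k                   # decreasing learning rate (assumed schedule)
    # chain rule: dL/dw_j = 2 (y_hat - y) g_j(x)
    w -= gamma * 2.0 * (y_hat - y) * g

print("estimated w:", w)
```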
Neural network interpretation of delta rule
[Figure: forward pass and backward pass through a single neuron with inputs 1, z_1(k), ..., z_m(k) and weights w_0(k), w_1(k), ..., w_m(k)]

Forward pass:  \hat{y}(k) = \sum_{j=0}^{m} w_j(k) z_j(k)
Backward pass: \delta(k) = \hat{y}(k) - y(k), \quad w_j(k+1) = w_j(k) - \gamma_k \delta(k) z_j(k)
Biological learning
Hebbian rule (at a synapse): \Delta w \sim x \cdot y
Theoretical basis for on-line learning

Standard inductive learning: given training data z_1, ..., z_n, find the model providing the minimum of prediction risk

R(\omega) = \int L(\mathbf{z}, \omega) \, p(\mathbf{z}) \, d\mathbf{z}

Stochastic approximation guarantees minimization of risk (asymptotically):

\omega(k+1) = \omega(k) - \gamma_k \, \mathrm{grad} \, L(\mathbf{z}_k, \omega(k))

under general conditions on the learning rate:

\lim_{k \to \infty} \gamma_k = 0, \quad \sum_k \gamma_k = \infty, \quad \sum_k \gamma_k^2 < \infty
Practical issues for on-line learning

Given a finite training set (n samples) z_1, ..., z_n: this set is presented sequentially to a learning algorithm many times. Each presentation of n samples is called an epoch, and the process of repeated presentations is called recycling (of training data).

Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically good learning rate schedules are data-dependent.

Stopping conditions:
(1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
(2) early stopping can be used for complexity control
OUTLINE
Objectives
Brief history and motivation for artificial neural networks
Sequential estimation of model parameters
Methods for supervised learning
- MultiLayer Perceptron (MLP) networks
- Radial Basis Function (RBF) networks
Methods for unsupervised learning
Summary and discussion
Multilayer Perceptrons (MLP)
Recall graphical NN representation for dictionary methods:

[Figure: network with inputs x_1, x_2, ..., x_d, hidden units z_1, z_2, ..., z_m and output \hat{y}; W is m x 1, V is d x m]

\hat{y} = \sum_{j=1}^{m} w_j z_j \quad \text{where} \quad z_j = g(\mathbf{x}, \mathbf{v}_j)

g(\mathbf{x}, \mathbf{v}_i) = s\left( v_{i0} + \sum_{k=1}^{d} x_k v_{ik} \right) = s(\mathbf{x} \cdot \mathbf{v}_i)

with activation s(t) = \frac{1}{1 + \exp(-t)} \quad \text{or} \quad s(t) = \tanh(t) = \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t)}

How to estimate parameters (weights) via ERM?
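The following short Python sketch shows the forward pass implied by this parameterization (tanh hidden units, linear output); the dimensions, random weights, and sample input are assumptions for illustration.

```python
import numpy as np

def mlp_forward(x, V, v0, w, w0):
    """Forward pass: z_j = s(x . v_j + v_j0), y_hat = w . z + w_0."""
    z = np.tanh(x @ V + v0)      # hidden-unit outputs, shape (m,)
    return w @ z + w0            # scalar output

d, m = 3, 5                      # input dimension and number of hidden units (assumed)
rng = np.random.default_rng(1)
V = rng.normal(0, 0.1, (d, m))   # input-to-hidden weights (d x m)
v0 = np.zeros(m)                 # hidden biases
w = rng.normal(0, 0.1, m)        # hidden-to-output weights
w0 = 0.0

x = rng.normal(size=d)
print("prediction:", mlp_forward(x, V, v0, w, w0))
```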
Learning for a single neuron (delta rule):
[Figure: forward and backward pass for a single neuron with inputs 1, z_1(k), ..., z_m(k) and weights w_0(k), w_1(k), ..., w_m(k)]

Forward pass:  \hat{y}(k) = \sum_{j=0}^{m} w_j(k) z_j(k)
Backward pass: \delta(k) = \hat{y}(k) - y(k), \quad w_j(k+1) = w_j(k) - \gamma_k \delta(k) z_j(k)
How to implement gradient-descent learning in a network of neurons?
Backpropagation training
Minimization of

R_{emp}(\mathbf{w}, V) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i, \mathbf{w}, V) \right)^2

with respect to parameters (weights) W, V

Gradient descent optimization for k = 1, ..., n, ...:

V(k+1) = V(k) - \gamma_k \, \mathrm{grad}_V \, L(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k))
\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k \, \mathrm{grad}_w \, L(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k))

where L(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k)) = \frac{1}{2} \left( f(\mathbf{x}(k), \mathbf{w}, V) - y(k) \right)^2

Careful application of gradient descent leads to the backpropagation algorithm
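A compact Python sketch of backpropagation (stochastic gradient descent) for a single-hidden-layer network under squared loss is given below; the sine-squared toy data, tanh hidden units, fixed learning rate, and epoch count are assumptions chosen for illustration, not the lecture's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumed): y = sin(2*pi*x)^2 + noise, as in the slide examples
n, d, m = 30, 1, 5
X = rng.uniform(0, 1, (n, d))
y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, n)

V = rng.normal(0, 0.1, (d, m)); v0 = np.zeros(m)   # hidden-layer weights and biases
w = rng.normal(0, 0.1, m);      w0 = 0.0           # output weights and bias

gamma = 0.05
for epoch in range(200):
    for i in rng.permutation(n):                   # one epoch = one pass over the data
        x, t = X[i], y[i]
        # forward pass
        z = np.tanh(x @ V + v0)
        y_hat = w @ z + w0
        # backward pass: propagate the error delta = y_hat - t
        delta = y_hat - t
        dz = delta * w * (1.0 - z ** 2)            # error at hidden units (tanh derivative)
        w  -= gamma * delta * z
        w0 -= gamma * delta
        V  -= gamma * np.outer(x, dz)
        v0 -= gamma * dz

print("training MSE:", np.mean((np.tanh(X @ V + v0) @ w + w0 - y) ** 2))
```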
Backpropagation: forward pass
for training input x(k), estimate predicted output \hat{y}(k)
Backpropagation: backward pass
update the weights by propagating the error
Details of backpropagation
Sigmoid activation has a simple derivative:

s(t) = \frac{1}{1 + \exp(-t)}, \quad s'(t) = s(t) \, (1 - s(t))

Poor behaviour for large t ~ saturation
How to avoid saturation?
- Proper initialization (small weights)
- Pre-scaling of inputs (zero mean, unit variance)
Learning rate schedule (initial, final)
Stopping rules, number of epochs
Number of hidden units
Additional enhancements
The problem: convergence may be very slow for error functionals with different curvatures:

Solution: add momentum term to smooth oscillations

w(k+1) = w(k) - \gamma_k \delta(k) z(k) + \mu \Delta w(k)

where \Delta w(k) = w(k) - w(k-1) and \mu is the momentum parameter
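A minimal Python sketch of the momentum-smoothed update; the quadratic test function with very different curvatures, the learning rate, and the momentum value mu = 0.9 are assumed purely for illustration.

```python
import numpy as np

def momentum_step(w, grad, delta_w_prev, gamma=0.05, mu=0.9):
    """One gradient step with a momentum term: delta_w = -gamma*grad + mu*delta_w_prev."""
    delta_w = -gamma * grad + mu * delta_w_prev
    return w + delta_w, delta_w

# Illustrative use on a badly scaled quadratic (assumed example, not from the slides)
w = np.array([4.0, -3.0])
delta_w = np.zeros(2)
curvatures = np.array([1.0, 25.0])          # very different curvatures along the two axes
for _ in range(100):
    grad = curvatures * w                   # gradient of 0.5 * sum(curvatures * w**2)
    w, delta_w = momentum_step(w, grad, delta_w, gamma=0.03, mu=0.9)
print("final w (should approach 0):", w)
```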
Regularization Effect of Backpropagation
Backpropagation ~ iterative optimization
Final model (weights) depends on:
- initial point + final point (stopping rules)
-> initialization and/or stopping rules can be used for model complexity control
Various forms of complexity control
MLP topology ~ number of hidden units
Constraints on parameters (weights) ~ weight decay
Type of optimization algorithm (many versions of backprop., other optimization methods)
Stopping rules
Initial conditions (initial small weights)
Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed
Example: univariate regression
Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).
MLP network (five hidden units) ~ near optimal

[Figure: training samples and fitted MLP model, Y vs X on [0, 1]]
Example: univariate regression
Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).
MLP network (20 hidden units) ~ little overfitting

[Figure: training samples and fitted MLP model, Y vs X on [0, 1]]
Backpropagation for classification
Original MLP is for regression (as shown)
For classification:
- use sigmoid output unit
- during training, use real values 0/1 for class labels
- during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels

[Figure: two-layer MLP network, inputs x_1, ..., x_d, hidden units z_1, ..., z_m, output \hat{y} = \sum_{j=1}^{m} w_j z_j with z_j = g(x, v_j); W is m x 1, V is d x m]
Classification example (Ripley's data set)

Data set: 250 samples ~ mixture of Gaussians, where
Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and
Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3).
The variance of all Gaussians is 0.03.

MLP classifier (two hidden units) ~ underfitting

[Figure: Ripley's training data and MLP decision boundary]
Classification Example
MLP classifier (six hidden units) ~ some overfitting

[Figure: Ripley's training data and MLP decision boundary]
MLP software
MLP software widely available in public domain
For example, Netlab toolbox (in Matlab) at
http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
Many commercial products (full of Neural Network marketing hype), e.g.:
Nearly 80% Accurate Market Forecasting Software
Get FREE up to date predictions and see for yourself!
NetTalk (Sejnowski and Rosenberg, 1987)
One of the first successful applications of backpropagation:
http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
Goal: learning to read (English text) aloud, i.e.
learn the mapping: English text -> phonemes
using an MLP network
Network inputs encode a 7-letter window (the 4th letter in the middle needs to be pronounced)
Network outputs (26 units) encode phonemes that drive a speech synthesizer
The MLP network is trained using labeled data (both individual words and unrestricted text)
NetTalk architecture
Input encoding: 7x29 = 203 units
Output encoding: 26 units (phonemes)
Hidden layer: 80 hidden units
Listening to NetTalk-generated speech
Listen to tape recordings illustrating NETtalk operation. These recordings
are available (in MP3 format) from an article in Wikipedia at
http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)
This article has a link to the audio examples of the neural network as it
progresses through training. Specifically, it has three recordings containing 3
different audio outputs of NETtalk:
(a) during the first 5 minutes of training, starting with weights initialized to zero.
(b) after training using the set of 10,000 words. This training set corresponds to 20
passes (epochs) over 500-word text.
(c) generated with new text input from transcription that was not part of the training
set.
After listening to these recordings, answer and comment on the following questions:
- can you recognize words in the beginning of recording (a)? in the end of (a)?
- compare the quality of outputs (b) and (c). Which one seems closer to human
speech and why?
NETtalk: question for discussion
NETtalk system uses a seven-letter window for
text input. Try to justify this choice (of window
size) based on the properties of natural English
language. How would the performance of NETtalk change if a small window (of size 3 letters) or a large window (of size 21 letters) were used instead?
Radial Basis Function (RBF) networks
Dictionary parameterization:

f_m(\mathbf{x}, \mathbf{w}, V) = w_0 + \sum_{j=1}^{m} w_j \, g\!\left( \frac{\| \mathbf{x} - \mathbf{v}_j \|}{\sigma_j} \right)

- each basis function is (usually) local, with center \mathbf{v}_j and width \sigma_j
- i.e. Gaussian: g(\mathbf{x}, \mathbf{v}_j) = \exp\!\left( -\frac{\| \mathbf{x} - \mathbf{v}_j \|^2}{2 \sigma_j^2} \right)

Typically used for regression or classification

[Figure: RBF network, inputs x_1, ..., x_d, hidden units z_1, ..., z_m, output \hat{y} = \sum_{j=1}^{m} w_j z_j with z_j = g(x, v_j); W is m x 1, V is d x m]
RBF network training
RBF training (learning) ~ estimation of
(1) RBF parameters (centers, widths)
(2) linear weights w
Non-adaptive implementation:
(1) Estimate RBF parameters via unsupervised learning (only x-values of training data) - can use SOM, GLA etc.
(2) Estimate weights w via linear least squares
Advantages:
- fast training
- works when x-samples are plentiful, but labeled (x, y) data are few
Limitations: cannot discard irrelevant inputs
-> the curse of dimensionality
Non-adaptive RBF training algorithm

1. Choose the number of basis functions (centers) m.
2. Estimate centers \mathbf{v}_j using x-values of training data via unsupervised learning (SOM, GLA, clustering etc.)
3. Determine width parameters using the heuristic: for a given center \mathbf{v}_j
   (a) find the distance to the closest center: r_j = \min_{k \neq j} \| \mathbf{v}_k - \mathbf{v}_j \|
   (b) set the width parameter \sigma_j = \lambda r_j,
   where parameter \lambda controls the degree of overlap between adjacent basis functions. Typically 1 \leq \lambda \leq 3.
4. Estimate weights w via linear least squares (minimization of the empirical risk).
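A possible Python sketch of this non-adaptive RBF training procedure, assuming Gaussian basis functions, k-means-style center estimation, an overlap parameter lambda = 1.5, and the sine-squared toy data from the regression examples; all of these specifics are illustrative assumptions.

```python
import numpy as np

def train_rbf(X, y, m, n_iter=20, overlap=1.5, rng=None):
    """Non-adaptive RBF training sketch: (1) centers via batch GLA / k-means,
    (2) widths from nearest-center distances, (3) weights via linear least squares."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), m, replace=False)]
    for _ in range(n_iter):                               # step 2: k-means style center estimation
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    # step 3: width heuristic sigma_j = overlap * (distance to the closest other center)
    cd = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(cd, np.inf)
    sigma = overlap * cd.min(1)
    # step 4: linear least squares for the output weights (with a bias column)
    G = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))
    G = np.hstack([np.ones((len(X), 1)), G])
    w = np.linalg.lstsq(G, y, rcond=None)[0]
    return centers, sigma, w

# illustrative data (assumed): the sine-squared example from the slides
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, 30)
centers, sigma, w = train_rbf(X, y, m=5, rng=rng)
print("centers:", centers.ravel())
```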
RBF network complexity control
RBF model complexity can be controlled by:
- The number of RBFs:
  Goal: select optimal number of units (RBFs)
- RBF width:
  Goal: select optimal width parameter (for a large number of RBFs)
- Penalization of large weights w
See toy examples next (using the number of units as the complexity parameter)
Example: RBF regression
Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).
RBF network: automatic width selection
2 RBFs ~ underfitting

[Figure: training samples and fitted RBF model, Y vs X on [0, 1]]
Example: RBF regression
Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).
RBF network: automatic width selection
5 RBFs ~ optimal

[Figure: training samples and fitted RBF model, Y vs X on [0, 1]]
Example: RBF regression
Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).
RBF network: automatic width selection
20 RBFs ~ overfitting

[Figure: training samples and fitted RBF model, Y vs X on [0, 1]]
RBF classification example (Ripley's data)

Data set: 250 samples ~ mixture of Gaussians, where
Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and
Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3).
The variance of all Gaussians is 0.03.

RBF classifier (4 units) ~ little underfitting

[Figure: Ripley's training data and RBF decision boundary]
Overview
Recall from Lecture Set 2:
unsupervised learning ~ data reduction approach
Example: training data represented by 3 centers

[Figure: 2D data samples and 3 cluster centers]
2. Dimensionality reduction: linear vs nonlinear

Note: the goal is to estimate a mapping from d-dimensional input space (d = 2) to a low-dimensional feature space (m = 1)

R(\omega) = \int \| \mathbf{x} - f(\mathbf{x}, \omega) \|^2 \, p(\mathbf{x}) \, d\mathbf{x}

[Figure: 2D data (x1, x2) with linear and nonlinear 1D projections]
Unsupervised Learning: Formalization
Unsupervised learning
~ mapping from the input space (x) to some model space
For VQ/clustering:
a model is a set of centers (cluster centers)
For dimensionality reduction:
a model is a low-dimensional space
Note 1: the two types of problems can be combined
Note 2: unsupervised learning requires estimation of two mappings x -> z -> x*:
z = F(x) and x* = G(z)
Generalized Lloyd Algorithm (GLA) for VQ

Given data points x(k), k = 1, 2, ..., a loss function L (i.e., squared loss) and initial centers c_j(0), j = 1, ..., m

Perform the following updates upon presentation of x(k):
1. Find the nearest center to the data point (the winning unit):
   j = \arg\min_i \| \mathbf{x}(k) - \mathbf{c}_i(k) \|
2. Update the winning unit coordinates (only) via
   \mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k \, [\mathbf{x}(k) - \mathbf{c}_j(k)]
Increment k and iterate steps (1)-(2) above
Note:
- the learning rate \gamma_k decreases with iteration number k
- biological interpretations of steps (1)-(2) exist
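A minimal Python sketch of this flow-through GLA; the 1/k learning-rate schedule, the random 2D data, and the number of passes over the data are assumptions for illustration.

```python
import numpy as np

def online_gla(X, m, n_passes=5, rng=None):
    """Flow-through GLA sketch: present samples one at a time and move only the winning unit."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), m, replace=False)].copy()
    k = 0
    for _ in range(n_passes):
        for x in X[rng.permutation(len(X))]:
            k += 1
            gamma = 1.0 / k                                  # decreasing learning rate (assumed schedule)
            j = np.argmin(((centers - x) ** 2).sum(1))       # step 1: winning unit
            centers[j] += gamma * (x - centers[j])           # step 2: update the winner only
    return centers

# illustrative 2D data (assumed)
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
print(online_gla(X, m=4, rng=rng))
```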
Batch version of GLA

Given data points x_i, i = 1, ..., n, a loss function L (i.e., squared loss) and initial centers c_j(0), j = 1, ..., m

Iterate the following two steps:
1. Partition the data (assign sample x_i to unit j) using the nearest neighbor rule. Partitioning matrix Q:

   q_{ij} = 1 \text{ if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)), \quad 0 \text{ otherwise}

2. Update unit coordinates as centroids of the data:

   \mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij} \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \quad j = 1, ..., m

Note: the final solution may depend on initialization (local minima) - a potential problem for both on-line and batch GLA
Statistical Interpretation of GLA

Iterate the following two steps:
1. Partition the data (assign sample x_i to unit j) using the nearest neighbor rule. Partitioning matrix Q:

   q_{ij} = 1 \text{ if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)), \quad 0 \text{ otherwise}

   ~ Projection of the data onto model space (units), F(x)

2. Update unit coordinates as centroids of the data:

   \mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij} \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \quad j = 1, ..., m

   ~ Conditional expectation (averaging, smoothing), G(z), conditional upon results of partitioning step (1)
Numeric example of univariate VQ

Given data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2
Initialization (random): c1 = 3, c2 = 4

Iteration 1
Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}
Expectation (averaging): c1 = 2.5, c2 = 16
Iteration 2
Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}
Expectation (averaging): c1 = 3, c2 = 18
Iteration 3
Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}
Expectation (averaging): c1 = 4.75, c2 = 19.6
Iteration 4
Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}
Expectation (averaging): c1 = 7, c2 = 25
Stop as the algorithm has stabilized with these values
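The short batch-GLA sketch below reproduces these iterations; the data and initial centers are taken from the slide, while the stopping check is an assumed implementation detail.

```python
import numpy as np

# Batch GLA on the univariate example above; data and initial centers from the slide.
x = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
centers = np.array([3.0, 4.0])

for it in range(1, 10):
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)   # projection step
    new_centers = np.array([x[labels == j].mean() for j in range(2)])   # expectation (averaging) step
    print(f"Iteration {it}: c1={new_centers[0]:g}, c2={new_centers[1]:g}")
    if np.allclose(new_centers, centers):     # stop once the centers no longer change
        break
    centers = new_centers
```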
GLA Example 2
Modeling doughnut distribution using 20 units:
7 units were never moved by the GLA
the problem of unused units (dead units)
Avoiding local minima with GLA

- Starting with many random initializations, and then choosing the best GLA solution
- Conscience mechanism: forcing dead units to participate in competition, by keeping the frequency count (of past winnings) for each unit,
  i.e. for the on-line version of GLA in Step 1:

  j = \arg\min_i \; freq_i(k) \, \| \mathbf{x}(k) - \mathbf{c}_i(k) \|

- Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data.
Clustering methods
Clustering: separating a data set into several groups (clusters) according to some measure of similarity
Goals of clustering:
- interpretation (of resulting clusters)
- exploratory data analysis
- preprocessing for supervised learning
- often the goal is not formally stated
VQ-style methods (GLA) often used for clustering, i.e. k-means or c-means
Many other clustering methods as well
Clustering (cont'd)

Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
- similarity ~ distance metric dist(i, j)
- usually k given a priori (but not always!)
Intuitive motivation:
- similar objects into one cluster
- dissimilar objects into different clusters
- the goal is not formally stated
Similarity (distance) measure is critical but usually hard to define (objectively). Distance needs to be defined for different types of input variables.
Applications of clustering
Marketing: explore customer data to identify buying patterns for targeted marketing (Amazon.com)
Economic data: identify similarity between different countries, states, regions, companies, mutual funds etc.
Web data: cluster web pages or web users to discover groups of similar access patterns
Etc., etc.
Clustering Methods

Many different approaches developed in
- neural networks
- mathematics (graph theory, linear algebra)
- pattern recognition
- data mining etc.
Example graph-theoretic approach: Minimum Spanning Tree (MST) clustering
Types of clustering methods: hierarchical, partitional, fuzzy clustering.
K-means clustering (~ GLA)
This is a representative partitional clustering method.

Given a data set of n samples x_i and the value of k:
Step 0: (arbitrarily) initialize cluster centers
Step 1: assign each data point (object) to the cluster with the closest cluster center
Step 2: calculate the mean (centroid) of data points in each cluster as the estimated cluster centers
Iterate steps 1 and 2 until the cluster membership is stabilized
The K-Means Clustering Method: Example (K = 2)

[Figure: K-means iterations on 2D data - arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the cluster means again]
Clustering of High-Dimensional Data
Additional challenges:
- many clustering methods rely on intuition for low-dimensional data
- visualization is possible only in 2D or 3D

Multidimensional scaling (MDS) aims to produce a 2D representation of inter-point distances between high-dimensional samples.
MDS finds a set of points Z = (z_1, ..., z_n) in 2D space which minimizes the stress function

S(\mathbf{z}_1, \mathbf{z}_2, ..., \mathbf{z}_n) = \sum_{i < j} \left( \delta_{ij} - \| \mathbf{z}_i - \mathbf{z}_j \| \right)^2

where \delta_{ij} denotes the inter-point distance between high-dimensional samples i and j.
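Below is a sketch of the stress function and a naive gradient-descent minimization of it (not the classical or SMACOF MDS algorithms); the random high-dimensional data, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

def mds_stress(Z, delta):
    """Stress of a 2D configuration Z given target inter-point distances delta (n x n)."""
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(Z), k=1)
    return ((delta[iu] - D[iu]) ** 2).sum()

def mds(delta, n_iter=500, lr=0.01, rng=None):
    """Naive gradient descent on the stress function (a sketch only)."""
    rng = rng or np.random.default_rng(0)
    n = delta.shape[0]
    Z = rng.normal(scale=0.1, size=(n, 2))
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]
        D = np.sqrt((diff ** 2).sum(-1)) + np.eye(n)      # avoid divide-by-zero on the diagonal
        # gradient of sum_{i<j} (delta_ij - D_ij)^2 with respect to each z_i
        grad = 2 * ((D - delta) / D)[:, :, None] * diff
        Z -= lr * grad.sum(axis=1)
    return Z

# illustrative use (assumed data): distances between random 5-dimensional samples
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
delta = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Z = mds(delta, rng=rng)
print("final stress:", round(mds_stress(Z, delta), 3))
```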
Self-Organizing Maps
History and biological motivation:
- Brain changes its internal structure to reflect life experiences -> interaction with the environment is critical at early stages of brain development (first 1-2 years of life)
- Existence of various regions (maps) in the brain
- How may these maps be formed?
  i.e. an information-processing model leading to map formation
T. Kohonen (early 1980s) proposed SOM
Goal of SOM
Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (map)
Feature space (Z-space) is 1D or 2D and is discretized as a number of units, i.e., a 10x10 map
Z-space has a distance metric -> ordering of units
Similarities and differences between VQ and SOM

[Diagram: mappings between input space X and feature space Z, Z = G(X) and X = F(Z)]
Self-Organizing Map

Discretization of 2D space via a 10x10 map. In this discrete space, distance relations exist between all pairs of units.
Distance relation ~ map topology

[Figure: units arranged on a grid in the 2D feature space]
SOM Algorithm (flow through)
Given data points x(k), k = 1, 2, ..., a distance metric in the input space (~ Euclidean), a map topology (in z-space), and initial positions of the units c_j(0), j = 1, ..., m (in x-space)

Perform the following updates upon presentation of x(k):
1. Find the nearest center to the data point (the winning unit):

   \mathbf{z}^*(k) = \arg\min_i \| \mathbf{x}(k) - \mathbf{c}_i(k-1) \|

2. Update all units around the winning unit via

   \mathbf{c}_j(k) = \mathbf{c}_j(k-1) + \beta_k \, K_{\alpha_k}(\mathbf{z}_j, \mathbf{z}^*(k)) \, [\mathbf{x}(k) - \mathbf{c}_j(k-1)]

Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1)-(2) above
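A minimal Python sketch of this flow-through SOM; the 10x10 grid, Gaussian neighborhood, exponential width decrease, and linear learning-rate decrease follow the schedules discussed on the next slides, but the specific constants are assumptions.

```python
import numpy as np

def online_som(X, grid=(10, 10), k_max=10000, rng=None):
    """Flow-through SOM sketch: Gaussian neighborhood in z-space, shrinking width,
    decreasing learning rate (the exact schedules are assumptions)."""
    rng = rng or np.random.default_rng(0)
    # unit coordinates z_j on a 2D grid (map topology) and initial positions c_j in x-space
    zs = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], dtype=float)
    centers = rng.uniform(0, 1, (len(zs), X.shape[1]))
    a_init, a_final = max(grid) / 2.0, 0.5
    for k in range(k_max):
        x = X[rng.integers(len(X))]
        winner = np.argmin(((centers - x) ** 2).sum(1))                  # step 1: winning unit
        alpha = a_init * (a_final / a_init) ** (k / k_max)               # neighborhood width
        beta = 0.1 * (1.0 - k / k_max)                                   # learning rate
        K = np.exp(-((zs - zs[winner]) ** 2).sum(1) / (2 * alpha ** 2))  # neighborhood in z-space
        centers += beta * K[:, None] * (x - centers)                     # step 2: move all units
    return zs, centers

# illustrative use (assumed): uniform samples in the unit square, as in the later slides
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (300, 2))
zs, centers = online_som(X, rng=rng)
print(centers[:3])
```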
SOM example (1st iteration)

[Figure: Step 1 (find the winning unit) and Step 2 (move the winner and its neighbors towards the data point) on example data]
SOM example (next iteration)
[Figure: Step 1 and Step 2 at a later iteration, and the final map]
Hyper-parameters of SOM

SOM performance depends on parameters (~ user-defined):
- Map dimension and topology (usually 1D or 2D)
- Number of SOM units ~ quantization level (of z-space)
- Neighborhood function ~ rectangular or Gaussian (not important)
- Neighborhood width decrease schedule (important), i.e. exponential decrease for a Gaussian neighborhood

  K_{\alpha(k)}(\mathbf{z}, \mathbf{z}') = \exp\!\left( -\frac{\| \mathbf{z} - \mathbf{z}' \|^2}{2 \alpha(k)^2} \right), \quad \alpha(k) = \alpha_{initial} \left( \frac{\alpha_{final}}{\alpha_{initial}} \right)^{k / k_{max}}

  with user-defined k_{max}, \alpha_{initial}, \alpha_{final}. Also possible: linear decrease of the neighborhood width
- Learning rate schedule (important), e.g.

  \beta(k) = \beta_{initial} \left( \frac{\beta_{final}}{\beta_{initial}} \right)^{k / k_{max}}

  (also linear decrease)
Note: learning rate and neighborhood decrease schedules should be set jointly
Modeling uniform distribution via SOM
(a) 300 random samples   (b) 10x10 map

[Figure: (a) uniform random samples in the unit square, (b) fitted 10x10 SOM]

SOM neighborhood: Gaussian
Learning rate: linear decrease, \beta(k) = 0.1 \, (1 - k / k_{max})

Position of SOM units: (a) initial, (b) after 50 iterations,
(c) after 100 iterations, (d) after 10,000 iterations
[Figure: SOM unit positions at the four training stages listed above]
Batch SOM (similar to batch GLA)

Given data points x_i, i = 1, ..., n, a loss function L (i.e., squared loss) and initial centers c_j(0), j = 1, ..., m

Iterate the following steps:
1. Partition the data (assign sample x_i to unit j) using the nearest neighbor rule. Partitioning matrix Q:

   q_{ij} = 1 \text{ if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)), \quad 0 \text{ otherwise}

2. Update unit coordinates as a weighted average of all samples:

   \mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} K(\mathbf{z}_j, \mathbf{z}_i) \, \mathbf{x}_i}{\sum_{i=1}^{n} K(\mathbf{z}_j, \mathbf{z}_i)}

   where K(\mathbf{z}_j, \mathbf{z}_i) is the weight of sample \mathbf{x}_i (neighborhood function between unit j and the unit to which \mathbf{x}_i is assigned)

3. Decrease the neighborhood width
Iterate (repeat) the above steps for a maximum number of iterations Kmax
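A possible Python sketch of batch SOM with a 1D (linear) topology; the doughnut-shaped test data and the exponential neighborhood schedule are assumptions chosen to mirror the following example.

```python
import numpy as np

def batch_som(X, grid_units=5, k_max=50, a_init=1.0, a_final=0.05, rng=None):
    """Batch SOM sketch with a linear (1D) topology: partition by nearest center, then update
    each unit as a neighborhood-weighted average of all samples; schedules are assumptions."""
    rng = rng or np.random.default_rng(0)
    zs = np.arange(grid_units, dtype=float)                   # 1D map coordinates z_j
    centers = X[rng.choice(len(X), grid_units, replace=False)].copy()
    for k in range(k_max):
        alpha = a_init * (a_final / a_init) ** (k / (k_max - 1))   # shrinking neighborhood width
        # partition step: index of the winning unit for each sample
        winners = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # update step: weighted average with neighborhood weights K(z_j, z_winner(i))
        K = np.exp(-((zs[:, None] - zs[winners][None, :]) ** 2) / (2 * alpha ** 2))  # (m, n)
        centers = (K @ X) / K.sum(1, keepdims=True)
    return centers

# illustrative use (assumed): doughnut-shaped data as in the next slide
rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.1, (300, 2))
print(batch_som(X, rng=rng))
```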
SOM Example 1
Modeling doughnut distribution using batch SOM:
- linear (1D) SOM topology with 5 units
- initial neighborhood = 1, final neighborhood = 0.05
- number of iterations Kmax = 50
Note: no unused units

[Figure: doughnut-distributed data in the (X1, X2) plane and the fitted 1D SOM]
SOM Example 2

Modeling the same doughnut distribution using:
- square grid (2D) SOM topology with 5x5 = 25 units
- initial neighborhood = 1, final neighborhood = 0.05
- number of iterations Kmax = 50
Note: final model not sensitive to poor initialization

[Figure: doughnut-distributed data in the (X1, X2) plane and the fitted 2D SOM grid]
SOM Applications and Variations
Main web site: Helsinki University of Technology (HUT)
http://www.cis.hut.fi/research/som-research/
Numerous applications:
- Marketing surveys / segmentation
- Financial / stock market data
- Text data / document map - WEBSOM
- Image data / picture map - PicSOM
(see HUT web site)
Practical Issues for SOM

- Pre-scaling of inputs, usually to the [0, 1] range. Why?
- Map topology: usually 1D or 2D
- Number of map units (per dimension)
- Learning rate schedule (for on-line version)
- Neighborhood type and schedule:
  initial size (~1), final size
  Final neighborhood size + number of units determine model complexity.
Modeling US states using 1D SOM
Purpose: clustering of US states
Data encoding: each state described by 5 socio-economic indicators: obesity index, result of 2004 presidential elections, median income, mean NAEP, IQ score
Data scaling: each input scaled independently to the [0, 1] range
SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05
State Obesity index Election_04 Median Income Mean NAEP IQ score
Hawaii 17 0 49775 238 94
Colorado 17 1 49617 252 104
Connecticut 18 0 53325 255 99
Massachusetts 18 0 50587 257 111
New Hampshire 18 1 53549 257 102
Utah 18 1 48537 250 89
California 19 0 48113 238 94
Maryland 19 0 55912 248 95
New Jersey 19 0 53266 253 103
Rhode Island 19 0 44311 245 89
Vermont 19 0 41929 256 102
Florida 19 1 38533 245 87
Montana 19 1 33900 254 100
Oregon 20 0 42704 250 100
Arizona 20 1 41554 241 92
Idaho 20 1 38613 249 96
New Mexico 20 0 35251 235 85
Wyoming 20 1 40499 253 102
Maine 21 0 37654 253 99
New York 21 0 42432 251 90
Washington 21 0 44252 251 92
South Dakota 21 1 38755 254 100
Delaware 22 0 50878 250 90
Illinois 22 0 45906 248 93
Minnesota 22 0 54931 256 113
Wisconsin 22 0 46351 252 105
Nevada 22 1 46289 239 92
Alaska 23 1 55412 245 92
Iowa 23 0 41827 253 109
Kansas 23 1 42523 253 101
Missouri 23 1 43955 251 92
Nebraska 23 1 43566 251 101
North Dakota 23 1 36717 254 111
Ohio 23 1 43332 252 107
Oklahoma 23 1 35500 244 98
Pennsylvania 24 0 43577 249 99
Arkansas 24 1 32423 242 98
Georgia 24 1 43316 243 93
Indiana 24 1 41581 251 105
North Carolina 24 1 38432 252 106
Virginia 24 1 49974 253 99
Michigan 25 0 45335 249 99
Kentucky 25 1 37893 247 94
Tennessee 25 1 36329 241 90
Alabama 26 1 36771 236 90
Louisiana 26 1 33312 238 99
South Carolina 26 1 38460 246 87
Texas 26 1 40659 247 98
Mississippi 27 1 32447 236 90
West Virginia 28 1 30072 245 92
SOM Modeling 1 of US states

Unit  States (assigned to each unit)
1 HI, CA, MD, RI, NM,
2 OR, ME, NY, WA, DE, IL, PA, MI,
3 CT, MA, NJ, VT, MN, WI,
4
5 CO, NH, MT, WY, SD,
6 KS, NE, ND, OH, IN, NC, VA,
7 UT, ID, AK, IA, MO,
8 FL, AZ, NV, OK, GA, KY, TX
9 AR, TN, AL, LA, SC, MS, WV
SOM Modeling 2 of US states
- remove voting input and apply 1D SOM:

Unit  States
1 CO, CT, MA, NH, NJ, MN,
2 WI, IA, ND, OH, IN, NC,
3 VT, MT, OR, ID, WY, ME, SD,
4 KS, MO, NE, PA, VA, MI,
5 UT, MD, NY, WA, DE, IL, AK,
6 HI, CA , RI,
7 FL, AZ, NM, NV,
8 OK, GA, KY, SC, TX,
9 AR, TN, AL, LA, MS, WV
SOM Modeling 2 of US states (cont'd)
- remove voting input and apply 1D SOM:
Tree-structured SOM
Fixed SOM topology gives poor modeling of structured distributions:
Minimum Spanning Tree SOM

Define SOM topology adaptively during each iteration of the SOM algorithm
Minimum Spanning Tree (MST) topology ~ according to the distance between units (in the input space)
Topological distance ~ number of hops in the MST

[Figure: SOM units connected by a minimum spanning tree]
Example of using MST SOM
Modeling cross distribution
MST topology vs fixed 2D grid map
Application: skeletonization of images
Singh et al., Self-organizing maps for the skeletonization of sparse shapes, IEEE Trans. Neural Networks, vol. 11, Issue 1, Jan 2000
Skeletonization of noisy images
Application of MST SOM: robustness with
respect to noise
Clustering of European Languages
Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon
Difficulty of the problem: due to the evolving nature of human languages and globalization.
Hypothesis: similarity based on analysis of a small stable word set.
See glottochronology, Swadesh list, at
http://en.wikipedia.org/wiki/Glottochronology
SOM for clustering European languages
Modeling approach: language ~ 10-word set.
Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure.
Issues:
- selection of a stable word set
- data encoding + distance metric
Stable word set: numbers 1 to 10
Data encoding: Latin (English) alphabet, use the first 3 letters (of each word)
Numbers word set in 18 European languages
Each language is a feature vector encoding 10 words
English  Norwegian  Polish  Czech  Slovakian  Flemish  Croatian  Portuguese  French  Spanish  Italian  Swedish  Danish  Finnish  Estonian  Dutch  German  Hungarian
one en jeden jeden jeden ien jedan um un uno uno en en yksi uks een erins egy
two to dwa dva dva twie dva dois deux dos due tva to kaksi kaks twee zwei ketto
three tre trzy tri tri drie tri tres trois tres tre tre tre kolme kolme drie drie harom
four fire cztery ctyri styri viere cetiri quarto quatre cuatro quattro fyra fire nelja neli vier vier negy
five fem piec pet pat vuvve pet cinco cinq cinco cinque fem fem viisi viis vijf funf ot
six seks szesc sest sest zesse sest seis six seis sei sex seks kuusi kuus zes sechs hat
seven sju sediem sedm sedem zevne sedam sete sept siete sette sju syv seitseman seitse zeven sieben het
eight atte osiem osm osem achte osam oito huit ocho otto atta otte kahdeksan kaheksa acht acht nyolc
nine ni dziewiec devet devat negne devet nove neuf nueve nove nio ni yhdeksan uheksa negen neun kilenc
ten ti dziesiec deset desat t iene deset dez dix dies dieci tio ti kymmenen kumme tien zehn tiz
Data Encoding
Word ~ feature vector encoding its first 3 letters
Alphabet ~ 26 letters + 1 symbol BLANK -> vector encoding:
For example, ONE: O~15, N~14, E~05

ALPHABET  INDEX
BLANK     00
A         01
B         02
C         03
D         04
...
X         24
Y         25
Z         26
Word Encoding (cont'd)
Word -> 27-dimensional feature vector
Encoding is insensitive to the order (of the 3 letters)
Encoding of 10-word set: concatenate the feature vectors of all words: one + two + ... + ten
-> word set encoded as a vector of dimension [1 x 270]

one (Word)
15 14 05 (Indices)
Index:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Vector: 0 0 0 0 0 1 0 0 0 0 0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0
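A small Python sketch of this encoding; the helper function names are assumed for illustration.

```python
import numpy as np

# Sketch of the 27-dimensional word encoding described above (BLANK=0, A=1, ..., Z=26).
def encode_word(word):
    vec = np.zeros(27)
    for ch in word.lower()[:3]:                 # first 3 letters only
        idx = 0 if ch == ' ' else ord(ch) - ord('a') + 1
        vec[idx] = 1.0                          # set-membership encoding, order-insensitive
    return vec

def encode_language(ten_words):
    # concatenate the encodings of the words "one" ... "ten" -> 1 x 270 vector
    return np.concatenate([encode_word(w) for w in ten_words])

english = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"]
v = encode_word("one")
print("indices set for 'one':", np.flatnonzero(v))                # -> [ 5 14 15 ]
print("language vector length:", encode_language(english).size)   # -> 270
```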
SOM Modeling Approach
2-Dimensional SOM (Batch Algorithm)
Number of knots per dimension = 4
Initial neighborhood = 1, final neighborhood = 0.15
Total number of iterations = 70
OUTLINE
Objectives
Brief history and motivation for artificial neural networks
Sequential estimation of model parameters
Methods for supervised learning
Methods for unsupervised learning
Summary and discussion
Summary and Discussion
Neural Network methods (vs statistical approaches):
- new techniques/new insights
- simple (brute-force) computational approaches
- biological motivation
The same fundamental issues: small-sample problems, curse of dimensionality, non-linear optimization, complexity control
Neural network methods implement risk minimization (predictive learning setting)
Hype and controversy