fuzzy lecturer


TRANSCRIPT


    Introduction to Predictive Learning

    Electrical and Computer Engineering

    LECTURE SET 6

    Neural Network Learning


    OUTLINE

    Objectives
    - introduce biologically inspired NN learning methods for clustering, regression and classification
    - explain similarities and differences between statistical and NN methods
    - show examples using synthetic and real-life data

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning

    Methods for unsupervised learning

    Summary and discussion


    Brief history and motivation for ANN

    Huge interest in understanding the nature and mechanism of biological/human learning

    Biologists + psychologists do not adopt classical parametric statistical learning, because:
    - parametric modeling is not biologically plausible
    - biological info processing is clearly different from algorithmic models of computation

    Mid 1980s: growing interest in applying biologically inspired computational models to:
    - developing computer models (of human brain)
    - various engineering applications

    New field: Artificial Neural Networks (~1986-1987)

    ANNs represent nonlinear estimators implementing the ERM approach (usually with squared-loss function)


    History and motivation (cont'd)

    Relationship to the problem of inductive learning:

    The same learning problem setting

    Neural-style learning algorithm:
    - on-line (flow through)
    - simple processing

    Biological terminology

    [Figure: the inductive-learning diagram (Generator of samples, System, Learning Machine; input x, output y) and its biological analogue: a synapse with input x, output y and weight w, updated via the Hebbian Rule, delta w ~ xy]


    Neural vs Algorithmic computation

    Biological systems do not use the principles of digital circuits:

                      Digital         Biological
    Connectivity      1~10            ~10,000
    Signal            digital         analog
    Timing            synchronous     asynchronous
    Signal propag.    feedforward     feedback
    Redundancy        no              yes
    Parallel proc.    no              yes
    Learning          no              yes
    Noise tolerance   no              yes


    Neural vs Algorithmic computation

    Computers excel at algorithmic tasks (well-posed mathematical problems)

    Biological systems are superior to digital systems for ill-posed problems with noisy data

    Example: object recognition [Hopfield, 1987]

    PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons
    -> 2x10^13 bit operations per sec

    OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity = 2
    -> 10x10^14 bit operations per sec

    Both have similar raw processing capability, but pigeons are better at recognition tasks


    Neural terminology and artificial neurons

    Some general descriptions of ANNs:
    http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

    http://en.wikipedia.org/wiki/Neural_network

    McCulloch-Pitts neuron (1943)

    Threshold (indicator) function of weighted sum of inputs
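    A minimal sketch (not from the slides; the AND example and names are illustrative) of a McCulloch-Pitts unit as a threshold of a weighted sum:

```python
import numpy as np

def mcculloch_pitts(x, w, threshold):
    """McCulloch-Pitts unit: indicator (threshold) function of the weighted sum of inputs."""
    return 1 if np.dot(w, x) >= threshold else 0

# Illustrative 2-input unit behaving like logical AND
print(mcculloch_pitts([1, 1], w=[1, 1], threshold=2))  # -> 1
print(mcculloch_pitts([1, 0], w=[1, 1], threshold=2))  # -> 0
```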


    Goals of ANNs

    Develop models of computation inspired by biological systems

    Study computational capabilities of networks of interconnected neurons

    Apply these models to real-life applications

    Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs


    Historical highlights of ANN

    1943    McCulloch-Pitts neuron
    1949    Hebbian learning
    1960s   Rosenblatt (perceptron), Widrow
    60s-70s dominance of hard AI
    1980s   resurgence of interest (PDP group, MLP etc.)
    1990s   connection to statistics/VC-theory
    2000s   mature field/unnecessary fragmentation


    OUTLINE

    Objectives

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning

    Methods for unsupervised learning

    Summary and discussion


    Sequential estimation of model parameters

    Batch vs on-line (iterative) learning

    - Algorithmic (statistical) approaches ~ batch
    - Neural-network inspired methods ~ on-line

    BUT the difference is only on the implementation level (so both types of learning methods should yield similar generalization performance)

    Recall ERM inductive principle (for regression):

    Assume dictionary parameterization with fixed basis functions

    $$R_{emp}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(\mathbf{x}_i,\mathbf{w})\bigr) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(\mathbf{x}_i,\mathbf{w})\bigr)^2, \qquad f(\mathbf{x},\mathbf{w}) = \sum_{j=1}^{m} w_j g_j(\mathbf{x})$$


    Sequential (on-line) least squares minimization

    Training pairs (x(k), y(k)) presented sequentially

    On-line update equations for minimizing empirical risk (MSE) wrt parameters w are (gradient-descent learning):

    $$\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \nabla_{\mathbf{w}} L\bigl(\mathbf{x}(k), y(k), \mathbf{w}(k)\bigr)$$

    where the gradient is computed via the chain rule:

    $$\frac{\partial L(\mathbf{x}, y, \mathbf{w})}{\partial w_j} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial w_j} = 2\,(\hat{y} - y)\, g_j(\mathbf{x})$$

    and the learning rate $\gamma_k$ is a small positive value (decreasing with k)
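    A minimal sketch of this on-line update for a dictionary model f(x, w) = sum_j w_j g_j(x); the Gaussian basis functions, the 1/k learning-rate schedule and all names are assumptions for illustration:

```python
import numpy as np

def online_least_squares(X, y, basis_fns, n_epochs=10, gamma0=0.1):
    """Sequential (on-line) gradient descent on the MSE for
    f(x, w) = sum_j w_j * g_j(x) with fixed basis functions g_j."""
    w = np.zeros(len(basis_fns))
    k = 0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            k += 1
            gamma = gamma0 / k                     # decreasing learning rate
            g = np.array([g_j(x_i) for g_j in basis_fns])
            y_hat = w @ g                          # prediction with current weights
            w -= gamma * 2.0 * (y_hat - y_i) * g   # delta rule: gradient of (y_hat - y)^2
    return w

# Illustrative usage with three assumed Gaussian basis functions on [0, 1]
basis = [lambda x, c=c: np.exp(-(x - c) ** 2 / 0.1) for c in (0.2, 0.5, 0.8)]
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * X) ** 2 + rng.normal(0, 0.1, 30)
print(online_least_squares(X, y, basis))
```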


    Neural network interpretation of delta rule

    Forward pass: compute the prediction from the current weights,
    $$\hat{y}(k) = \sum_{j=0}^{m} w_j(k)\, z_j(k), \qquad z_0 \equiv 1$$

    Backward pass: compute the error and update each weight,
    $$\delta(k) = \hat{y}(k) - y(k), \qquad w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k)$$

    Biological learning
    [Figure: a synapse with input x, output y and weight w, updated via the Hebbian Rule, delta w ~ xy]


    Theoretical basis for on-line learning

    Standard inductive learning: given training data $z_1, \dots, z_n$, find the model providing minimum of prediction risk
    $$R(\omega) = \int L(z, \omega)\, p(z)\, dz$$

    Stochastic approximation guarantees minimization of risk (asymptotically):
    $$\omega(k+1) = \omega(k) - \gamma_k\, \mathrm{grad}\, L\bigl(z_k, \omega(k)\bigr)$$

    under general conditions on the learning rate:
    $$\lim_{k\to\infty} \gamma_k = 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty$$


    Practical issues for on-line learning

    Given finite training set (n samples) $z_1, \dots, z_n$:
    this set is presented sequentially to a learning algorithm many times. Each presentation of n samples is called an epoch, and the process of repeated presentations is called recycling (of training data).

    Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically good learning rate schedules are data-dependent.

    Stopping conditions:
    (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
    (2) early stopping can be used for complexity control
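    A minimal sketch of recycling a finite training set with a decreasing learning rate and a gradient-based stopping rule; the linear rate decay, parameter values and names are assumptions:

```python
import numpy as np

def train_online(w, data, grad_fn, gamma_init=0.5, gamma_final=0.01,
                 max_epochs=100, grad_tol=1e-4):
    """Recycle the training set over many epochs, decreasing the learning
    rate each epoch and stopping when the gradient becomes small."""
    for epoch in range(max_epochs):
        # linearly decreasing learning rate schedule (an assumed choice)
        gamma = gamma_init + (gamma_final - gamma_init) * epoch / max_epochs
        max_grad = 0.0
        for z in data:                  # one epoch = one pass over all n samples
            g = grad_fn(w, z)
            w = w - gamma * g
            max_grad = max(max_grad, np.max(np.abs(g)))
        if max_grad < grad_tol:         # stopping condition (1): small gradient
            break
    return w

# Illustrative usage: per-sample loss (w - z)^2 over scalar data
data = np.array([1.0, 2.0, 3.0])
print(train_online(np.array(0.0), data, grad_fn=lambda w, z: 2 * (w - z)))
```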


    OUTLINE

    Objectives

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning
    - MultiLayer Perceptron (MLP) networks
    - Radial Basis Function (RBF) networks

    Methods for unsupervised learning

    Summary and discussion


    Multilayer Perceptrons (MLP)

    Recall graphical NN representation for dictionary methods:

    [Figure: two-layer network with inputs x_1, x_2, ..., x_d, hidden units z_1, z_2, ..., z_m and a single output y; V is d x m (input-to-hidden weights), W is m x 1 (hidden-to-output weights)]

    $$\hat{y} = \sum_{j=1}^{m} w_j z_j, \qquad z_j = g(\mathbf{x}, \mathbf{v}_j)$$

    where
    $$g(\mathbf{x}, \mathbf{v}_i) = s\Bigl(v_{i0} + \sum_{k=1}^{d} x_k v_{ik}\Bigr) = s(\mathbf{x} \cdot \mathbf{v}_i)$$

    with sigmoid activation
    $$s(t) = \frac{1}{1+\exp(-t)} \quad \text{or} \quad s(t) = \tanh(t) = \frac{\exp(t)-\exp(-t)}{\exp(t)+\exp(-t)}$$

    How to estimate parameters (weights) via ERM?
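    A minimal sketch (names are mine) of this single-hidden-layer parameterization with the logistic sigmoid:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, V, w):
    """Single-hidden-layer MLP: z_j = s(v_j0 + x . v_j), y_hat = sum_j w_j z_j.
    V has shape (m, d+1) with the bias v_j0 in column 0; w has shape (m,)."""
    z = sigmoid(V[:, 0] + V[:, 1:] @ x)   # hidden-unit outputs z_1..z_m
    return w @ z                          # linear output unit

# Illustrative usage: d = 2 inputs, m = 3 hidden units, small random weights
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(3, 3))
w = rng.normal(scale=0.1, size=3)
print(mlp_forward(np.array([0.5, -1.0]), V, w))
```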


    Learning for a single neuron (delta rule):

    Forward pass: $\hat{y}(k) = \sum_{j=0}^{m} w_j(k)\, z_j(k)$, with $z_0 \equiv 1$

    Backward pass: $\delta(k) = \hat{y}(k) - y(k)$, $\quad w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k)$

    How to implement gradient-descent learning in a network of neurons?


    Backpropagation training

    Minimization of
    $$R_{emp}(\mathbf{W},\mathbf{V}) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(\mathbf{x}_i,\mathbf{W},\mathbf{V})\bigr)^2$$
    with respect to parameters (weights) W, V

    Gradient-descent optimization for the per-sample loss
    $$L\bigl(\mathbf{x}(k), y(k), \mathbf{V}, \mathbf{w}\bigr) = \tfrac{1}{2}\bigl(f(\mathbf{x},\mathbf{w},\mathbf{V}) - y\bigr)^2$$

    where samples are presented sequentially, $k = 1, \dots, n, \dots$ :
    $$\mathbf{V}(k+1) = \mathbf{V}(k) - \gamma_k\, \mathrm{grad}_{\mathbf{V}}\, L\bigl(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k)\bigr)$$
    $$\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \mathrm{grad}_{\mathbf{w}}\, L\bigl(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k)\bigr)$$

    Careful application of gradient descent leads to the backpropagation algorithm


    Backpropagation: forward pass

    for training input x(k), estimate the predicted output $\hat{y}(k)$


    Backpropagation: backward pass

    update the weights by propagating the error


    Details of backpropagation

    Sigmoid activation has a simple derivative:
    $$s(t) = \frac{1}{1+\exp(-t)}, \qquad s'(t) = s(t)\bigl(1 - s(t)\bigr)$$

    Poor behaviour for large t ~ saturation

    How to avoid saturation?
    - proper initialization (small weights)
    - pre-scaling of inputs (zero mean, unit variance)

    Learning rate schedule (initial, final)

    Stopping rules, number of epochs

    Number of hidden units
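    A minimal sketch of one backpropagation step under the parameterization above (single hidden sigmoid layer, linear output, per-sample loss L = 1/2 (y_hat - y)^2); names, shapes and the learning-rate value are assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_update(x, y, V, w, gamma):
    """One stochastic gradient step on L = 0.5 * (y_hat - y)^2 for the
    single-hidden-layer MLP y_hat = sum_j w_j * s(v_j0 + x . v_j)."""
    # forward pass
    x1 = np.concatenate(([1.0], x))        # prepend bias input
    z = sigmoid(V @ x1)                    # hidden outputs, shape (m,)
    y_hat = w @ z
    # backward pass
    delta = y_hat - y                                   # dL/dy_hat
    grad_w = delta * z                                  # dL/dw_j = delta * z_j
    grad_V = np.outer(delta * w * z * (1 - z), x1)      # uses s'(t) = s(t)(1 - s(t))
    return V - gamma * grad_V, w - gamma * grad_w

# Illustrative single update (small random initialization to avoid saturation)
rng = np.random.default_rng(0)
V, w = rng.normal(scale=0.1, size=(5, 3)), rng.normal(scale=0.1, size=5)
V, w = backprop_update(np.array([0.3, 0.7]), 1.0, V, w, gamma=0.1)
```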


    Additional enhancements

    The problem: convergence may be very slow for error functionals with different curvatures

    Solution: add a momentum term to smooth oscillations
    $$w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k) + \mu\, \Delta w_j(k)$$
    where $\Delta w_j(k) = w_j(k) - w_j(k-1)$ and $\mu$ is the momentum parameter
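    A minimal sketch of the momentum-smoothed update (the parameter values in the example are illustrative):

```python
def momentum_update(w, w_prev, gamma, delta, z, mu=0.9):
    """w(k+1) = w(k) - gamma * delta * z + mu * (w(k) - w(k-1))."""
    return w - gamma * delta * z + mu * (w - w_prev)

print(momentum_update(1.0, 0.9, gamma=0.1, delta=0.5, z=2.0))  # -> approximately 0.99
```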


    Regularization Effect of Backpropagation

    Backpropagation ~ iterative optimization

    Final model (weights) depends on:
    - initial point (initialization)
    - final point (stopping rules)

    Initialization and/or stopping rules can be used for model complexity control


    Various forms of complexity control

    MLP topology ~ number of hidden units

    Constraints on parameters (weights) ~ weight decay

    Type of optimization algorithm (many versions of backprop., other optimization methods)

    Stopping rules

    Initial conditions (initial small weights)

    Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed


    Example: univariate regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    MLP network (five hidden units) ~ near optimal

    [Figure: fitted MLP curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Example: univariate regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    MLP network (20 hidden units) ~ little overfitting

    [Figure: fitted MLP curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Backpropagation for classification

    Original MLP is for regression (as shown)

    For classification:
    - use a sigmoid output unit
    - during training, use real values 0/1 for class labels
    - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels (see the sketch below)

    [Figure: the same two-layer network as before; inputs x_1, ..., x_d, hidden units z_1, ..., z_m, output y, with V d x m and W m x 1]
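    A minimal sketch of the operation rule above: squash the trained MLP output through a sigmoid and threshold at 0.5 (function names are mine):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def classify(mlp_output, threshold=0.5):
    """Sigmoid output unit squashes the trained MLP's output to (0, 1);
    thresholding at 0.5 gives the predicted class label (0 or 1)."""
    return int(sigmoid(mlp_output) >= threshold)

print(classify(2.3), classify(-0.7))   # -> 1 0
```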


    Classification example (Ripley's data set)

    Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.

    MLP classifier (two hidden units) ~ underfitting

    [Figure: decision boundary of the MLP classifier over the two-class scatter plot]


    Classification Example

    MLP classifier (six hidden units) ~ some overfitting

    [Figure: decision boundary of the MLP classifier over the two-class scatter plot]


    MLP software

    MLP software widely available in public domain

    For example, the Netlab toolbox (in Matlab) at
    http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/

    Many commercial products (full of Neural Network marketing hype), e.g.:
    "Nearly 80% Accurate Market Forecasting Software. Get FREE up to date predictions and see for yourself!"


    NetTalk (Sejnowski and Rosenberg, 1987)

    One of the first successful applications of backpropagation:
    http://www.cnl.salk.edu/ParallelNetsPronounce/index.php

    Goal: learning to read (English text) aloud, i.e. learn the mapping: English text -> phonemes, using an MLP network

    Network inputs encode a 7-letter window (the 4th letter in the middle needs to be pronounced)

    Network outputs (26 units) encode phonemes that drive a speech synthesizer

    The MLP network is trained using labeled data (both individual words and unrestricted text)


    NetTalk architecture

    Input encoding: 7 x 29 = 203 units
    Hidden layer: 80 hidden units
    Output encoding: 26 units (phonemes)


    Listening to NetTalk-generated speech

    Listen to tape recordings illustrating NETtalk operation. These recordings

    are available (in MP3 format) from an article in Wikipedia at

    http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

    This article has a link to the audio examples of the neural network as it
    progresses through training. Specifically, it has three recordings containing 3
    different audio outputs of NETtalk:

    (a) during the first 5 minutes of training, starting with weights initialized to zero.

    (b) after training using the set of 10,000 words. This training set corresponds to 20

    passes (epochs) over 500-word text.

    (c) generated with new text input from transcription that was not part of the training

    set.

    After listening to these recordings, answer and comment on the following questions:
    - can you recognize words in the beginning of recording (a)? in the end of (a)?

    - compare the quality of outputs (b) and (c). Which one seems closer to human

    speech and why?


    NETtalk: question for discussion

    The NETtalk system uses a seven-letter window for text input. Try to justify this choice (of window size) based on the properties of natural English language. How would the performance of NETtalk change if a small window (of size 3 letters) or a large window (of size 21 letters) were used instead?


    Radial Basis Function (RBF) networks

    Dictionary parameterization:
    $$f_m(\mathbf{x}) = \sum_{j=1}^{m} w_j\, g\bigl(\mathbf{x}, \mathbf{v}_j\bigr) + w_0$$
    - each basis function is (usually) local, with center $\mathbf{v}_j$ and width $\sigma_j$
    - e.g., Gaussian:
    $$g\bigl(\mathbf{x}, \mathbf{v}_j\bigr) = \exp\Bigl(-\frac{\lVert \mathbf{x} - \mathbf{v}_j \rVert^2}{2\sigma_j^2}\Bigr)$$

    Typically used for regression or classification

    [Figure: the same two-layer network diagram; inputs x_1, ..., x_d, hidden units z_1, ..., z_m with z_j = g(x, v_j), output y = sum_j w_j z_j; V is d x m, W is m x 1]
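    A minimal sketch (names are mine) of prediction with the Gaussian RBF parameterization above:

```python
import numpy as np

def rbf_predict(x, centers, widths, w, w0=0.0):
    """f(x) = w0 + sum_j w_j * exp(-||x - v_j||^2 / (2 * sigma_j^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)        # squared distances to centers
    z = np.exp(-d2 / (2.0 * widths ** 2))          # local Gaussian basis functions
    return w0 + w @ z

# Illustrative usage: three one-dimensional Gaussian basis functions
centers = np.array([[0.2], [0.5], [0.8]])
print(rbf_predict(np.array([0.4]), centers,
                  widths=np.array([0.1, 0.1, 0.1]), w=np.array([1.0, 2.0, 0.5])))
```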


    RBF network training

    RBF training (learning) ~ estimation of

    (1) RBF parameters (centers, widths)
    (2) linear weights w's

    Non-adaptive implementation:
    (1) Estimate RBF parameters via unsupervised learning (only x-values of training data); can use SOM, GLA etc.
    (2) Estimate weights w via linear least squares

    Advantages:
    - fast training
    - useful when x-samples are plentiful, but (x,y) data are few

    Limitations: cannot discard irrelevant inputs -> the curse of dimensionality


    Non-adaptive RBF training algorithm

    1. Choose the number of basis functions (centers) m.

    2. Estimate centers $\mathbf{v}_j$ using x-values of training data via unsupervised learning (SOM, GLA, clustering etc.)

    3. Determine width parameters using the heuristic: for a given center $\mathbf{v}_j$,
    (a) find the distance to the closest center:
    $$r_j = \min_{k \ne j} \lVert \mathbf{v}_k - \mathbf{v}_j \rVert$$
    (b) set the width parameter
    $$\sigma_j = \lambda\, r_j$$
    where parameter $\lambda$ controls the degree of overlap between adjacent basis functions. Typically $1 \le \lambda \le 3$.

    4. Estimate weights w via linear least squares (minimization of the empirical risk).

    (A code sketch of these steps follows below.)
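    A minimal sketch of steps 1-4 above, using k-means (batch GLA) for the centers, the nearest-center width heuristic, and linear least squares; the choice of k-means, lambda = 2 and all names are assumptions:

```python
import numpy as np

def train_rbf(X, y, m, lam=2.0, n_iter=100, seed=0):
    """Non-adaptive RBF training: (1-2) centers via k-means on x-values only,
    (3) widths from the nearest-center heuristic, (4) weights via least squares.
    X: array of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    for _ in range(n_iter):                                   # simple batch GLA / k-means
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    dists = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    widths = lam * dists.min(axis=1)                          # sigma_j = lambda * r_j
    Z = np.exp(-((X[:, None, :] - centers) ** 2).sum(-1) / (2 * widths ** 2))
    Z = np.hstack([np.ones((len(X), 1)), Z])                  # bias (w0) column
    w = np.linalg.lstsq(Z, y, rcond=None)[0]                  # linear least squares
    return centers, widths, w

# Illustrative usage echoing the slides' sine-squared example (5 RBFs)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, 30)
print(train_rbf(X, y, m=5))
```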


    RBF network complexity control

    RBF model complexity can be controlled by:

    The number of RBFs:
    Goal: select the optimal number of units (RBFs)

    RBF width:
    Goal: select the optimal width parameter (for a large number of RBFs)

    Penalization of large weights w's

    See toy examples next (using the number of units as the complexity parameter)


    Example: RBF regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    RBF network with automatic width selection, 2 RBFs ~ underfitting

    [Figure: fitted RBF curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Example: RBF regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    RBF network with automatic width selection, 5 RBFs ~ optimal

    [Figure: fitted RBF curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Example: RBF regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    RBF network with automatic width selection, 20 RBFs ~ overfitting

    [Figure: fitted RBF curve vs. training samples; X on [0, 1], Y on [-0.8, 1.2]]


    RBF classification example (Ripley's data)

    Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.

    RBF classifier (4 units) ~ little underfitting

    [Figure: decision boundary of the RBF classifier over the two-class scatter plot]


    Overview

    Recall from Lecture Set 2:

    unsupervised learning

    data reduction approach

    Example: training data represented by 3 centers

    [Figure: training data in 2D with 3 centers]


    2. Dimensionality reduction: linear vs nonlinear

    Note: the goal is to estimate a mapping from d-dimensional input space (d = 2) to a low-dimensional feature space (m = 1), minimizing the risk
    $$R(\omega) = \int \lVert \mathbf{x} - f(\mathbf{x}, \omega)\rVert^2\, p(\mathbf{x})\, d\mathbf{x}$$

    [Figure: 2D data (x1, x2) with a linear and a nonlinear 1D projection]


    Unsupervised Learning: Formalization

    Unsupervised learning ~ mapping from the input space (x) to some model space

    For VQ/clustering: a model is a set of centers (cluster centers)

    For dimensionality reduction: a model is a low-dimensional space

    Note 1: the two types of problems can be combined

    Note 2: unsupervised learning requires estimation of two mappings x -> z -> x*, i.e.
    z = F(x) and x* = G(z)


    Generalized Lloyd Algorithm (GLA) for VQ

    Given data points $\mathbf{x}(k)$, $k = 1, 2, \dots$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$

    Perform the following updates upon presentation of $\mathbf{x}(k)$:

    1. Find the nearest center to the data point (the winning unit):
    $$j = \arg\min_i \lVert \mathbf{x}(k) - \mathbf{c}_i(k) \rVert$$

    2. Update the winning unit coordinates (only) via
    $$\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k \bigl(\mathbf{x}(k) - \mathbf{c}_j(k)\bigr)$$

    Increment k and iterate steps (1)-(2) above.

    Note:
    - the learning rate $\gamma_k$ decreases with iteration number k
    - biological interpretations of steps (1)-(2) exist


    Batch version of GLA

    Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$

    Iterate the following two steps:

    1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule. Partitioning matrix Q:
    $$q_{ij} = \begin{cases} 1 & \text{if } L\bigl(\mathbf{x}_i, \mathbf{c}_j(k)\bigr) = \min_l L\bigl(\mathbf{x}_i, \mathbf{c}_l(k)\bigr) \\ 0 & \text{otherwise} \end{cases}$$

    2. Update unit coordinates as centroids of the data:
    $$\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij}\, \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \qquad j = 1, \dots, m$$

    Note: the final solution may depend on initialization (local minima), a potential problem for both on-line and batch GLA
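    A minimal sketch of batch GLA with squared loss (names are mine); with this loss it coincides with k-means:

```python
import numpy as np

def batch_gla(X, centers, n_iter=100):
    """Batch GLA: alternate (1) nearest-center partitioning and
    (2) centroid updates, using squared-error loss. X has shape (n, d)."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # step 1: partition via the nearest neighbor rule (partition matrix implicit in labels)
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 2: update each unit as the centroid of the samples assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):      # stop when the solution stabilizes
            break
        centers = new_centers
    return centers, labels

# Illustrative usage on two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
centers, labels = batch_gla(X, X[rng.choice(len(X), 3)])
print(centers)
```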


    Statistical Interpretation of GLA

    Iterate the following two steps:

    1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule (partitioning matrix Q, as on the previous slide)
    ~ Projection of the data onto the model space (units), F(x)

    2. Update unit coordinates as centroids of the data
    ~ Conditional expectation (averaging, smoothing) G(z), conditional upon the results of partitioning step (1)


    Numeric Example of univariate VQ

    Given data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2
    Initialization (random): c1 = 3, c2 = 4

    Iteration 1
    Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}
    Expectation (averaging): c1 = 2.5, c2 = 16

    Iteration 2
    Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}
    Expectation (averaging): c1 = 3, c2 = 18

    Iteration 3
    Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}
    Expectation (averaging): c1 = 4.75, c2 = 19.6

    Iteration 4
    Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}
    Expectation (averaging): c1 = 7, c2 = 25

    Stop as the algorithm has stabilized with these values
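    For reference, a tiny self-contained sketch that reproduces these iterations (assignment ties, if any, go to the first center, which is an assumption):

```python
def univariate_vq(data, c1, c2, n_iter=10):
    """Batch GLA for the 1-D example: alternate projection (nearest center)
    and expectation (averaging) until the centers stabilize."""
    for _ in range(n_iter):
        p1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        p2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        new_c1, new_c2 = sum(p1) / len(p1), sum(p2) / len(p2)
        if (new_c1, new_c2) == (c1, c2):
            break
        c1, c2 = new_c1, new_c2
    return c1, c2

print(univariate_vq([2, 4, 10, 12, 3, 20, 30, 11, 25], 3, 4))   # -> (7.0, 25.0)
```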


    GLA Example 2

    Modeling doughnut distribution using 20 units:

    7 units were never moved by the GLA

    the problem of unused units (dead units)


    Avoiding local minima with GLA

    Starting with many random initializations, and then choosing the best GLA solution

    Conscience mechanism: forcing dead units to participate in competition, by keeping the frequency count (of past winnings) for each unit, i.e. for the on-line version of GLA in Step 1:
    $$j = \arg\min_i \; freq_i(k)\, \lVert \mathbf{x}(k) - \mathbf{c}_i(k) \rVert$$

    Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data.


    Clustering methods

    Clustering: separating a data set into several groups (clusters) according to some measure of similarity

    Goals of clustering:
    - interpretation (of resulting clusters)
    - exploratory data analysis
    - preprocessing for supervised learning
    - often the goal is not formally stated

    VQ-style methods (GLA) often used for clustering, i.e. k-means or c-means

    Many other clustering methods as well


    Clustering (cont'd)

    Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
    - similarity ~ distance metric dist(i, j)
    - usually k given a priori (but not always!)

    Intuitive motivation:
    - similar objects into one cluster
    - dissimilar objects into different clusters
    - the goal is not formally stated

    Similarity (distance) measure is critical but usually hard to define (objectively). Distance needs to be defined for different types of input variables.


    Applications of clustering

    Marketing: explore customer data to identify buying patterns for targeted marketing (Amazon.com)

    Economic data: identify similarity between different countries, states, regions, companies, mutual funds etc.

    Web data: cluster web pages or web users to discover groups of similar access patterns

    Etc., etc.


    Clustering Methods

    Many different approaches developed in:
    - neural networks
    - mathematics (graph theory, linear algebra)
    - pattern recognition
    - data mining etc.

    Example graph-theoretic approach: Minimum Spanning Tree (MST) clustering

    Types of clustering methods: hierarchical, partitional, fuzzy clustering.


    K-means clustering (~ GLA)

    This is a representative partitional clustering method.

    Given a data set of n samples $\mathbf{x}_i$ and the value of k:

    Step 0: (arbitrarily) initialize cluster centers
    Step 1: assign each data point (object) to the cluster with the closest cluster center
    Step 2: calculate the mean (centroid) of data points in each cluster as the estimated cluster centers

    Iterate steps 1 and 2 until the cluster membership is stabilized


    The K-Means Clustering Method: Example

    [Figure: sequence of scatter plots illustrating K-means with K = 2 -- arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]


    Clustering of High-Dimensional Data

    Additional Challenges

    - many clustering methods rely on intuition for low-dimensional data
    - visualization is possible only in 2D or 3D.

    Multidimensional scaling (MDS) aims to produce a 2D representation of the inter-point distances between high-dimensional samples.

    MDS finds a set of points $Z = (\mathbf{z}_1, \dots, \mathbf{z}_n)$ in 2D space which minimizes the stress function
    $$S(\mathbf{z}_1, \dots, \mathbf{z}_n) = \sum_{i < j} \bigl(\delta_{ij} - \lVert \mathbf{z}_i - \mathbf{z}_j \rVert\bigr)^2$$
    where $\delta_{ij}$ are the inter-point distances between the high-dimensional samples


    Self-Organizing Maps

    History and biological motivation

    The brain changes its internal structure to reflect life experiences; interaction with the environment is critical at early stages of brain development (first 1-2 years of life)

    Existence of various regions (maps) in the brain

    How may these maps be formed? I.e., an information-processing model leading to map formation

    T. Kohonen (early 1980s) proposed SOM


    Goal of SOM

    Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (map)

    Feature space (Z-space) is 1D or 2D and is discretized as a number of units, i.e., a 10x10 map

    Z-space has a distance metric -> ordering of units

    Similarities and differences between VQ and SOM

    [Figure: mappings between X-space and Z-space, Z = G(X) and X = F(Z)]


    Self-Organizing Map

    Discretization of 2D space via a 10x10 map. In this discrete space, distance relations exist between all pairs of units. Distance relation ~ map topology

    [Figure: units in 2D feature space]


    SOM Algorithm (flow through)

    Given data points $\mathbf{x}(k)$, $k = 1, 2, \dots$, a distance metric in the input space (~ Euclidean), map topology (in z-space), and initial positions of units $\mathbf{c}_j(0)$, $j = 1, \dots, m$ (in x-space)

    Perform the following updates upon presentation of $\mathbf{x}(k)$:

    1. Find the nearest center to the data point (the winning unit):
    $$\mathbf{z}^*(k) = \arg\min_i \lVert \mathbf{x}(k) - \mathbf{c}_i(k-1) \rVert$$

    2. Update all units around the winning unit via
    $$\mathbf{c}_j(k) = \mathbf{c}_j(k-1) + \gamma_k\, K_k\bigl(\mathbf{z}_j, \mathbf{z}^*(k)\bigr)\bigl(\mathbf{x}(k) - \mathbf{c}_j(k-1)\bigr)$$

    Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1)-(2) above
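    A minimal sketch of the flow-through SOM above; the linear decay schedules, the floor on the neighborhood width, the random presentation order and all names are assumptions:

```python
import numpy as np

def som_online(X, map_coords, k_max=1000, gamma0=0.1, width0=1.0, seed=0):
    """Flow-through SOM: find the winning unit in x-space, then move every unit
    towards x(k) with a Gaussian neighborhood weight defined in z-space.
    map_coords: array (m, map_dim) of fixed unit coordinates in z-space."""
    rng = np.random.default_rng(seed)
    m = len(map_coords)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    for k in range(k_max):
        x = X[rng.integers(len(X))]                     # present a data point x(k)
        gamma = gamma0 * (1 - k / k_max)                # decreasing learning rate
        width = width0 * (1 - k / k_max) + 0.05         # decreasing neighborhood width
        winner = np.argmin(((x - centers) ** 2).sum(-1))
        K = np.exp(-((map_coords - map_coords[winner]) ** 2).sum(-1) / (2 * width ** 2))
        centers += gamma * K[:, None] * (x - centers)   # update all units
    return centers

# Illustrative usage: a 1D map with 5 units fitted to 2D data
data = np.random.default_rng(1).normal(size=(200, 2))
print(som_online(data, map_coords=np.arange(5.0)[:, None]))
```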


    SOM example (1st iteration)

    [Figure: Step 1 (winner selection) and Step 2 (neighborhood update) on the example data]


    SOM example (next iteration)

    [Figure: Step 1 (winner selection) and Step 2 (neighborhood update); final map]


    Hyper-parameters of SOM

    SOM performance depends on parameters (~ user-defined):

    Map dimension and topology (usually 1D or 2D)

    Number of SOM units ~ quantization level (of z-space)

    Neighborhood function ~ rectangular or Gaussian (not important)

    Neighborhood width decrease schedule (important), i.e. exponential decrease for the Gaussian neighborhood
    $$K_k(\mathbf{z}, \mathbf{z}') = \exp\Bigl(-\frac{\lVert \mathbf{z} - \mathbf{z}' \rVert^2}{2\sigma^2(k)}\Bigr), \qquad \sigma(k) = \sigma_{initial}\Bigl(\frac{\sigma_{final}}{\sigma_{initial}}\Bigr)^{k/k_{max}}$$
    with user-defined $\sigma_{initial}$, $\sigma_{final}$, $k_{max}$. Also linear decrease of neighborhood width.

    Learning rate schedule (important), e.g. also linear decrease

    Note: learning rate and neighborhood decrease schedules should be set jointly


    Modeling uniform distribution via SOM

    (a) 300 random samples, (b) 10x10 map

    [Figure: samples from the uniform distribution on the unit square and the fitted 10x10 SOM]

    SOM neighborhood: Gaussian
    Learning rate: linear decrease, $\gamma(k) = 0.1\,(1 - k/k_{max})$

    Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations


    [Figure: positions of the SOM units at stages (a)-(d) listed above]


    Batch SOM (similar to batch GLA)

    Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$

    Iterate the following steps:

    1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule. Partitioning matrix Q:
    $$q_{ij} = \begin{cases} 1 & \text{if } L\bigl(\mathbf{x}_i, \mathbf{c}_j(k)\bigr) = \min_l L\bigl(\mathbf{x}_i, \mathbf{c}_l(k)\bigr) \\ 0 & \text{otherwise} \end{cases}$$

    2. Update unit coordinates as a weighted average of all samples:
    $$\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} K\bigl(\mathbf{z}_j, \mathbf{z}(i)\bigr)\, \mathbf{x}_i}{\sum_{i=1}^{n} K\bigl(\mathbf{z}_j, \mathbf{z}(i)\bigr)}$$
    where $K\bigl(\mathbf{z}_j, \mathbf{z}(i)\bigr)$ is the weight of sample $\mathbf{x}_i$ and $\mathbf{z}(i)$ is the map coordinate of the unit to which $\mathbf{x}_i$ is assigned

    3. Decrease the neighborhood width

    Iterate (repeat) the above steps up to the max number of iterations Kmax
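    A minimal sketch of batch SOM for a 1D map with a Gaussian neighborhood and linearly decreasing width (names and schedule details are assumptions):

```python
import numpy as np

def batch_som_1d(X, m, k_max=50, width_init=1.0, width_final=0.05, seed=0):
    """Batch SOM with a linear (1D) map of m units: nearest-unit partition,
    then Gaussian-neighborhood weighted averaging; width shrinks linearly."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    z = np.arange(m, dtype=float)                       # unit coordinates in z-space
    for k in range(k_max):
        width = width_init + (width_final - width_init) * k / (k_max - 1)
        winners = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # K(z_j, z(i)) for every unit j and every sample i
        K = np.exp(-(z[:, None] - z[winners][None, :]) ** 2 / (2 * width ** 2))
        centers = (K @ X) / K.sum(axis=1, keepdims=True)
    return centers

# Illustrative usage: 5-unit 1D map fitted to points on a ring (doughnut-like data)
ring = np.random.default_rng(1).normal(size=(300, 2))
ring /= np.linalg.norm(ring, axis=1, keepdims=True)
print(batch_som_1d(ring, m=5))
```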


    SOM Example 1

    Modeling doughnut distribution using batch SOM:
    - linear (1D) SOM topology with 5 units
    - initial neighborhood = 1, final neighborhood = 0.05
    - number of iterations Kmax = 50

    Note: no unused units

    [Figure: 1D SOM fitted to the doughnut-shaped data; axes X1, X2]


    SOM Example 2

    Modeling the same doughnut distribution using:
    - square grid (2D) SOM topology with 5x5 = 25 units
    - initial neighborhood = 1, final neighborhood = 0.05
    - number of iterations Kmax = 50

    Note: the final model is not sensitive to poor initialization

    [Figure: 2D SOM grid fitted to the doughnut-shaped data; axes X1, X2]


    SOM Applications and Variations

    Main web site: Helsinki University of Technology (HUT)
    http://www.cis.hut.fi/research/som-research/

    Numerous applications:
    - Marketing surveys/segmentation
    - Financial/stock market data
    - Text data/document map - WEBSOM
    - Image data/picture map - PicSOM
    see the HUT web site


    Practical Issues for SOM

    Pre-scaling of inputs, usually to [0, 1] range. Why?

    Map topology: usually 1D or 2D

    Number of map units (per dimension)

    Learning rate schedule (for the on-line version)

    Neighborhood type and schedule: initial size (~1), final size
    Final neighborhood size + number of units determine model complexity.


    Modeling US states using 1D SOM

    Purpose: clustering of US states

    Data encoding: each state described by 5 socio-economic indicators: obesity index, result of 2004 presidential elections, median income, mean NAEP, IQ score

    Data scaling: each input scaled independently to [0, 1] range

    SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05


    State Obesity index Election_04 Median Income Mean NAEP IQ score

    Hawaii 17 0 49775 238 94

    Colorado 17 1 49617 252 104

    Connecticut 18 0 53325 255 99

    Massachusetts 18 0 50587 257 111

    New Hampshire 18 1 53549 257 102
    Utah 18 1 48537 250 89

    California 19 0 48113 238 94

    Maryland 19 0 55912 248 95

    New Jersey 19 0 53266 253 103

    Rhode Island 19 0 44311 245 89

    Vermont 19 0 41929 256 102

    Florida 19 1 38533 245 87

    Montana 19 1 33900 254 100

    Oregon 20 0 42704 250 100

    Arizona 20 1 41554 241 92

    Idaho 20 1 38613 249 96

    New Mexico 20 0 35251 235 85

    Wyoming 20 1 40499 253 102

    Maine 21 0 37654 253 99

    New York 21 0 42432 251 90

    Washington 21 0 44252 251 92

    South Dakota 21 1 38755 254 100
    Delaware 22 0 50878 250 90

    Illinois 22 0 45906 248 93

    Minnesota 22 0 54931 256 113

    Wisconsin 22 0 46351 252 105

    Nevada 22 1 46289 239 92

    Alaska 23 1 55412 245 92


    Iowa 23 0 41827 253 109

    Kansas 23 1 42523 253 101

    Missouri 23 1 43955 251 92
    Nebraska 23 1 43566 251 101

    North Dakota 23 1 36717 254 111

    Ohio 23 1 43332 252 107

    Oklahoma 23 1 35500 244 98

    Pennsylvania 24 0 43577 249 99

    Arkansas 24 1 32423 242 98

    Georgia 24 1 43316 243 93

    Indiana 24 1 41581 251 105

    North Carolina 24 1 38432 252 106

    Virginia 24 1 49974 253 99

    Michigan 25 0 45335 249 99

    Kentucky 25 1 37893 247 94

    Tennessee 25 1 36329 241 90

    Alabama 26 1 36771 236 90

    Louisiana 26 1 33312 238 99

    South Carolina 26 1 38460 246 87

    Texas 26 1 40659 247 98
    Mississippi 27 1 32447 236 90

    West Virginia 28 1 30072 245 92


    SOM Modeling 1 of US states

    Unit  States (assigned to each unit)

    1 HI, CA, MD, RI, NM,

    2 OR, ME, NY, WA, DE, IL, PA, MI,

    3 CT, MA, NJ, VT, MN, WI,

    4

    5 CO, NH, MT, WY, SD,

    6 KS, NE, ND, OH, IN, NC, VA,

    7 UT, ID, AK, IA, MO,

    8 FL, AZ, NV, OK, GA, KY, TX

    9 AR, TN, AL, LA, SC, MS, WV


    SOM Modeling 2 of US states


    - remove voting input and apply 1D SOM:

    Unit  States

    1 CO, CT, MA, NH, NJ, MN,

    2 WI, IA, ND, OH, IN, NC,

    3 VT, MT, OR, ID, WY, ME, SD,

    4 KS, MO, NE, PA, VA, MI,

    5 UT, MD, NY, WA, DE, IL, AK,

    6 HI, CA , RI,

    7 FL, AZ, NM, NV,

    8 OK, GA, KY, SC, TX,

    9 AR, TN, AL, LA, MS, WV

    SOM Modeling 2 of US states (contd)


    - remove voting input and apply 1D SOM:


    Tree-structured SOM

    Fixed SOM topology gives poor modeling of structured distributions:


    Minimum Spanning Tree SOM

    Define SOM topology adaptively during each iteration of the SOM algorithm

    Minimum Spanning Tree (MST) topology ~ according to distance between units (in the input space)

    Topological distance ~ number of hops in the MST

    [Figure: example MST over units, with topological distances 1, 2, 3]


    Example of using MST SOM

    Modeling cross distribution

    MST topology vs fixed 2D grid map

    Application: skeletonization of images


    Singh et al., "Self-organizing maps for the skeletonization of sparse shapes," IEEE Trans. Neural Networks, vol. 11, no. 1, Jan 2000

    Skeletonization of noisy images

    Application of MST SOM: robustness with respect to noise

    Clustering of European Languages


    Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon

    Difficulty of the problem: due to the evolving nature of human languages and globalization.

    Hypothesis: similarity based on analysis of a small stable word set.

    See glottochronology, Swadesh list, at
    http://en.wikipedia.org/wiki/Glottochronology

    SOM for clustering European languages


    Modeling approach: language ~ 10-word set.

    Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure.

    Issues:
    - selection of a stable word set
    - data encoding + distance metric

    Stable word set: numbers 1 to 10

    Data encoding: Latin (English) alphabet, use the 3 first letters (of each word)

    Numbers word set in 18 European languages


    Each language is a feature vector encoding 10 words

    Languages (columns): English, Norwegian, Polish, Czech, Slovakian, Flemish, Croatian, Portuguese, French, Spanish, Italian, Swedish, Danish, Finnish, Estonian, Dutch, German, Hungarian
    one en jeden jeden jeden ien jedan um un uno uno en en yksi uks een erins egy

    two to dwa dva dva twie dva dois deux dos due tva to kaksi kaks twee zwei ketto

    three tre trzy tri tri drie tri tres trois tres tre tre tre kolme kolme drie drie harom

    four fire cztery ctyri styri viere cetiri quarto quatre cuatro quattro fyra fire nelja neli vier vier negy

    five fem piec pet pat vuvve pet cinco cinq cinco cinque fem fem viisi viis vijf funf ot

    six seks szesc sest sest zesse sest seis six seis sei sex seks kuusi kuus zes sechs hat
    seven sju sediem sedm sedem zevne sedam sete sept siete sette sju syv seitseman seitse zeven sieben het

    eight atte osiem osm osem achte osam oito huit ocho otto atta otte kahdeksan kaheksa acht acht nyolc

    nine ni dziewiec devet devat negne devet nove neuf nueve nove nio ni yhdeksan uheksa negen neun kilenc

    ten ti dziesiec deset desat tiene deset dez dix dies dieci tio ti kymmenen kumme tien zehn tiz

    Data Encoding


    Word ~ feature vector encoding the 3 first letters

    Alphabet ~ 26 letters + 1 symbol BLANK

    Vector encoding: for example, ONE : O~15, N~14, E~05

    ALPHABET  INDEX
    BLANK     00
    A         01
    B         02
    C         03
    D         04
    ...
    X         24
    Y         25
    Z         26

    Word Encoding (contd)


    Word ~ 27-dimensional feature vector

    Encoding is insensitive to order (of the 3 letters)

    Encoding of the 10-word set: concatenate feature vectors of all words: one + two + ... + ten
    -> word set encoded as a vector of dim. [1 x 270]

    Example: "one" -> indices 15, 14, 05
    index:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    vector: 0 0 0 0 0 1 0 0 0 0 0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0
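    A minimal sketch of this encoding (names are mine; padding short words with BLANK is an assumption):

```python
import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz"   # index 0 = BLANK, a = 1, ..., z = 26

def encode_word(word):
    """27-dim indicator vector of the first 3 letters (order-insensitive)."""
    v = np.zeros(27)
    for ch in word.lower()[:3].ljust(3):   # pad with BLANK if the word is short
        v[ALPHABET.index(ch)] = 1.0
    return v

def encode_language(ten_words):
    """Concatenate the 10 word vectors into a single 1 x 270 feature vector."""
    return np.concatenate([encode_word(w) for w in ten_words])

print(np.nonzero(encode_word("one"))[0])   # -> [ 5 14 15], i.e. e, n, o
```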


    SOM Modeling Approach

    2-Dimensional SOM (Batch Algorithm)
    Number of knots per dimension = 4
    Initial neighborhood = 1, final neighborhood = 0.15
    Total number of iterations = 70


    OUTLINE

    Objectives

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning

    Methods for unsupervised learning

    Summary and discussion


    Summary and Discussion

    Neural Network methods (vs statistical approaches):
    - new techniques/new insights
    - simple (brute-force) computational approaches
    - biological motivation

    The same fundamental issues: small-sample problems, curse of dimensionality, non-linear optimization, complexity control

    Neural network methods implement risk minimization (predictive learning setting)

    Hype and controversy