fuzzy lecturer


TRANSCRIPT


    Introduction to Predictive Learning

    Electrical and Computer Engineering

    LECTURE SET 6

    Neural Network Learning


    OUTLINE

    Objectives
    - introduce biologically inspired NN learning methods for clustering, regression and classification
    - explain similarities and differences between statistical and NN methods
    - show examples using synthetic and real-life data

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning

    Methods for unsupervised learning

    Summary and discussion


    Brief history and motivation for ANN

    Huge interest in understanding the nature and mechanism of biological/human learning

    Biologists + psychologists do not adopt classical parametric statistical learning, because:
    - parametric modeling is not biologically plausible
    - biological info processing is clearly different from algorithmic models of computation

    Mid 1980s: growing interest in applying biologically inspired computational models to:
    - developing computer models (of human brain)
    - various engineering applications

    New field: Artificial Neural Networks (~1986-1987)

    ANNs represent nonlinear estimators implementing the ERM approach (usually with squared-loss function)


    History and motivation (cont'd)

    Relationship to the problem of inductive learning:

    The same learning problem setting

    Neural-style learning algorithm:
    - on-line (flow through)
    - simple processing

    Biological terminology

    [Figure: the inductive-learning diagram (Generator of samples, System, Learning Machine; input x, output y) and its biological analogue: a synapse with input x, output y and weight w, updated via the Hebbian Rule, delta w ~ xy]


    Neural vs Algorithmic computation

    Biological systems do not use the principles of digital circuits:

                      Digital         Biological
    Connectivity      1~10            ~10,000
    Signal            digital         analog
    Timing            synchronous     asynchronous
    Signal propag.    feedforward     feedback
    Redundancy        no              yes
    Parallel proc.    no              yes
    Learning          no              yes
    Noise tolerance   no              yes


    Neural vs Algorithmic computation

    Computers excel at algorithmic tasks (well-posed mathematical problems)

    Biological systems are superior to digital systems for ill-posed problems with noisy data

    Example: object recognition [Hopfield, 1987]

    PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons
    -> 2x10^13 bit operations per sec

    OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity = 2
    -> 10x10^14 bit operations per sec

    Both have similar raw processing capability, but pigeons are better at recognition tasks


    Neural terminology and artificial neurons

    Some general descriptions of ANNs:
    http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html

    http://en.wikipedia.org/wiki/Neural_network

    McCulloch-Pitts neuron (1943)

    Threshold (indicator) function of weighted sum of inputs
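    A minimal sketch (not from the slides; the AND example and names are illustrative) of a McCulloch-Pitts unit as a threshold of a weighted sum:

```python
import numpy as np

def mcculloch_pitts(x, w, threshold):
    """McCulloch-Pitts unit: indicator (threshold) function of the weighted sum of inputs."""
    return 1 if np.dot(w, x) >= threshold else 0

# Illustrative 2-input unit behaving like logical AND
print(mcculloch_pitts([1, 1], w=[1, 1], threshold=2))  # -> 1
print(mcculloch_pitts([1, 0], w=[1, 1], threshold=2))  # -> 0
```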


    Goals of ANNs

    Develop models of computation inspired by biological systems

    Study computational capabilities of networks of interconnected neurons

    Apply these models to real-life applications

    Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs


    Historical highlights of ANN

    1943    McCulloch-Pitts neuron
    1949    Hebbian learning
    1960s   Rosenblatt (perceptron), Widrow
    60s-70s dominance of hard AI
    1980s   resurgence of interest (PDP group, MLP etc.)
    1990s   connection to statistics/VC-theory
    2000s   mature field/unnecessary fragmentation


    OUTLINE

    Objectives

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning

    Methods for unsupervised learning

    Summary and discussion


    Sequential estimation of model parameters

    Batch vs on-line (iterative) learning

    - Algorithmic (statistical) approaches ~ batch
    - Neural-network inspired methods ~ on-line

    BUT the difference is only on the implementation level (so both types of learning methods should yield similar generalization performance)

    Recall ERM inductive principle (for regression):

    Assume dictionary parameterization with fixed basis functions

    $$R_{emp}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(\mathbf{x}_i,\mathbf{w})\bigr) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(\mathbf{x}_i,\mathbf{w})\bigr)^2, \qquad f(\mathbf{x},\mathbf{w}) = \sum_{j=1}^{m} w_j g_j(\mathbf{x})$$


    Sequential (on-line) least squares minimization

    Training pairs (x(k), y(k)) presented sequentially

    On-line update equations for minimizing empirical risk (MSE) wrt parameters w are (gradient-descent learning):

    $$\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \nabla_{\mathbf{w}} L\bigl(\mathbf{x}(k), y(k), \mathbf{w}(k)\bigr)$$

    where the gradient is computed via the chain rule:

    $$\frac{\partial L(\mathbf{x}, y, \mathbf{w})}{\partial w_j} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial w_j} = 2\,(\hat{y} - y)\, g_j(\mathbf{x})$$

    and the learning rate $\gamma_k$ is a small positive value (decreasing with k)
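    A minimal sketch of this on-line update for a dictionary model f(x, w) = sum_j w_j g_j(x); the Gaussian basis functions, the 1/k learning-rate schedule and all names are assumptions for illustration:

```python
import numpy as np

def online_least_squares(X, y, basis_fns, n_epochs=10, gamma0=0.1):
    """Sequential (on-line) gradient descent on the MSE for
    f(x, w) = sum_j w_j * g_j(x) with fixed basis functions g_j."""
    w = np.zeros(len(basis_fns))
    k = 0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            k += 1
            gamma = gamma0 / k                     # decreasing learning rate
            g = np.array([g_j(x_i) for g_j in basis_fns])
            y_hat = w @ g                          # prediction with current weights
            w -= gamma * 2.0 * (y_hat - y_i) * g   # delta rule: gradient of (y_hat - y)^2
    return w

# Illustrative usage with three assumed Gaussian basis functions on [0, 1]
basis = [lambda x, c=c: np.exp(-(x - c) ** 2 / 0.1) for c in (0.2, 0.5, 0.8)]
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * X) ** 2 + rng.normal(0, 0.1, 30)
print(online_least_squares(X, y, basis))
```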


    Neural network interpretation of delta rule

    Forward pass: compute the prediction from the current weights,
    $$\hat{y}(k) = \sum_{j=0}^{m} w_j(k)\, z_j(k), \qquad z_0 \equiv 1$$

    Backward pass: compute the error and update each weight,
    $$\delta(k) = \hat{y}(k) - y(k), \qquad w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k)$$

    Biological learning
    [Figure: a synapse with input x, output y and weight w, updated via the Hebbian Rule, delta w ~ xy]


    Theoretical basis for on-line learning

    Standard inductive learning: given training data $z_1, \dots, z_n$, find the model providing minimum of prediction risk
    $$R(\omega) = \int L(z, \omega)\, p(z)\, dz$$

    Stochastic approximation guarantees minimization of risk (asymptotically):
    $$\omega(k+1) = \omega(k) - \gamma_k\, \mathrm{grad}\, L\bigl(z_k, \omega(k)\bigr)$$

    under general conditions on the learning rate:
    $$\lim_{k\to\infty} \gamma_k = 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty$$


    Practical issues for on-line learning

    Given finite training set (n samples) $z_1, \dots, z_n$:
    this set is presented sequentially to a learning algorithm many times. Each presentation of n samples is called an epoch, and the process of repeated presentations is called recycling (of training data).

    Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically good learning rate schedules are data-dependent.

    Stopping conditions:
    (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
    (2) early stopping can be used for complexity control
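    A minimal sketch of recycling a finite training set with a decreasing learning rate and a gradient-based stopping rule; the linear rate decay, parameter values and names are assumptions:

```python
import numpy as np

def train_online(w, data, grad_fn, gamma_init=0.5, gamma_final=0.01,
                 max_epochs=100, grad_tol=1e-4):
    """Recycle the training set over many epochs, decreasing the learning
    rate each epoch and stopping when the gradient becomes small."""
    for epoch in range(max_epochs):
        # linearly decreasing learning rate schedule (an assumed choice)
        gamma = gamma_init + (gamma_final - gamma_init) * epoch / max_epochs
        max_grad = 0.0
        for z in data:                  # one epoch = one pass over all n samples
            g = grad_fn(w, z)
            w = w - gamma * g
            max_grad = max(max_grad, np.max(np.abs(g)))
        if max_grad < grad_tol:         # stopping condition (1): small gradient
            break
    return w

# Illustrative usage: per-sample loss (w - z)^2 over scalar data
data = np.array([1.0, 2.0, 3.0])
print(train_online(np.array(0.0), data, grad_fn=lambda w, z: 2 * (w - z)))
```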


    OUTLINE

    Objectives

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning
    - MultiLayer Perceptron (MLP) networks
    - Radial Basis Function (RBF) networks

    Methods for unsupervised learning

    Summary and discussion


    Multilayer Perceptrons (MLP)

    Recall graphical NN representation for dictionary methods:

    [Figure: two-layer network with inputs x_1, x_2, ..., x_d, hidden units z_1, z_2, ..., z_m and a single output y; V is d x m (input-to-hidden weights), W is m x 1 (hidden-to-output weights)]

    $$\hat{y} = \sum_{j=1}^{m} w_j z_j, \qquad z_j = g(\mathbf{x}, \mathbf{v}_j)$$

    where
    $$g(\mathbf{x}, \mathbf{v}_i) = s\Bigl(v_{i0} + \sum_{k=1}^{d} x_k v_{ik}\Bigr) = s(\mathbf{x} \cdot \mathbf{v}_i)$$

    with sigmoid activation
    $$s(t) = \frac{1}{1+\exp(-t)} \quad \text{or} \quad s(t) = \tanh(t) = \frac{\exp(t)-\exp(-t)}{\exp(t)+\exp(-t)}$$

    How to estimate parameters (weights) via ERM?
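    A minimal sketch (names are mine) of this single-hidden-layer parameterization with the logistic sigmoid:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, V, w):
    """Single-hidden-layer MLP: z_j = s(v_j0 + x . v_j), y_hat = sum_j w_j z_j.
    V has shape (m, d+1) with the bias v_j0 in column 0; w has shape (m,)."""
    z = sigmoid(V[:, 0] + V[:, 1:] @ x)   # hidden-unit outputs z_1..z_m
    return w @ z                          # linear output unit

# Illustrative usage: d = 2 inputs, m = 3 hidden units, small random weights
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(3, 3))
w = rng.normal(scale=0.1, size=3)
print(mlp_forward(np.array([0.5, -1.0]), V, w))
```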


    Learning for a single neuron (delta rule):

    Forward pass: $\hat{y}(k) = \sum_{j=0}^{m} w_j(k)\, z_j(k)$, with $z_0 \equiv 1$

    Backward pass: $\delta(k) = \hat{y}(k) - y(k)$, $\quad w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k)$

    How to implement gradient-descent learning in a network of neurons?


    Backpropagation training

    Minimization of
    $$R_{emp}(\mathbf{W},\mathbf{V}) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(\mathbf{x}_i,\mathbf{W},\mathbf{V})\bigr)^2$$
    with respect to parameters (weights) W, V

    Gradient-descent optimization for the per-sample loss
    $$L\bigl(\mathbf{x}(k), y(k), \mathbf{V}, \mathbf{w}\bigr) = \tfrac{1}{2}\bigl(f(\mathbf{x},\mathbf{w},\mathbf{V}) - y\bigr)^2$$

    where samples are presented sequentially, $k = 1, \dots, n, \dots$ :
    $$\mathbf{V}(k+1) = \mathbf{V}(k) - \gamma_k\, \mathrm{grad}_{\mathbf{V}}\, L\bigl(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k)\bigr)$$
    $$\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \mathrm{grad}_{\mathbf{w}}\, L\bigl(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k)\bigr)$$

    Careful application of gradient descent leads to the backpropagation algorithm


    Backpropagation: forward pass

    for training input x(k), estimate the predicted output $\hat{y}(k)$


    Backpropagation: backward pass

    update the weights by propagating the error


    Details of backpropagation

    Sigmoid activation has a simple derivative:
    $$s(t) = \frac{1}{1+\exp(-t)}, \qquad s'(t) = s(t)\bigl(1 - s(t)\bigr)$$

    Poor behaviour for large t ~ saturation

    How to avoid saturation?
    - proper initialization (small weights)
    - pre-scaling of inputs (zero mean, unit variance)

    Learning rate schedule (initial, final)

    Stopping rules, number of epochs

    Number of hidden units
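    A minimal sketch of one backpropagation step under the parameterization above (single hidden sigmoid layer, linear output, per-sample loss L = 1/2 (y_hat - y)^2); names, shapes and the learning-rate value are assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_update(x, y, V, w, gamma):
    """One stochastic gradient step on L = 0.5 * (y_hat - y)^2 for the
    single-hidden-layer MLP y_hat = sum_j w_j * s(v_j0 + x . v_j)."""
    # forward pass
    x1 = np.concatenate(([1.0], x))        # prepend bias input
    z = sigmoid(V @ x1)                    # hidden outputs, shape (m,)
    y_hat = w @ z
    # backward pass
    delta = y_hat - y                                   # dL/dy_hat
    grad_w = delta * z                                  # dL/dw_j = delta * z_j
    grad_V = np.outer(delta * w * z * (1 - z), x1)      # uses s'(t) = s(t)(1 - s(t))
    return V - gamma * grad_V, w - gamma * grad_w

# Illustrative single update (small random initialization to avoid saturation)
rng = np.random.default_rng(0)
V, w = rng.normal(scale=0.1, size=(5, 3)), rng.normal(scale=0.1, size=5)
V, w = backprop_update(np.array([0.3, 0.7]), 1.0, V, w, gamma=0.1)
```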


    Additional enhancements

    The problem: convergence may be very slow for error functionals with different curvatures

    Solution: add a momentum term to smooth oscillations
    $$w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k) + \mu\, \Delta w_j(k)$$
    where $\Delta w_j(k) = w_j(k) - w_j(k-1)$ and $\mu$ is the momentum parameter
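    A minimal sketch of the momentum-smoothed update (the parameter values in the example are illustrative):

```python
def momentum_update(w, w_prev, gamma, delta, z, mu=0.9):
    """w(k+1) = w(k) - gamma * delta * z + mu * (w(k) - w(k-1))."""
    return w - gamma * delta * z + mu * (w - w_prev)

print(momentum_update(1.0, 0.9, gamma=0.1, delta=0.5, z=2.0))  # -> approximately 0.99
```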


    Regularization Effect of Backpropagation

    Backpropagation ~ iterative optimization

    Final model (weights) depends on:
    - initial point (initialization)
    - final point (stopping rules)

    Initialization and/or stopping rules can be used for model complexity control


    Various forms of complexity control

    MLP topology ~ number of hidden units

    Constraints on parameters (weights) ~ weight decay

    Type of optimization algorithm (many versions of backprop., other optimization methods)

    Stopping rules

    Initial conditions (initial small weights)

    Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed


    Example: univariate regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    MLP network (five hidden units) ~ near optimal

    [Figure: fitted MLP curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Example: univariate regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    MLP network (20 hidden units) ~ little overfitting

    [Figure: fitted MLP curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Backpropagation for classification

    Original MLP is for regression (as shown)

    For classification:
    - use a sigmoid output unit
    - during training, use real values 0/1 for class labels
    - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels (see the sketch below)

    [Figure: the same two-layer network as before; inputs x_1, ..., x_d, hidden units z_1, ..., z_m, output y, with V d x m and W m x 1]
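    A minimal sketch of the operation rule above: squash the trained MLP output through a sigmoid and threshold at 0.5 (function names are mine):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def classify(mlp_output, threshold=0.5):
    """Sigmoid output unit squashes the trained MLP's output to (0, 1);
    thresholding at 0.5 gives the predicted class label (0 or 1)."""
    return int(sigmoid(mlp_output) >= threshold)

print(classify(2.3), classify(-0.7))   # -> 1 0
```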


    Classification example (Ripley's data set)

    Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.

    MLP classifier (two hidden units) ~ underfitting

    [Figure: decision boundary of the MLP classifier over the two-class scatter plot]


    Classification Example

    MLP classifier (six hidden units) ~ some overfitting

    [Figure: decision boundary of the MLP classifier over the two-class scatter plot]


    MLP software

    MLP software widely available in public domain

    For example, the Netlab toolbox (in Matlab) at
    http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/

    Many commercial products (full of Neural Network marketing hype), e.g.:
    "Nearly 80% Accurate Market Forecasting Software. Get FREE up to date predictions and see for yourself!"


    NetTalk (Sejnowski and Rosenberg, 1987)

    One of the first successful applications of backpropagation:
    http://www.cnl.salk.edu/ParallelNetsPronounce/index.php

    Goal: learning to read (English text) aloud, i.e. learn the mapping: English text -> phonemes, using an MLP network

    Network inputs encode a 7-letter window (the 4th letter in the middle needs to be pronounced)

    Network outputs (26 units) encode phonemes that drive a speech synthesizer

    The MLP network is trained using labeled data (both individual words and unrestricted text)


    NetTalk architecture

    Input encoding: 7 x 29 = 203 units
    Hidden layer: 80 hidden units
    Output encoding: 26 units (phonemes)


    Listening to NetTalk-generated speech

    Listen to tape recordings illustrating NETtalk operation. These recordings

    are available (in MP3 format) from an article in Wikipedia at

    http://en.wikipedia.org/wiki/NETtalk_(artificial_neural_network)

    This article has a link to the audio examples of the neural network as it
    progresses through training. Specifically, it has three recordings containing 3
    different audio outputs of NETtalk:

    (a) during the first 5 minutes of training, starting with weights initialized to zero.

    (b) after training using the set of 10,000 words. This training set corresponds to 20

    passes (epochs) over 500-word text.

    (c) generated with new text input from transcription that was not part of the training

    set.

    After listening to these recordings, answer and comment on the following questions:
    - can you recognize words in the beginning of recording (a)? in the end of (a)?

    - compare the quality of outputs (b) and (c). Which one seems closer to human

    speech and why?


    NETtalk: question for discussion

    The NETtalk system uses a seven-letter window for text input. Try to justify this choice (of window size) based on the properties of natural English language. How would the performance of NETtalk change if a small window (of size 3 letters) or a large window (of size 21 letters) were used instead?


    Radial Basis Function (RBF) networks

    Dictionary parameterization:
    $$f_m(\mathbf{x}) = \sum_{j=1}^{m} w_j\, g\bigl(\mathbf{x}, \mathbf{v}_j\bigr) + w_0$$
    - each basis function is (usually) local, with center $\mathbf{v}_j$ and width $\sigma_j$
    - e.g., Gaussian:
    $$g\bigl(\mathbf{x}, \mathbf{v}_j\bigr) = \exp\Bigl(-\frac{\lVert \mathbf{x} - \mathbf{v}_j \rVert^2}{2\sigma_j^2}\Bigr)$$

    Typically used for regression or classification

    [Figure: the same two-layer network diagram; inputs x_1, ..., x_d, hidden units z_1, ..., z_m with z_j = g(x, v_j), output y = sum_j w_j z_j; V is d x m, W is m x 1]
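    A minimal sketch (names are mine) of prediction with the Gaussian RBF parameterization above:

```python
import numpy as np

def rbf_predict(x, centers, widths, w, w0=0.0):
    """f(x) = w0 + sum_j w_j * exp(-||x - v_j||^2 / (2 * sigma_j^2))."""
    d2 = np.sum((centers - x) ** 2, axis=1)        # squared distances to centers
    z = np.exp(-d2 / (2.0 * widths ** 2))          # local Gaussian basis functions
    return w0 + w @ z

# Illustrative usage: three one-dimensional Gaussian basis functions
centers = np.array([[0.2], [0.5], [0.8]])
print(rbf_predict(np.array([0.4]), centers,
                  widths=np.array([0.1, 0.1, 0.1]), w=np.array([1.0, 2.0, 0.5])))
```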


    RBF network training

    RBF training (learning) ~ estimation of

    (1) RBF parameters (centers, widths)
    (2) linear weights w's

    Non-adaptive implementation:
    (1) Estimate RBF parameters via unsupervised learning (only x-values of training data); can use SOM, GLA etc.
    (2) Estimate weights w via linear least squares

    Advantages:
    - fast training
    - useful when x-samples are plentiful, but (x,y) data are few

    Limitations: cannot discard irrelevant inputs -> the curse of dimensionality


    Non-adaptive RBF training algorithm

    1. Choose the number of basis functions (centers) m.

    2. Estimate centers $\mathbf{v}_j$ using x-values of training data via unsupervised learning (SOM, GLA, clustering etc.)

    3. Determine width parameters using the heuristic: for a given center $\mathbf{v}_j$,
    (a) find the distance to the closest center:
    $$r_j = \min_{k \ne j} \lVert \mathbf{v}_k - \mathbf{v}_j \rVert$$
    (b) set the width parameter
    $$\sigma_j = \lambda\, r_j$$
    where parameter $\lambda$ controls the degree of overlap between adjacent basis functions. Typically $1 \le \lambda \le 3$.

    4. Estimate weights w via linear least squares (minimization of the empirical risk).

    (A code sketch of these steps follows below.)
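    A minimal sketch of steps 1-4 above, using k-means (batch GLA) for the centers, the nearest-center width heuristic, and linear least squares; the choice of k-means, lambda = 2 and all names are assumptions:

```python
import numpy as np

def train_rbf(X, y, m, lam=2.0, n_iter=100, seed=0):
    """Non-adaptive RBF training: (1-2) centers via k-means on x-values only,
    (3) widths from the nearest-center heuristic, (4) weights via least squares.
    X: array of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    for _ in range(n_iter):                                   # simple batch GLA / k-means
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    dists = np.sqrt(((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    widths = lam * dists.min(axis=1)                          # sigma_j = lambda * r_j
    Z = np.exp(-((X[:, None, :] - centers) ** 2).sum(-1) / (2 * widths ** 2))
    Z = np.hstack([np.ones((len(X), 1)), Z])                  # bias (w0) column
    w = np.linalg.lstsq(Z, y, rcond=None)[0]                  # linear least squares
    return centers, widths, w

# Illustrative usage echoing the slides' sine-squared example (5 RBFs)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, 30)
print(train_rbf(X, y, m=5))
```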


    RBF network complexity control

    RBF model complexity can be controlled by:

    The number of RBFs:
    Goal: select the optimal number of units (RBFs)

    RBF width:
    Goal: select the optimal width parameter (for a large number of RBFs)

    Penalization of large weights w's

    See toy examples next (using the number of units as the complexity parameter)


    Example: RBF regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    RBF network with automatic width selection, 2 RBFs ~ underfitting

    [Figure: fitted RBF curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Example: RBF regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    RBF network with automatic width selection, 5 RBFs ~ optimal

    [Figure: fitted RBF curve vs. training samples; X on [0, 1], Y on [-0.2, 1.2]]


    Example: RBF regression

    Data set: 30 samples generated using sine-squared target function with Gaussian noise (st. deviation 0.1).

    RBF network with automatic width selection, 20 RBFs ~ overfitting

    [Figure: fitted RBF curve vs. training samples; X on [0, 1], Y on [-0.8, 1.2]]


    RBF classification example (Ripley's data)

    Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.

    RBF classifier (4 units) ~ little underfitting

    [Figure: decision boundary of the RBF classifier over the two-class scatter plot]


    Overview

    Recall from Lecture Set 2:

    unsupervised learning

    data reduction approach

    Example: training data represented by 3 centers

    [Figure: training data in 2D with 3 centers]


    2. Dimensionality reduction: linear vs nonlinear

    Note: the goal is to estimate a mapping from d-dimensional input space (d = 2) to a low-dimensional feature space (m = 1), minimizing the risk
    $$R(\omega) = \int \lVert \mathbf{x} - f(\mathbf{x}, \omega)\rVert^2\, p(\mathbf{x})\, d\mathbf{x}$$

    [Figure: 2D data (x1, x2) with a linear and a nonlinear 1D projection]


    Unsupervised Learning: Formalization

    Unsupervised learning ~ mapping from the input space (x) to some model space

    For VQ/clustering: a model is a set of centers (cluster centers)

    For dimensionality reduction: a model is a low-dimensional space

    Note 1: the two types of problems can be combined

    Note 2: unsupervised learning requires estimation of two mappings x -> z -> x*, i.e.
    z = F(x) and x* = G(z)


    Generalized Lloyd Algorithm (GLA) for VQ

    Given data points $\mathbf{x}(k)$, $k = 1, 2, \dots$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$

    Perform the following updates upon presentation of $\mathbf{x}(k)$:

    1. Find the nearest center to the data point (the winning unit):
    $$j = \arg\min_i \lVert \mathbf{x}(k) - \mathbf{c}_i(k) \rVert$$

    2. Update the winning unit coordinates (only) via
    $$\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k \bigl(\mathbf{x}(k) - \mathbf{c}_j(k)\bigr)$$

    Increment k and iterate steps (1)-(2) above.

    Note:
    - the learning rate $\gamma_k$ decreases with iteration number k
    - biological interpretations of steps (1)-(2) exist


    Batch version of GLA

    Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$

    Iterate the following two steps:

    1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule. Partitioning matrix Q:
    $$q_{ij} = \begin{cases} 1 & \text{if } L\bigl(\mathbf{x}_i, \mathbf{c}_j(k)\bigr) = \min_l L\bigl(\mathbf{x}_i, \mathbf{c}_l(k)\bigr) \\ 0 & \text{otherwise} \end{cases}$$

    2. Update unit coordinates as centroids of the data:
    $$\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij}\, \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \qquad j = 1, \dots, m$$

    Note: the final solution may depend on initialization (local minima), a potential problem for both on-line and batch GLA
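    A minimal sketch of batch GLA with squared loss (names are mine); with this loss it coincides with k-means:

```python
import numpy as np

def batch_gla(X, centers, n_iter=100):
    """Batch GLA: alternate (1) nearest-center partitioning and
    (2) centroid updates, using squared-error loss. X has shape (n, d)."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # step 1: partition via the nearest neighbor rule (partition matrix implicit in labels)
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 2: update each unit as the centroid of the samples assigned to it
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):      # stop when the solution stabilizes
            break
        centers = new_centers
    return centers, labels

# Illustrative usage on two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
centers, labels = batch_gla(X, X[rng.choice(len(X), 3)])
print(centers)
```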


    Statistical Interpretation of GLA

    Iterate the following two steps:

    1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule (partitioning matrix Q, as on the previous slide)
    ~ Projection of the data onto the model space (units), F(x)

    2. Update unit coordinates as centroids of the data
    ~ Conditional expectation (averaging, smoothing) G(z), conditional upon the results of partitioning step (1)


    Numeric Example of univariate VQ

    Given data: {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2
    Initialization (random): c1 = 3, c2 = 4

    Iteration 1
    Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}
    Expectation (averaging): c1 = 2.5, c2 = 16

    Iteration 2
    Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}
    Expectation (averaging): c1 = 3, c2 = 18

    Iteration 3
    Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}
    Expectation (averaging): c1 = 4.75, c2 = 19.6

    Iteration 4
    Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}
    Expectation (averaging): c1 = 7, c2 = 25

    Stop as the algorithm has stabilized with these values
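    For reference, a tiny self-contained sketch that reproduces these iterations (assignment ties, if any, go to the first center, which is an assumption):

```python
def univariate_vq(data, c1, c2, n_iter=10):
    """Batch GLA for the 1-D example: alternate projection (nearest center)
    and expectation (averaging) until the centers stabilize."""
    for _ in range(n_iter):
        p1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        p2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        new_c1, new_c2 = sum(p1) / len(p1), sum(p2) / len(p2)
        if (new_c1, new_c2) == (c1, c2):
            break
        c1, c2 = new_c1, new_c2
    return c1, c2

print(univariate_vq([2, 4, 10, 12, 3, 20, 30, 11, 25], 3, 4))   # -> (7.0, 25.0)
```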


    GLA Example 2

    Modeling doughnut distribution using 20 units:

    7 units were never moved by the GLA

    the problem of unused units (dead units)


    Avoiding local minima with GLA

    Starting with many random initializations, and then choosing the best GLA solution

    Conscience mechanism: forcing dead units to participate in competition, by keeping the frequency count (of past winnings) for each unit, i.e. for the on-line version of GLA in Step 1:
    $$j = \arg\min_i \; freq_i(k)\, \lVert \mathbf{x}(k) - \mathbf{c}_i(k) \rVert$$

    Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data.


    Clustering methods

    Clustering: separating a data set into several groups (clusters) according to some measure of similarity

    Goals of clustering:
    - interpretation (of resulting clusters)
    - exploratory data analysis
    - preprocessing for supervised learning
    - often the goal is not formally stated

    VQ-style methods (GLA) often used for clustering, i.e. k-means or c-means

    Many other clustering methods as well


    Clustering (cont'd)

    Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
    - similarity ~ distance metric dist(i, j)
    - usually k given a priori (but not always!)

    Intuitive motivation:
    - similar objects into one cluster
    - dissimilar objects into different clusters
    - the goal is not formally stated

    Similarity (distance) measure is critical but usually hard to define (objectively). Distance needs to be defined for different types of input variables.


    Applications of clustering

    Marketing: explore customer data to identify buying patterns for targeted marketing (Amazon.com)

    Economic data: identify similarity between different countries, states, regions, companies, mutual funds etc.

    Web data: cluster web pages or web users to discover groups of similar access patterns

    Etc., etc.


    Clustering Methods

    Many different approaches developed in:
    - neural networks
    - mathematics (graph theory, linear algebra)
    - pattern recognition
    - data mining etc.

    Example graph-theoretic approach: Minimum Spanning Tree (MST) clustering

    Types of clustering methods: hierarchical, partitional, fuzzy clustering.


    K-means clustering (~ GLA)

    This is a representative partitional clustering method.

    Given a data set of n samples $\mathbf{x}_i$ and the value of k:

    Step 0: (arbitrarily) initialize cluster centers
    Step 1: assign each data point (object) to the cluster with the closest cluster center
    Step 2: calculate the mean (centroid) of data points in each cluster as the estimated cluster centers

    Iterate steps 1 and 2 until the cluster membership is stabilized


    The K-Means Clustering Method: Example

    [Figure: sequence of scatter plots illustrating K-means with K = 2 -- arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign]


    Clustering of High-Dimensional Data

    Additional Challenges

    - many clustering methods rely on intuition for low-dimensional data
    - visualization is possible only in 2D or 3D.

    Multidimensional scaling (MDS) aims to produce a 2D representation of the inter-point distances between high-dimensional samples.

    MDS finds a set of points $Z = (\mathbf{z}_1, \dots, \mathbf{z}_n)$ in 2D space which minimizes the stress function
    $$S(\mathbf{z}_1, \dots, \mathbf{z}_n) = \sum_{i < j} \bigl(\delta_{ij} - \lVert \mathbf{z}_i - \mathbf{z}_j \rVert\bigr)^2$$
    where $\delta_{ij}$ are the inter-point distances between the high-dimensional samples


    Self-Organizing Maps

    History and biological motivation

    The brain changes its internal structure to reflect life experiences; interaction with the environment is critical at early stages of brain development (first 1-2 years of life)

    Existence of various regions (maps) in the brain

    How may these maps be formed? I.e., an information-processing model leading to map formation

    T. Kohonen (early 1980s) proposed SOM


    Goal of SOM

    Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (map)

    Feature space (Z-space) is 1D or 2D and is discretized as a number of units, i.e., a 10x10 map

    Z-space has a distance metric -> ordering of units

    Similarities and differences between VQ and SOM

    [Figure: mappings between X-space and Z-space, Z = G(X) and X = F(Z)]


    Self-Organizing Map

    Discretization of 2D space via a 10x10 map. In this discrete space, distance relations exist between all pairs of units. Distance relation ~ map topology

    [Figure: units in 2D feature space]


    SOM Algorithm (flow through)

    Given data points $\mathbf{x}(k)$, $k = 1, 2, \dots$, a distance metric in the input space (~ Euclidean), map topology (in z-space), and initial positions of units $\mathbf{c}_j(0)$, $j = 1, \dots, m$ (in x-space)

    Perform the following updates upon presentation of $\mathbf{x}(k)$:

    1. Find the nearest center to the data point (the winning unit):
    $$\mathbf{z}^*(k) = \arg\min_i \lVert \mathbf{x}(k) - \mathbf{c}_i(k-1) \rVert$$

    2. Update all units around the winning unit via
    $$\mathbf{c}_j(k) = \mathbf{c}_j(k-1) + \gamma_k\, K_k\bigl(\mathbf{z}_j, \mathbf{z}^*(k)\bigr)\bigl(\mathbf{x}(k) - \mathbf{c}_j(k-1)\bigr)$$

    Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1)-(2) above
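    A minimal sketch of the flow-through SOM above; the linear decay schedules, the floor on the neighborhood width, the random presentation order and all names are assumptions:

```python
import numpy as np

def som_online(X, map_coords, k_max=1000, gamma0=0.1, width0=1.0, seed=0):
    """Flow-through SOM: find the winning unit in x-space, then move every unit
    towards x(k) with a Gaussian neighborhood weight defined in z-space.
    map_coords: array (m, map_dim) of fixed unit coordinates in z-space."""
    rng = np.random.default_rng(seed)
    m = len(map_coords)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    for k in range(k_max):
        x = X[rng.integers(len(X))]                     # present a data point x(k)
        gamma = gamma0 * (1 - k / k_max)                # decreasing learning rate
        width = width0 * (1 - k / k_max) + 0.05         # decreasing neighborhood width
        winner = np.argmin(((x - centers) ** 2).sum(-1))
        K = np.exp(-((map_coords - map_coords[winner]) ** 2).sum(-1) / (2 * width ** 2))
        centers += gamma * K[:, None] * (x - centers)   # update all units
    return centers

# Illustrative usage: a 1D map with 5 units fitted to 2D data
data = np.random.default_rng(1).normal(size=(200, 2))
print(som_online(data, map_coords=np.arange(5.0)[:, None]))
```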


    SOM example (1st iteration)

    [Figure: Step 1 (winner selection) and Step 2 (neighborhood update) on the example data]


    SOM example (next iteration)

    [Figure: Step 1 (winner selection) and Step 2 (neighborhood update); final map]


    Hyper-parameters of SOM

    SOM performance depends on parameters (~ user-defined):

    Map dimension and topology (usually 1D or 2D)

    Number of SOM units ~ quantization level (of z-space)

    Neighborhood function ~ rectangular or Gaussian (not important)

    Neighborhood width decrease schedule (important), i.e. exponential decrease for the Gaussian neighborhood
    $$K_k(\mathbf{z}, \mathbf{z}') = \exp\Bigl(-\frac{\lVert \mathbf{z} - \mathbf{z}' \rVert^2}{2\sigma^2(k)}\Bigr), \qquad \sigma(k) = \sigma_{initial}\Bigl(\frac{\sigma_{final}}{\sigma_{initial}}\Bigr)^{k/k_{max}}$$
    with user-defined $\sigma_{initial}$, $\sigma_{final}$, $k_{max}$. Also linear decrease of neighborhood width.

    Learning rate schedule (important), e.g. also linear decrease

    Note: learning rate and neighborhood decrease schedules should be set jointly


    Modeling uniform distribution via SOM

    (a) 300 random samples, (b) 10x10 map

    [Figure: samples from the uniform distribution on the unit square and the fitted 10x10 SOM]

    SOM neighborhood: Gaussian
    Learning rate: linear decrease, $\gamma(k) = 0.1\,(1 - k/k_{max})$

    Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations


    [Figure: positions of the SOM units at stages (a)-(d) listed above]


    Batch SOM (similar to batch GLA)

    Given data points $\mathbf{x}_i$, $i = 1, \dots, n$, loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \dots, m$

    Iterate the following steps:

    1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest neighbor rule. Partitioning matrix Q:
    $$q_{ij} = \begin{cases} 1 & \text{if } L\bigl(\mathbf{x}_i, \mathbf{c}_j(k)\bigr) = \min_l L\bigl(\mathbf{x}_i, \mathbf{c}_l(k)\bigr) \\ 0 & \text{otherwise} \end{cases}$$

    2. Update unit coordinates as a weighted average of all samples:
    $$\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} K\bigl(\mathbf{z}_j, \mathbf{z}(i)\bigr)\, \mathbf{x}_i}{\sum_{i=1}^{n} K\bigl(\mathbf{z}_j, \mathbf{z}(i)\bigr)}$$
    where $K\bigl(\mathbf{z}_j, \mathbf{z}(i)\bigr)$ is the weight of sample $\mathbf{x}_i$ and $\mathbf{z}(i)$ is the map coordinate of the unit to which $\mathbf{x}_i$ is assigned

    3. Decrease the neighborhood width

    Iterate (repeat) the above steps up to the max number of iterations Kmax
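    A minimal sketch of batch SOM for a 1D map with a Gaussian neighborhood and linearly decreasing width (names and schedule details are assumptions):

```python
import numpy as np

def batch_som_1d(X, m, k_max=50, width_init=1.0, width_final=0.05, seed=0):
    """Batch SOM with a linear (1D) map of m units: nearest-unit partition,
    then Gaussian-neighborhood weighted averaging; width shrinks linearly."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    z = np.arange(m, dtype=float)                       # unit coordinates in z-space
    for k in range(k_max):
        width = width_init + (width_final - width_init) * k / (k_max - 1)
        winners = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # K(z_j, z(i)) for every unit j and every sample i
        K = np.exp(-(z[:, None] - z[winners][None, :]) ** 2 / (2 * width ** 2))
        centers = (K @ X) / K.sum(axis=1, keepdims=True)
    return centers

# Illustrative usage: 5-unit 1D map fitted to points on a ring (doughnut-like data)
ring = np.random.default_rng(1).normal(size=(300, 2))
ring /= np.linalg.norm(ring, axis=1, keepdims=True)
print(batch_som_1d(ring, m=5))
```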


    SOM Example 1

    Modeling doughnut distribution using batch SOM:
    - linear (1D) SOM topology with 5 units
    - initial neighborhood = 1, final neighborhood = 0.05
    - number of iterations Kmax = 50

    Note: no unused units

    [Figure: 1D SOM fitted to the doughnut-shaped data; axes X1, X2]


    SOM Example 2

    Modeling the same doughnut distribution using:
    - square grid (2D) SOM topology with 5x5 = 25 units
    - initial neighborhood = 1, final neighborhood = 0.05
    - number of iterations Kmax = 50

    Note: the final model is not sensitive to poor initialization

    [Figure: 2D SOM grid fitted to the doughnut-shaped data; axes X1, X2]


    SOM Applications and Variations

    Main web site: Helsinki University of Technology (HUT)
    http://www.cis.hut.fi/research/som-research/

    Numerous applications:
    - Marketing surveys/segmentation
    - Financial/stock market data
    - Text data/document map - WEBSOM
    - Image data/picture map - PicSOM
    see the HUT web site


    Practical Issues for SOM

    Pre-scaling of inputs, usually to [0, 1] range. Why?

    Map topology: usually 1D or 2D

    Number of map units (per dimension)

    Learning rate schedule (for the on-line version)

    Neighborhood type and schedule: initial size (~1), final size
    Final neighborhood size + number of units determine model complexity.


    Modeling US states using 1D SOM

    Purpose: clustering of US states

    Data encoding: each state described by 5 socio-economic indicators: obesity index, result of 2004 presidential elections, median income, mean NAEP, IQ score

    Data scaling: each input scaled independently to [0, 1] range

    SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05


    State Obesity index Election_04 Median Income Mean NAEP IQ score

    Hawaii 17 0 49775 238 94

    Colorado 17 1 49617 252 104

    Connecticut 18 0 53325 255 99

    Massachusetts 18 0 50587 257 111

    New Hampshire 18 1 53549 257 102
    Utah 18 1 48537 250 89

    California 19 0 48113 238 94

    Maryland 19 0 55912 248 95

    New Jersey 19 0 53266 253 103

    Rhode Island 19 0 44311 245 89

    Vermont 19 0 41929 256 102

    Florida 19 1 38533 245 87

    Montana 19 1 33900 254 100

    Oregon 20 0 42704 250 100

    Arizona 20 1 41554 241 92

    Idaho 20 1 38613 249 96

    New Mexico 20 0 35251 235 85

    Wyoming 20 1 40499 253 102

    Maine 21 0 37654 253 99

    New York 21 0 42432 251 90

    Washington 21 0 44252 251 92

    South Dakota 21 1 38755 254 100
    Delaware 22 0 50878 250 90

    Illinois 22 0 45906 248 93

    Minnesota 22 0 54931 256 113

    Wisconsin 22 0 46351 252 105

    Nevada 22 1 46289 239 92

    Alaska 23 1 55412 245 92


    Iowa 23 0 41827 253 109

    Kansas 23 1 42523 253 101

    Missouri 23 1 43955 251 92
    Nebraska 23 1 43566 251 101

    North Dakota 23 1 36717 254 111

    Ohio 23 1 43332 252 107

    Oklahoma 23 1 35500 244 98

    Pennsylvania 24 0 43577 249 99

    Arkansas 24 1 32423 242 98

    Georgia 24 1 43316 243 93

    Indiana 24 1 41581 251 105

    North Carolina 24 1 38432 252 106

    Virginia 24 1 49974 253 99

    Michigan 25 0 45335 249 99

    Kentucky 25 1 37893 247 94

    Tennessee 25 1 36329 241 90

    Alabama 26 1 36771 236 90

    Louisiana 26 1 33312 238 99

    South Carolina 26 1 38460 246 87

    Texas 26 1 40659 247 98
    Mississippi 27 1 32447 236 90

    West Virginia 28 1 30072 245 92


    SOM Modeling 1 of US states

    Unit  States (assigned to each unit)

    1 HI, CA, MD, RI, NM,

    2 OR, ME, NY, WA, DE, IL, PA, MI,

    3 CT, MA, NJ, VT, MN, WI,

    4

    5 CO, NH, MT, WY, SD,

    6 KS, NE, ND, OH, IN, NC, VA,

    7 UT, ID, AK, IA, MO,

    8 FL, AZ, NV, OK, GA, KY, TX

    9 AR, TN, AL, LA, SC, MS, WV


    SOM Modeling 2 of US states


    - remove voting input and apply 1D SOM:

    Unit  States

    1 CO, CT, MA, NH, NJ, MN,

    2 WI, IA, ND, OH, IN, NC,

    3 VT, MT, OR, ID, WY, ME, SD,

    4 KS, MO, NE, PA, VA, MI,

    5 UT, MD, NY, WA, DE, IL, AK,

    6 HI, CA , RI,

    7 FL, AZ, NM, NV,

    8 OK, GA, KY, SC, TX,

    9 AR, TN, AL, LA, MS, WV

    SOM Modeling 2 of US states (contd)


    - remove voting input and apply 1D SOM:


    Tree-structured SOM

    Fixed SOM topology gives poor modeling of structured distributions:


    Minimum Spanning Tree SOM

    Define SOM topology adaptively during each iteration of the SOM algorithm

    Minimum Spanning Tree (MST) topology ~ according to distance between units (in the input space)

    Topological distance ~ number of hops in the MST

    [Figure: example MST over units, with topological distances 1, 2, 3]


    Example of using MST SOM

    Modeling cross distribution

    MST topology vs fixed 2D grid map

    Application: skeletonization of images


    Singh et al., "Self-organizing maps for the skeletonization of sparse shapes," IEEE Trans. Neural Networks, vol. 11, no. 1, Jan 2000

    Skeletonization of noisy images

    Application of MST SOM: robustness with respect to noise

    Clustering of European Languages


    Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon

    Difficulty of the problem: due to the evolving nature of human languages and globalization.

    Hypothesis: similarity based on analysis of a small stable word set.

    See glottochronology, Swadesh list, at
    http://en.wikipedia.org/wiki/Glottochronology

    SOM for clustering European languages


    Modeling approach: language ~ 10-word set.

    Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure.

    Issues:
    - selection of a stable word set
    - data encoding + distance metric

    Stable word set: numbers 1 to 10

    Data encoding: Latin (English) alphabet, use the 3 first letters (of each word)

    Numbers word set in 18 European languages


    Each language is a feature vector encoding 10 words

    Languages (columns): English, Norwegian, Polish, Czech, Slovakian, Flemish, Croatian, Portuguese, French, Spanish, Italian, Swedish, Danish, Finnish, Estonian, Dutch, German, Hungarian
    one en jeden jeden jeden ien jedan um un uno uno en en yksi uks een erins egy

    two to dwa dva dva twie dva dois deux dos due tva to kaksi kaks twee zwei ketto

    three tre trzy tri tri drie tri tres trois tres tre tre tre kolme kolme drie drie harom

    four fire cztery ctyri styri viere cetiri quarto quatre cuatro quattro fyra fire nelja neli vier vier negy

    five fem piec pet pat vuvve pet cinco cinq cinco cinque fem fem viisi viis vijf funf ot

    six seks szesc sest sest zesse sest seis six seis sei sex seks kuusi kuus zes sechs hat
    seven sju sediem sedm sedem zevne sedam sete sept siete sette sju syv seitseman seitse zeven sieben het

    eight atte osiem osm osem achte osam oito huit ocho otto atta otte kahdeksan kaheksa acht acht nyolc

    nine ni dziewiec devet devat negne devet nove neuf nueve nove nio ni yhdeksan uheksa negen neun kilenc

    ten ti dziesiec deset desat tiene deset dez dix dies dieci tio ti kymmenen kumme tien zehn tiz

    Data Encoding


    Word ~ feature vector encoding the 3 first letters

    Alphabet ~ 26 letters + 1 symbol BLANK

    Vector encoding: for example, ONE : O~15, N~14, E~05

    ALPHABET  INDEX
    BLANK     00
    A         01
    B         02
    C         03
    D         04
    ...
    X         24
    Y         25
    Z         26

    Word Encoding (contd)


    Word ~ 27-dimensional feature vector

    Encoding is insensitive to order (of the 3 letters)

    Encoding of the 10-word set: concatenate feature vectors of all words: one + two + ... + ten
    -> word set encoded as a vector of dim. [1 x 270]

    Example: "one" -> indices 15, 14, 05
    index:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
    vector: 0 0 0 0 0 1 0 0 0 0 0  0  0  0  1  1  0  0  0  0  0  0  0  0  0  0  0
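    A minimal sketch of this encoding (names are mine; padding short words with BLANK is an assumption):

```python
import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz"   # index 0 = BLANK, a = 1, ..., z = 26

def encode_word(word):
    """27-dim indicator vector of the first 3 letters (order-insensitive)."""
    v = np.zeros(27)
    for ch in word.lower()[:3].ljust(3):   # pad with BLANK if the word is short
        v[ALPHABET.index(ch)] = 1.0
    return v

def encode_language(ten_words):
    """Concatenate the 10 word vectors into a single 1 x 270 feature vector."""
    return np.concatenate([encode_word(w) for w in ten_words])

print(np.nonzero(encode_word("one"))[0])   # -> [ 5 14 15], i.e. e, n, o
```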


    SOM Modeling Approach

    2-Dimensional SOM (Batch Algorithm)
    Number of knots per dimension = 4
    Initial neighborhood = 1, final neighborhood = 0.15
    Total number of iterations = 70


    OUTLINE

    Objectives

    Brief history and motivation for artificial neural networks

    Sequential estimation of model parameters

    Methods for supervised learning

    Methods for unsupervised learning

    Summary and discussion


    Summary and Discussion

    Neural Network methods (vs statistical approaches):
    - new techniques/new insights
    - simple (brute-force) computational approaches
    - biological motivation

    The same fundamental issues: small-sample problems, curse of dimensionality, non-linear optimization, complexity control

    Neural network methods implement risk minimization (predictive learning setting)

    Hype and controversy