Business Intelligence & Data Mining-10


  • Slide 1/39

    Artificial Neural Networks (ANNs)

  • Slide 2/39

    Background and Motivation

    Computers and the Brain: A Contrast

    - Arithmetic: 1 brain = 1/10 pocket calculator
    - Vision: 1 brain = 1,000 supercomputers
    - Memory of arbitrary details: the computer wins
    - Memory of real-world (related) facts: the brain wins
    - A computer must be programmed explicitly
    - The brain can learn by experiencing the world

  • Slide 3/39

    What are Artificial Neural Networks?

    A system loosely modeled on the human brain that simulates multiple
    layers of simple processing elements called neurons.

    There are various classes of ANN models, depending on:

    - Problem type: prediction, classification, clustering
    - Design / structure of the model
    - Model-building algorithm

  • Slide 4/39

    A bit of biology . . .

    The most important functional unit in the human brain is a class of
    cells called NEURONS.

    - Dendrites: receive information
    - Cell Body: processes information
    - Axon: carries processed information to other neurons
    - Synapse: junction between an axon end and the dendrites of other
      neurons

    [Figure: an actual neuron and its schematic, with the Dendrites,
    Cell Body, Axon, and Synapse labeled]

  • Slide 5/39

    Artificial Neurons

    - Simulated in hardware or by software
    - Input connections: the receivers
    - Node, unit, Processing Unit or Processing Element: simulates the
      neuron body
    - Output connection and transfer function: the transmitter (axon)
    - Activation function: employs a threshold or bias
    - Connection weights: act as synaptic junctions

    Learning (for a particular model) occurs via changes in the values
    of the various weights and the choice of functions.

  • Slide 6/39

    An Artificial Neuron: Structure

    - Receives inputs X1, X2, ..., Xp from other neurons or the
      environment
    - Inputs are fed in through connections with weights w1, w2, ..., wp
    - Total input = weighted sum of the inputs from all sources:
      I = w1*X1 + w2*X2 + ... + wp*Xp
    - The transfer and activation functions convert the total input to
      the output: V = f(I)
    - The output goes to other neurons or the environment

    [Figure: an artificial neuron with inputs X1 ... Xp, weights
    w1 ... wp, total input I, and output V = f(I); its parts play the
    roles of Dendrites, Cell Body, and Axon, with information flowing
    from input to output]
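    As a minimal sketch of this computation in Python (the logistic
    transfer function introduced on a later slide is assumed, and the
    function name neuron_output is ours):

    import math

    def neuron_output(x, w):
        """Weighted sum of inputs followed by a logistic activation.

        x : inputs X1..Xp
        w : connection weights w1..wp
        """
        I = sum(wi * xi for wi, xi in zip(w, x))  # I = w1*X1 + ... + wp*Xp
        return 1.0 / (1.0 + math.exp(-I))         # V = f(I), logistic

    # Example: a neuron with three inputs
    print(neuron_output([1.0, -1.0, 2.0], [0.5, -0.1, -0.2]))  # f(0.2) ~ 0.55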

  • Slide 7/39

    The behaviour of an artificial neural network for any particular
    input depends upon:

    - the structure of each node (threshold / activation / transfer
      function)
    - the structure of the network (architecture)
    - the weights on each of the connections

    .... and these must be learned!

  • Slide 8/39

    Mathematical Model of a Node

    Activity in an ANN Node

    [Figure: incoming activations a0 ... ai ... an arrive over
    connections with weights w0 ... wi ... wn; an adder function
    combines them, and an activation function produces the outgoing
    activation]

  • Slide 9/39

  • Slide 10/39

    Activity in an ANN Node

    [Figure: the same node diagram, now with a threshold activation
    function: f(I) = 1 if I >= t, and 0 otherwise]

  • Slide 11/39

    Transfer Functions

    There are various choices for the transfer / activation function:

    Tanh:      f(x) = (e^x - e^-x) / (e^x + e^-x), with outputs ranging
               from -1 to 1

    Logistic:  f(x) = e^x / (1 + e^x), with outputs ranging from 0 to 1

    Threshold: f(x) = 0 if x < 0; 1 if x >= 0

    [Figure: plots of the tanh, logistic, and threshold functions]
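    A sketch of these three functions in Python (the function names are
    ours):

    import math

    def tanh_f(x):
        # f(x) = (e^x - e^-x) / (e^x + e^-x); output in (-1, 1)
        return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

    def logistic(x):
        # f(x) = e^x / (1 + e^x); output in (0, 1)
        return math.exp(x) / (1.0 + math.exp(x))

    def threshold(x):
        # f(x) = 0 if x < 0, else 1
        return 0.0 if x < 0 else 1.0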

  • Slide 12/39

    Artificial Neural Network Architecture

    The most common neural network architecture has three layers:

    - 1st layer: the input layer
    - 2nd layer: the hidden layer
    - 3rd layer: the output layer

  • Slide 13/39

    More Complex ANN Architectures

    The number of hidden layers can be: none, one, or more.

  • Slide 14/39

    Feed-forward Artificial Neural Network Architecture

    A collection of neurons forms a Layer.

    - Input Layer: each neuron gets ONLY one input, directly from
      outside
    - Hidden Layer: connects the Input and Output layers
    - Output Layer: the output of each neuron goes directly to the
      outside

    [Figure: a network with inputs X1 X2 X3 X4, one hidden layer, and
    outputs y1 y2; information flows from the input layer to the
    output layer]

  • Slide 15/39

    Feed-forward Artificial Neural Network Architecture: a few
    restrictions

    - Within a layer, neurons are NOT connected to each other.
    - A neuron in one layer is connected ONLY to neurons in the NEXT
      layer (feed-forward).
    - Jumping over a layer is NOT allowed.

    [Figure: the same network with inputs X1 X2 X3 X4 and outputs
    y1 y2, illustrating the allowed connections]

  • Slide 16/39

    An ANN Model

    What do we mean by "a particular model"?

    Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)

    For an ANN, the algebraic form of f(.) is too complicated to write
    down. However, it is characterized by:

    - # Input Neurons
    - # Hidden Layers, and # Neurons in each Hidden Layer
    - # Output Neurons
    - The adder, activation, and transfer functions
    - WEIGHTS for all the connections

    Fitting an ANN model = specifying values for all of these
    parameters.

  • Slide 17/39

    ANN Model: an Example

    Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)

    # Input Neurons = # of Xs and # Output Neurons = # of Ys are
    decided by the structure of the problem; the rest are free
    parameters.

    Parameter            Example
    # Input Neurons      3
    # Hidden Layers      1
    Hidden Layer Size    2
    # Output Neurons     1
    Weights              Learnt

    [Figure: the example network, with its eight connection weights
    0.5, 0.6, -0.1, 0.1, -0.2, 0.7, 0.1, -0.2 marked on the arcs]

  • Slide 18/39

    Prediction using a particular ANN Model

    Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)

    Feed in X1 = 1, X2 = -1, X3 = 2:

    - Hidden neuron 1: total input 0.2 = 0.5*1 + (-0.1)*(-1) + (-0.2)*2;
      output f(0.2) = 0.55
    - Hidden neuron 2: total input 0.9; output f(0.9) = 0.71
    - Output neuron: total input -0.087 = 0.1*0.55 + (-0.2)*0.71;
      output f(-0.087) = 0.478

    Predicted Y = 0.478. Suppose the actual Y = 2. Then the

    Prediction Error = (2 - 0.478) = 1.522

    Here f is the logistic function:
    f(x) = e^x / (1 + e^x), so f(0.2) = e^0.2 / (1 + e^0.2) = 0.55
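    The same forward pass as a short Python sketch (the grouping of the
    eight weights among the connections is inferred from the arithmetic
    above):

    import math

    def f(x):                        # logistic transfer function
        return math.exp(x) / (1.0 + math.exp(x))

    w_hidden1 = [0.5, -0.1, -0.2]    # weights into hidden neuron 1
    w_hidden2 = [0.1, 0.6, 0.7]      # weights into hidden neuron 2 (inferred)
    w_output  = [0.1, -0.2]          # hidden outputs -> output neuron

    X = [1.0, -1.0, 2.0]             # X1 = 1, X2 = -1, X3 = 2

    h1 = f(sum(w * x for w, x in zip(w_hidden1, X)))  # f(0.2) ~ 0.55
    h2 = f(sum(w * x for w, x in zip(w_hidden2, X)))  # f(0.9) ~ 0.71
    y  = f(w_output[0] * h1 + w_output[1] * h2)       # f(-0.087) ~ 0.478

    print(round(y, 3))               # 0.478
    print(round(2 - y, 3))           # prediction error ~ 1.522 when actual Y = 2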

  • Slide 19/39

    Real Estate Appraisal by Neural Networks

  • Slide 20/39

    Building an ANN Model

    Input: X1, X2, X3    Output: Y    Model: Y = f(X1, X2, X3)

    # Input Neurons = # Inputs = 3    # Output Neurons = # Outputs = 1
    # Hidden Layers = ???                 Try 1
    # Neurons in the Hidden Layer = ???   Try 2

    There is no fixed strategy: decide by trial and error.

    The architecture is now defined. How do we get the weights?

    Given the architecture, there are 8 weights to decide:
    W = (W1, W2, ..., W8)

    Training Data: (Yi, X1i, X2i, ..., Xpi), i = 1, 2, ..., n

    Given a particular choice of W, we get predicted Ys:
    (V1, V2, ..., Vn). They are functions of W.

    Choose W such that the overall prediction error E is minimized:

    E = Σ (Yi - Vi)^2
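    As a one-function sketch of this criterion in Python (here predict
    stands for the network's forward pass under a candidate weight
    vector W; the names are ours):

    def total_error(data, predict):
        """Overall prediction error E = sum over i of (Yi - Vi)^2.

        data    : list of (y, x) pairs, where x is the input vector
        predict : function mapping x to the network's prediction V
        """
        return sum((y - predict(x)) ** 2 for y, x in data)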

  • Slide 21/39

    Training the Model

    E = Σ (Yi - Vi)^2

    - Start with a random set of weights.
    - Feed forward the first observation through the net:
      X1 -> Network -> V1;  Error = (Y1 - V1)
    - Back-propagate: adjust the weights so that this error is reduced
      (so the network fits the first observation well).
    - Feed forward the second observation; adjust the weights to fit
      the second observation well.
    - Keep repeating till you reach the last observation. This finishes
      one CYCLE through the data.
    - Randomize the order of the input observations.
    - Perform many such training cycles till the overall prediction
      error E is small.

  • Slide 22/39

    Back Propagation

    E = Σ (Yi - Vi)^2

    Each weight "shares the blame" for the prediction error and tries
    to adjust along with the other weights. The Back Propagation
    algorithm decides how to distribute the blame among all the weights
    and adjusts them accordingly:

    - a small portion of the blame leads to a small adjustment;
    - a large portion of the blame leads to a large adjustment.

  • Slide 23/39

    Weight adjustment during Back Propagation

    Vi, the prediction for the i-th observation, is a function of the
    network weight vector W = (W1, W2, ...). Hence E, the total
    prediction error, is also a function of W:

    E(W) = Σ [Yi - Vi(W)]^2

    Gradient Descent Method: for every individual weight Wi, the
    updating formula looks like

    Wnew = Wold - η * (∂E/∂W)|Wold

    η = Learning Parameter (between 0 and 1)

    Another slight variation is also used sometimes:

    W(t+1) = W(t) - η * (∂E/∂W)|W(t) + α * (W(t) - W(t-1))

    α = Momentum (between 0 and 1)
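    A minimal sketch of the momentum update for a single weight (the
    default eta and alpha values are illustrative, not from the
    slides):

    def update_weight(w_old, grad, w_prev, eta=0.1, alpha=0.5):
        """One gradient-descent step with momentum for a single weight.

        w_old  : W(t), the current weight
        grad   : dE/dW evaluated at W(t)
        w_prev : W(t-1), the weight from the previous step
        eta    : learning parameter (between 0 and 1)
        alpha  : momentum (between 0 and 1)
        """
        return w_old - eta * grad + alpha * (w_old - w_prev)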

  • Slide 24/39

    Geometric Interpretation of the Weight Adjustment

    Consider a very simple network with 2 inputs and 1 output, and no
    hidden layer. There are only two weights, w1 and w2, whose values
    need to be specified.

    E(w1, w2) = Σ [Yi - Vi(w1, w2)]^2

    - A pair (w1, w2) is a point on the 2-D plane.
    - For any such point we can get a value of E.
    - Plot E vs. (w1, w2): a 3-D surface, the Error Surface.
    - The aim is to identify the pair for which E is minimum, i.e. the
      pair for which the height of the error surface is minimum.

    Gradient Descent Algorithm:

    - Start with a random point (w1, w2).
    - Move to a better point where the height of the error surface is
      lower.
    - Keep moving till you reach (w*1, w*2), where the error is
      minimum.

  • Slide 25/39

    Crawling the Error Surface

    [Figure: a 3-D error surface (Error vs. W1 and W2) over the weight
    space, with the starting point w0, a local minimum, and the global
    minimum w* marked]

  • Slide 26/39

    Training Algorithm

    E = Σ (Yi - Vi)^2

    - Decide the network architecture (# hidden layers, # neurons in
      each hidden layer).
    - Decide the learning parameter and momentum.
    - Initialize the network with random weights.
    - Do until the convergence criterion is met:
        For I = 1 to # training data points:
          Feed forward the I-th observation through the net
          Compute the prediction error on the I-th observation
          Back-propagate the error and adjust the weights
        Next I
        Check for convergence
      End Do
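    The same loop as a Python sketch. The helpers
    network.feed_forward(x) and network.back_propagate(x, y) are
    hypothetical stand-ins for the per-observation steps; they are not
    defined by the slides:

    import random

    def train(network, data, n_cycles=1000, tol=1e-4):
        """Feed-forward / back-propagation training loop (skeleton)."""
        prev_E = float("inf")
        for cycle in range(n_cycles):
            random.shuffle(data)                 # randomize observation order
            for x, y in data:                    # one pass = one training CYCLE
                network.back_propagate(x, y)     # fit this observation a bit better
            E = sum((y - network.feed_forward(x)) ** 2 for x, y in data)
            if prev_E - E < tol:                 # convergence: E barely decreased
                break
            prev_E = E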

  • Slide 27/39

    Convergence Criterion

    When do we stop training the network? Ideally, when we reach the
    global minimum of the error surface. How do we know we have reached
    there? We don't.

    Suggestions:

    1. Stop if the decrease in the total prediction error (since the
       last cycle) is small.
    2. Stop if the overall changes in the weights (since the last
       cycle) are small.

    Drawback:

    The error keeps on decreasing, and we get a very good fit to the
    training data. BUT the network thus obtained has poor generalizing
    power on unseen data. This phenomenon is known as Overfitting of
    the training data; the network is said to Memorize the training
    data:

    - when an X in the training set is given, the network faithfully
      produces the corresponding Y;
    - however, for Xs which the network didn't see before, it predicts
      poorly.

  • Slide 28/39

    Convergence Criterion

    Modified suggestion: partition the training data into a Training
    set and a Validation set.

    Use the Training set to build the model, and the Validation set to
    test the performance of the model on unseen data.

    Typically, as we run more and more training cycles:

    - the error on the Training set keeps on decreasing;
    - the error on the Validation set first decreases and then
      increases.

    [Figure: Error vs. Cycle; the Training curve keeps falling while
    the Validation curve turns upward]

    Stop training when the error on the Validation set starts
    increasing.
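    A sketch of this early-stopping rule, reusing the hypothetical
    network helpers from the earlier training-loop sketch:

    def train_with_early_stopping(network, train_set, valid_set,
                                  max_cycles=1000):
        """Stop when the validation error starts increasing."""
        def sse(data):   # sum of squared errors on a data set
            return sum((y - network.feed_forward(x)) ** 2 for x, y in data)

        best_valid = float("inf")
        for cycle in range(max_cycles):
            for x, y in train_set:
                network.back_propagate(x, y)
            v = sse(valid_set)
            if v > best_valid:       # validation error turned upward: stop
                break
            best_valid = v

    A fuller version would also keep a copy of the weights from the
    best cycle and restore them after stopping.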

  • Slide 29/39

    Choice of Training Parameters

    The Learning Parameter and Momentum need to be supplied by the user
    from outside; both should be between 0 and 1.

    Learning Parameter:

    - Too big: large leaps in weight space, with the risk of missing
      the global minimum.
    - Too small: takes a long time to converge to the global minimum,
      and once stuck in a local minimum it is difficult to get out.

    What should the optimal values of these training parameters be?

    - There is no clear consensus on any fixed strategy.
    - However, the effects of wrongly specifying them are well studied.

    Suggestion: trial and error. Try various choices of the Learning
    Parameter and Momentum, and see which choice leads to the minimum
    prediction error.
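    One way to sketch that trial-and-error search in Python (both
    helpers are hypothetical: build_and_train(eta, alpha) trains a
    fresh network, and valid_error(net) returns its validation-set
    error):

    def pick_parameters(build_and_train, valid_error):
        """Grid search over the learning parameter and momentum."""
        best = None
        for eta in (0.1, 0.3, 0.5, 0.7, 0.9):
            for alpha in (0.0, 0.3, 0.6, 0.9):
                net = build_and_train(eta, alpha)
                err = valid_error(net)
                if best is None or err < best[0]:
                    best = (err, eta, alpha, net)
        return best   # (error, eta, alpha, trained network)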

  • Slide 30/39

    Summary

    - Artificial Neural Network (ANN): a class of models inspired by
      biological neurons.
    - Used for various modeling problems: prediction, classification,
      clustering, ...
    - One particular subclass of ANNs: feed-forward back-propagation
      networks.
      - Organized in layers: input, hidden, output.
      - Each layer is a collection of a number of artificial neurons.
      - Neurons in one layer are connected to neurons in the next
        layer.
      - Connections have weights.
    - Fitting an ANN model means finding the values of these weights.
    - Given a training data set, the weights are found by the
      feed-forward back-propagation algorithm, which is a form of the
      Gradient Descent Method, a popular technique for function
      minimization.
    - The network architecture as well as the training parameters are
      decided by trial and error: try various choices and pick the one
      that gives the lowest prediction error.

  • Slide 31/39

    Self-Organizing Maps

  • Slide 32/39

    Overview

    A Self-Organizing Map (SOM) is a way to represent
    higher-dimensional data in a (usually) 2-D or 3-D form, such that
    similar data are grouped together.

    It runs unsupervised and performs the grouping on its own. Once the
    SOM converges, it can classify new data.

    SOMs run in two phases:

    - Training phase: the map is built; the network organizes itself
      through a competitive process and is trained using a large number
      of inputs.
    - Mapping phase: new vectors are quickly given a location on the
      converged map, easily classifying or categorizing the new data.

  • Slide 33/39

    Example

    Example: data on poverty levels in different countries. The SOM
    does not show the poverty levels themselves; rather, it shows how
    similar the poverty levels of different countries are to each other
    (similar color = similar poverty level).

  • Slide 34/39

    Forming the Map

  • Slide 35/39

    The Basic Process

    1) Each pixel is a node / unit.
    2) Initialize each node's weights with random values.
    3) Take the next vector from the training data and present it to
       the SOM.
    4) Every node is examined to find the Best Matching Unit / Node
       (BMU).
    5) Adjust the BMU's weights to make it come closer to the input
       vector. The rate of adjustment is determined by the learning
       rate, which decreases with each iteration.
    6) The radius of the neighborhood around the BMU is calculated. The
       size of the neighborhood decreases with each iteration.
    7) Each node in the BMU's neighborhood has its weights adjusted to
       become more like the BMU. Nodes closest to the BMU are altered
       more than nodes further away in the neighborhood.
    8) Repeat steps 3 to 7 for each vector in the training data.
    9) Repeat steps 3 to 8 for enough iterations for convergence (see
       the sketch after this list).
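    A compact sketch of the whole process in Python with NumPy. All
    names and decay constants are our assumptions; it cycles through
    the data one vector per iteration and assumes an initial radius
    greater than 1:

    import numpy as np

    def train_som(data, rows, cols, n_iters=1000, L0=0.1):
        """Self-organizing map training loop (steps 1-9)."""
        dim = data.shape[1]
        weights = np.random.rand(rows, cols, dim)   # 2) random initial weights
        sigma0 = max(rows, cols) / 2.0              # initial neighborhood radius
        lam = n_iters / np.log(sigma0)              # decay constant lambda

        # grid coordinates of every node, for neighborhood distances
        grid = np.array([[(r, c) for c in range(cols)]
                         for r in range(rows)], dtype=float)

        for t in range(n_iters):                    # 9) many iterations
            v = data[t % len(data)]                 # 3) next training vector
            # 4) BMU: the node whose weights are closest to v
            d = np.linalg.norm(weights - v, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # 6) neighborhood radius and learning rate decay over time
            sigma = sigma0 * np.exp(-t / lam)
            L = L0 * np.exp(-t / lam)
            # 7) Gaussian influence: nodes nearer the BMU move more
            dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            theta = np.exp(-dist2 / (2.0 * sigma ** 2))
            # 5) + 7) pull the weights toward the input vector
            weights += (theta * L)[:, :, None] * (v - weights)
        return weights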

  • Slide 36/39

    Calculating the Best Matching Unit

    Calculating the BMU is done according to the Euclidean distance
    between the node's weights (W1, W2, ..., Wn) and the input vector's
    values (V1, V2, ..., Vn):

    Dist = sqrt( Σ (Vi - Wi)^2 )

    This gives a good measurement of how similar the two sets of data
    are to each other. The node closest to the input vector is chosen
    as the BMU (or winning node).
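    In Python with NumPy (the names are ours):

    import numpy as np

    def best_matching_unit(weights, v):
        """Index of the node whose weight vector is closest to input v.

        weights : node weight vectors, shape (n_nodes, dim)
        """
        distances = np.linalg.norm(weights - v, axis=1)  # sqrt(sum (Vi - Wi)^2)
        return int(np.argmin(distances))                 # the winning node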

  • Slide 37/39

    Determining the BMU's Neighborhood

    - Size of the neighborhood: uses an exponential decay function that
      shrinks on each iteration, until eventually the neighborhood is
      just the BMU itself.
    - Effect of location within the neighborhood: the neighborhood is
      defined by a Gaussian curve, so that nodes that are closer are
      influenced more than farther nodes.
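    These two ingredients as small Python functions (a common
    formulation; the exact constants are not given on the slide):

    import numpy as np

    def neighborhood_radius(t, sigma0, lam):
        # exponential decay: shrinks each iteration toward the BMU alone
        return sigma0 * np.exp(-t / lam)

    def influence(dist_to_bmu, sigma):
        # Gaussian curve: closer nodes are influenced more than farther ones
        return np.exp(-dist_to_bmu ** 2 / (2.0 * sigma ** 2))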

  • Slide 38/39

    Modifying the Nodes' Weights

    The new weight for a node is the old weight, plus a fraction (L) of
    the difference between the old weight and the input vector,
    adjusted (by Θ) depending on the distance from the BMU:

    W(t+1) = W(t) + Θ(t) * L(t) * ( V(t) - W(t) )

    The learning rate, L, is also an exponential decay function,
    L(t) = L0 * e^(-t/λ). This ensures that the SOM will converge. Here
    λ represents a constant, and t is the time step.
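    The update as a one-liner in Python, using the influence (theta)
    and learning rate (L) from the previous slide's sketch:

    def som_update(w, v, theta, L):
        """W(t+1) = W(t) + theta(t) * L(t) * (V(t) - W(t)), per node."""
        return w + theta * L * (v - w)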


  • Slide 39/39

    SOM Application: Text Clustering