
ML: ANNs - Intro

• Artificial Neural Networks (ANNs) have a biological basis

– They are modeled after the neural structure of the brain

– Each neuron is connected to many others

– A neuron receives signals from multiple inputs, which are excitatory or inhibitory

– The neuron is either activated or not

– If activated, it sends a signal to other neurons

• ANNs are studied for two reasons:

1. As a model of the brain in an attempt to understand how the brain works

2. As a model for machine learning

– The goal of this approach is to achieve effective machine learning algorithms, not to simulate biological processes

• Advantages to the ANN approach:

1. It is a robust technique when data is noisy

2. Performance degrades gracefully when nodes fail

3. Useful for real-valued functions and for symbolic representations

• ANNs are good for problems with the following characteristics:

1. Instances consist of many features

– They may be related or independent

– Input values are reals

2. Target function output can be discrete or real, single or multi-valued

3. Training data may contain errors

4. Long training time is acceptable

5. May require fast evaluation of the target function

– May need to make fast decisions once trained

6. Human understanding of the reasoning process not essential

– Decision process of ANNs not amenable to explanation


ML: ANNs - Perceptrons

• The perceptron is the earliest representation of an ANN:

– The sum is a linear combination of the inputs

– The input is a set of weighted values (wixi)

– Each weight represents the importance/relevance/contribution of a particular input to the output

– Output o(x1, x2, ..., xn):

  o(x_1, x_2, \ldots, x_n) = \begin{cases} 1 & \text{when } w_0 x_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}

– x0 is a bias term, where x0 = 1, and w0 is a threshold that must be reached in order for the perceptron to fire

• The summation can be written as a vector product:

  w \cdot x > 0

  where

  w = [w_0\ w_1\ \ldots\ w_n] \quad \text{and} \quad x = [x_0\ x_1\ \ldots\ x_n]^T

  Then:

  o(x) = \operatorname{sign}(w \cdot x)

  where

  \operatorname{sign}(y) = \begin{cases} 1 & \text{when } y > 0 \\ -1 & \text{otherwise} \end{cases}


ML: ANNs - Perceptrons (2)

• Hypothesis space H:

  H = \{ w \mid w \in \mathbb{R}^{n+1} \}

• Perceptrons can represent any function that is linearly separable

– Functions whose values can be separated by a hyperplane in which the positive examples lie on one side of the plane and the negative examples lie on the other side

– The plane represents a decision surface

– This includes the Boolean functions AND, OR, NAND, and NOR

– However, not all Boolean functions are linearly separable; e.g., XOR
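For example (not from the notes), a perceptron with hand-picked weights can compute AND and OR, while no single weight vector computes XOR; a minimal Python sketch:

# A thresholded perceptron with illustrative, hand-picked weights.
# AND and OR are linearly separable, so suitable weights exist; XOR is not,
# so no choice of weights makes this single unit compute it.
def perceptron(weights, x):
    """Return +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (bias input x0 = 1)."""
    total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total > 0 else -1

AND_W = [-1.5, 1.0, 1.0]   # fires only when both inputs are 1
OR_W  = [-0.5, 1.0, 1.0]   # fires when at least one input is 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(AND_W, x), perceptron(OR_W, x))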


ML: ANNs - Perceptrons (3)

• The perceptron training rule

– This trains a single perceptron

– Output is ±1

– The basic technique:

1. Randomly assign weights to the inputs

2. Apply the perceptron to each training instance and modify the weights when it misclassifies an instance

3. Repeat until the entire training set is correctly classified

– The algorithm:

GD(te, η)
  initialize each wi to some small random value
  until done
    ∆wi ← 0
    for each (x, t) ∈ te
      compute o given x
      for each wi
        ∆wi ← ∆wi + η(t − o)xi
    for each wi
      wi ← wi + ∆wi

where
  te is a set of training data of the form (x, t)
  x is the vector of feature values
  t is the target value

– The perceptron training rule refers to the algorithm for adjusting weights:

wi ← wi + ∆wi

where

∆wi = η(t − o)xi,
t is the target output,
o is the perceptron output, and
η is a constant called the learning rate

• Note that if o < t, the above increases the weight (for a positive input xi), and vice-versa
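A minimal Python sketch of this rule applied per example (not from the notes; names are illustrative, and ±1 targets are assumed):

import random

def train_perceptron(training_set, eta=0.1, max_epochs=100):
    """Perceptron training rule: wi <- wi + eta*(t - o)*xi on misclassified examples."""
    n = len(training_set[0][0])                              # number of features
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # w[0] is the bias weight
    for _ in range(max_epochs):
        all_correct = True
        for x, t in training_set:                            # t is the ±1 target
            o = 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
            if o != t:                                       # update only on a misclassification
                all_correct = False
                w[0] += eta * (t - o)                        # bias input x0 = 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if all_correct:                                      # entire training set classified correctly
            break
    return w

# Example: learn OR with ±1 targets
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(data))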


ML: ANNs - Gradient Descent and the Delta Rule

• The Perceptron Rule is guaranteed to converge on a solution provided the training set is linearly separable, but not otherwise

• The Delta Rule (gradient descent) will converge even when the training set is not linearly separable

– See notes on regression for details of gradient descent

• Consider an unthresholded perceptron where o(x) = wx

– This is called a linear unit

– Error can be computed using the sum of squared errors (SSE):

  E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

– This can also be applied stochastically

– This differs from the Perceptron Rule in that in the Delta Rule, o(x) = wx, while in the Perceptron Rule o(x) = sign(wx)

• Comparison of the Perceptron and Delta Rules:

– The Delta Rule can achieve the same results as the Perceptron Rule by using appropriate values of t = ±1 to correspond to Perceptron Rule outputs

– The Perceptron Rule converges after a finite number of iterations to a hypothesis that perfectly classifies the training set (provided the set is linearly separable)

∗ The Delta Rule converges asymptotically toward the minimal error hypothesis, in possibly unbounded time, regardless of whether the training set is linearly separable
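A minimal Python sketch of batch gradient descent for a linear unit under the squared error above (not from the notes; names and the example data are illustrative):

import random

def train_linear_unit(training_set, eta=0.05, epochs=200):
    """Delta Rule / batch gradient descent on E(w) = 1/2 * sum((t - o)^2)."""
    n = len(training_set[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]      # w[0] is the bias weight
    for _ in range(epochs):
        delta = [0.0] * (n + 1)                                  # accumulate weight changes over all examples
        for x, t in training_set:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))  # unthresholded output o = w . x
            delta[0] += eta * (t - o)
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * (t - o) * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]              # one batch update per pass through the data
    return w

# Example: fit the target t = 2*x - 1 from a few samples
data = [((x,), 2 * x - 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
print(train_linear_unit(data))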


ML: ANNs - Multilayer Networks - Intro

• Multilayer networks can represent more complex functions

– Decision surfaces can be complex; the data need not be linearly separable

• Need to use a different type of unit:

– Do not want to use linear units as they only represent linear functions

– Do not want to use perceptrons as they are not differentiable (they have a step threshold) and so cannot use gradient descent


ML: ANNs - Multilayer Networks - Intro (2)

• Use sigmoid units:

– Output is a continuous function of the input:

o = σ(wx)

where

\sigma(y) = \frac{1}{1 + e^{-y}}

and 0 ≤ o ≤ 1

– This is called a sigmoid, or logistic, function

– This is sometimes called the squashing function of the unit:

∗ It converts a large domain of values into a very small range

– The derivative can be expressed in terms of the output:

\frac{d\sigma(y)}{dy} = \sigma(y)(1 - \sigma(y))

– Note: Other functions can be used instead of σ; e.g., replacing e^{−y} with e^{−ky} for some constant k
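A minimal Python sketch of the sigmoid and its derivative expressed in terms of the unit's output (not from the notes):

import math

def sigmoid(y):
    """Squashing function: maps any real y into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative_from_output(o):
    """d(sigma)/dy written in terms of the output o = sigma(y)."""
    return o * (1.0 - o)

o = sigmoid(0.5)
print(o, sigmoid_derivative_from_output(o))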


ML: ANNs - Multilayer Networks - The Back Propagation Algorithm

• Will consider ANNs with two layers with multiple outputs

– The first layer is referred to as the hidden layer

– It is connected to the inputs, and its outputs serve as inputs to the output layer

• The issue that needs to be addressed is: How should the weights of the hidden layer be adjusted?

– The technique to be described is stochastic gradient descent with back propagation

– Assume more than a single output node

• Let

E(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2

where

  E(w) is the sum of the error over all of the output units
  outputs is the set of output units
  t_{kd} is the target value of the kth unit when training with sample d
  o_{kd} is the output value of the kth unit when training with sample d

• A multilayer ANN can have local minima, so stochastic gradient descent with back propagation is not guaranteed to converge on the global minimum


ML: ANNs - Multilayer Networks - The Back Propagation Algorithm (2)

• The algorithm:

– Assume two layers as discussed above

– Node refers to an input, output, or interior unit

– Training example has the form <x, t>, where x is the vector of feature values and t is the vector of target output values

– Input from unit i to unit j is denoted xji, and the weight from unit i to unit j is denoted wji

– δn is the error associated with node n:

  \delta_n = -\frac{\partial E}{\partial net_n}

  which corresponds to the (t − o) term of the delta rule

– Numbers of nodes:

∗ Input: n_in

∗ Output: n_out

∗ Hidden: n_hidden

initialize weights to random values
for each <x, t> ∈ training set
  propagate the input forward through the network, computing the output ou for every network unit u
  propagate the error backward through the network:
    for each output unit k, calculate its error δk:
      δk ← ok(1 − ok)(tk − ok)
    for each hidden unit h, calculate its error δh:
      δh ← oh(1 − oh) Σ_{k∈outputs} wkh δk
    update each weight wji:
      wji ← wji + ∆wji
      where ∆wji = η δj xji
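A minimal NumPy sketch of the algorithm above for one hidden layer of sigmoid units and sigmoid outputs (not from the notes; shapes, names, and the XOR example are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(train, n_in, n_hidden, n_out, eta=0.3, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))          # +1 column for the bias input
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, t in train:                                        # stochastic: update after each example
            x = np.append(1.0, x)                                 # bias input x0 = 1
            o_h = sigmoid(W_h @ x)                                # forward pass: hidden outputs
            h = np.append(1.0, o_h)
            o_k = sigmoid(W_o @ h)                                # forward pass: network outputs
            delta_k = o_k * (1 - o_k) * (t - o_k)                 # output-unit errors
            delta_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ delta_k)  # hidden-unit errors
            W_o += eta * np.outer(delta_k, h)                     # weight updates, delta_w = eta * delta * x
            W_h += eta * np.outer(delta_h, x)
    return W_h, W_o

# Example usage: train on XOR (not linearly separable, so a hidden layer is required)
xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
W_h, W_o = backprop(xor, n_in=2, n_hidden=3, n_out=1)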


ML: ANNs - Multilayer Networks - The Back Propagation Algorithm (3)

• The primary difference between this and the earlier algorithm lies in the hidden units

– There is no direct way to measure their error

– What is done is to determine how much error a hidden unit contributes to the output nodes

1. δk is summed for all of the output nodes influenced by hidden node h

2. Each is weighted by wkh

3. Then proceed as for an output node

• Termination triggered when

1. A predetermined number of iterations have been performed; or

2. The error falls below a predetermined threshold; or

3. The error on a validation set falls below a predetermined threshold

• One modification is to add a momentum term

– This gives a push to keep the change going in a given direction

– Based on the previous iteration:

∆wji(n) = ηδjxji + α∆wji(n− 1)

where ∆wji(n) represents the change in iteration n, and α is the momentum

– This helps to prevent getting stuck in local minima, and speeds convergence
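A minimal Python sketch of this momentum update (not from the notes; names are illustrative):

def momentum_update(w, delta_j, x_ji, prev_delta, eta=0.3, alpha=0.9):
    """Return the new weight and the new change: dw(n) = eta*delta_j*x_ji + alpha*dw(n-1)."""
    new_delta = eta * delta_j * x_ji + alpha * prev_delta
    return w + new_delta, new_delta            # the caller stores new_delta for the next iteration

w, prev = 0.1, 0.0
w, prev = momentum_update(w, delta_j=0.05, x_ji=1.0, prev_delta=prev)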

• For ANNs with more than two layers, simply apply the weight update step recursively:

  \delta_r \leftarrow o_r(1 - o_r) \sum_{s \in \text{layer } m+1} w_{sr}\, \delta_s

  where unit r is in layer m

• If layers are not uniform, can generalize even further:

  \delta_r \leftarrow o_r(1 - o_r) \sum_{s \in \text{downstream}(r)} w_{sr}\, \delta_s


ML: ANNs - The Back Propagation Algorithm: Gradient Rule Derivation

• For each training instance d, weight wji is updated by ∆wji where

\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}

• Error on training example d (summed over the output units):

  E_d(w) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2

• Representations:

  x_{ji}          the ith input to unit j
  w_{ji}          the weight on input x_{ji}
  net_j           the weighted input of unit j (Σ_i w_{ji} x_{ji})
  o_j             the output of unit j
  t_j             the target output for unit j
  σ               the sigmoid function
  outputs         the units in the output layer
  downstream(j)   the units whose inputs include the output of unit j

\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} x_{ji}

• Case 1: The output units

\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}

since netj influences the network only through oj

\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2

which is zero for all output units k except when k = j. Thus, the summation can be dropped:

\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \left[ \frac{1}{2} (t_j - o_j)^2 \right] = -(t_j - o_j)


ML: ANNs - The Back Propagation Algorithm: Gradient Rule Derivation (2)

Since oj = σ(netj),

\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j(1 - o_j)

The last step is obtained by substituting oj for σ(netj) in σ(netj)(1 − σ(netj)), which was derived earlier

And finally,

\frac{\partial E_d}{\partial net_j} = -(t_j - o_j)\, o_j(1 - o_j)

\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}} = \eta\, (t_j - o_j)\, o_j(1 - o_j)\, x_{ji}

Thus, for an output unit,

\delta_k = -\frac{\partial E_d}{\partial net_k} = (t_k - o_k)\, o_k(1 - o_k)


ML: ANNs - The Back Propagation Algorithm: Gradient Rule Derivation (3)

• Case 2: Hidden units

– netj only influences those nodes in downstream(j)

\frac{\partial E_d}{\partial net_j} = \sum_{k \in \text{downstream}(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j}

= \sum_{k \in \text{downstream}(j)} -\delta_k \frac{\partial net_k}{\partial net_j}

= \sum_{k \in \text{downstream}(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}

= \sum_{k \in \text{downstream}(j)} -\delta_k\, w_{kj} \frac{\partial o_j}{\partial net_j}

= \sum_{k \in \text{downstream}(j)} -\delta_k\, w_{kj}\, o_j(1 - o_j)

Thus,

\delta_j = -\frac{\partial E_d}{\partial net_j} = o_j(1 - o_j) \sum_{k \in \text{downstream}(j)} \delta_k\, w_{kj}

and

\Delta w_{ji} = \eta\, \delta_j\, x_{ji}


ML: ANNs - Convergence

• As mentioned previously, there is no guarantee that stochastic gradient descent will converge on the global minimum

– Local minima exist because the error function is defined over an n-dimensional space: one dimension for each weight in the network

– However, when one weight reaches a local minimum along its dimension, it is unlikely that the others will at the same time

∗ This tends to keep the ANN from getting trapped in a local minimum

• Note that the sigmoid function is close to linear near zero

– Only after the weights grow larger does the function become more non-linear and complex

– This is where the local minima tend to be found

– When gradient descent reaches these areas, it will have neared the global minimum

• Techniques to deal with local minima:

1. Add a momentum term

– Discussed earlier

2. Use stochastic gradient descent instead of batch

– Stochastic gradient descent follows a different error surface for each training sample

– Each surface will have different local minima, and this reduces the likelihood of getting stuck in any one of them

3. Train multiple times, each with different initial weights

– If the different runs produce different hypotheses

∗ Can use the one with the best performance on the validation set, or

∗ Use all versions and average their outputs
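A minimal Python sketch of the multiple-restarts strategy (not from the notes; train() and accuracy() are assumed helpers, e.g. wrappers around the back propagation sketch earlier):

def train_with_restarts(train_set, val_set, n_restarts=5):
    best_net, best_score = None, float("-inf")
    for seed in range(n_restarts):
        net = train(train_set, seed=seed)      # assumed helper: different initial weights per run
        score = accuracy(net, val_set)         # assumed helper: performance on the validation set
        if score > best_score:
            best_net, best_score = net, score
    return best_net                            # keep the version that does best on validation data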


ML: ANNs - Layers and Representations

• Different types of functions require different ANN architectures:

– Boolean functions

∗ Two layers

– Continuous functions

∗ Two layers:

· Output layer: linear units

· Hidden layer: sigmoid units

– Arbitrary functions

∗ Three layers

· Output layer: linear units

· Hidden layers: sigmoid units


ML: ANNs - Termination

• Earlier, three techniques for triggering termination were listed

• Issues can arise if training stops once the error falls below some threshold

– In one case, the error on the validation set reaches a minimum and then starts to increase, while the training set error decreases monotonically

∗ If training is halted at the minimum of the validation error, it is still far from convergence with respect to the training set error

∗ This is the result of overfitting to idiosyncrasies of the training data

∗ As weights increase with extended training, the decision surface becomes more complex and starts fitting the noise

– In another case, the validation error has a local minimum

∗ Stopping when the error starts to increase would be premature


ML: ANNs - Termination (2)

• Techniques to address these issues:

1. Weight decay

– Decrease each weight by a small amount each iteration

– This keeps the weights small and discourages overly complex decision surfaces

2. Use validation sets

– Iterate to get the lowest error in the validation set

– Must be careful not to stop at a local minimum

– Use the training set to generate weights using gradient descent

– A working set of training weights and a best-so-far set are maintained

– The best-so-far set holds the weights that produce the lowest validation error

• For small training sets, use cross validation

– Perform validation k times, each time with a different partition of training and validation samples

– Average the final results

– One approach takes m samples and divides them into k sets of size m/k

∗ Each of the k sets is used once for validation, with the other k − 1 used for training

∗ The average number of training iterations i over the k runs is computed; then one final run is performed on all of the examples for i iterations, with no validation set
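A minimal Python sketch of this cross-validation procedure (not from the notes; train_until_val_min() is an assumed helper returning the iteration count at which validation error was lowest, and train_for() an assumed helper that trains for a fixed number of iterations):

def crossval_iterations(samples, k):
    fold_size = len(samples) // k
    best_iters = []
    for i in range(k):
        val = samples[i * fold_size:(i + 1) * fold_size]               # one fold held out for validation
        train = samples[:i * fold_size] + samples[(i + 1) * fold_size:]
        best_iters.append(train_until_val_min(train, val))             # assumed helper
    return sum(best_iters) // k                                        # average iteration count over the k folds

# Final model: retrain on all samples for the averaged iteration count
# final_net = train_for(samples, iterations=crossval_iterations(samples, k=10))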


ML: ANNs - Error Functions

• The sum of squared errors (SSE) is not the only error function that can be used to adjust weights

• Any function that is used must be differentiable WRT the parameterized hypothesis space

• For each definition/measure of error, a weight tuning rule must be derived

• Alternatives:

1. Add a penalty term for weight magnitude

– This term grows with the magnitude of the weight vector

– The purpose is to drive gradient descent toward smaller weights in an effort to reduce overfitting

– One strategy:

E(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2

– This gives the same result as the back propagation rule with each weight multiplied by (1 − 2γη) on each iteration (see the derivation sketch after this list)

2. Add a term for error in the slope (derivative) of the target function

– One strategy:

E(w) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_{jd}} - \frac{\partial o_{kd}}{\partial x_{jd}} \right)^2 \right]

where

x_{jd} represents the value of the jth input unit for training sample d
The derivatives represent the slopes of the target and learned outputs WRT input x_{jd}, respectively
µ is a constant representing the relative weight of the derivative term vs the training samples


ML: ANNs - Error Functions (2)

3. Minimizing cross entropy of the ANN WRT the target values

– This is appropriate when the network should output the probability of a target value for a given input

– The best estimates are generated when the ANN minimizes the cross entropy, defined as

-\sum_{d \in D} \left[ t_d \log(o_d) + (1 - t_d) \log(1 - o_d) \right]

where

  o_d is the probability estimate output by the network for training sample d

4. Weight sharing

– This technique makes some weights identical

– Goal is to enforce some constraint known in advance
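A brief derivation (not spelled out in the notes, but following directly from the penalized error in alternative 1) of where the (1 − 2γη) factor comes from:

  \frac{\partial}{\partial w_{ji}} \Big[ \gamma \sum_{i,j} w_{ji}^2 \Big] = 2\gamma\, w_{ji}

  so adding −η times this gradient to the usual update gives

  w_{ji} \leftarrow w_{ji} + \eta\, \delta_j x_{ji} - 2\gamma\eta\, w_{ji} = (1 - 2\gamma\eta)\, w_{ji} + \eta\, \delta_j x_{ji}

  i.e., the standard back propagation update applied to a weight that has first been shrunk by the factor (1 − 2γη).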


ML: ANNs - Alternative Error Minimization Procedures

• These techniques are alternatives to gradient descent

• There are two aspects to consider:

1. The direction of change

2. The magnitude of the change

1. Line search

– A direction is chosen for the weight update

– The amount of change is determined by minimizing the error along this vector

2. Conjugate gradient method

– Perform a sequence of line searches along the error surface to find the minimum

– To begin, use the negated gradient

– On each subsequent step, choose a new direction that maintains the zero value of the component of the gradient that was just made zero

• Overall, these techniques provide no real advantages over gradient descent


ML: ANNs - Recurrent Networks

• A recurrent network is one in which output from one layer is input to a layer upstream from it

– I.e., current output affects future output

• Training can be accomplished using a modified back propagation algorithm.


ML: ANNs - Dynamic Network Structure

• A dynamic ANN can add and delete nodes

• The Cascade-Correlation algorithm starts with no hidden units

– If an undesirable degree of error remains after training, a hidden unit is added

– Its weights are chosen to maximize the correlation between the hidden unit's value and the network error

– This unit is connected to each output unit

– The process is repeated recursively until the overall error is acceptable

• The opposite approach is also used

– Start with a complex ANN

– Identify weights that have little or no effect on the output (the "optimal brain damage" approach)

∗ Could identify weights near zero

∗ Or identify those where ∂E/∂w is small

∗ Once identified, these connections are eliminated
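A minimal NumPy sketch of the simplest variant above, eliminating weights near zero (not from the notes; the ∂E/∂w criterion would additionally require gradient information, and the threshold value is illustrative):

import numpy as np

def prune_small_weights(W, threshold=1e-2):
    """Return a copy of the weight matrix with near-zero weights eliminated."""
    W = W.copy()
    W[np.abs(W) < threshold] = 0.0       # an eliminated connection contributes nothing to the output
    return W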


ML: ANNs - Fine Tuning

• This is complex since there are so many hyperparameters

• Ultimately, one hidden layer is sufficient to learn any function to a high degree of accuracy

– The issue is efficiency

– The greater the number of layers, the fewer neurons needed (greater parameter efficiency)

– Each layer in a multilayer network captures a different level of detail

∗ Those closest to the input layer capture the coarsest detail, while those closest to the output capture the highest level of detail

∗ They create a hierarchy, where structure at a layer combines structure from the previous layer

– This also aids in building new networks

∗ If you want to build a network that is based on characteristics of one already trained, you can use the saved weights and nodes of the latter as the foundation for the new network

∗ This saves training effort

– In general, start with a few layers and increase until overfitting occurs

• Number of neurons per layer

– One approach is to use a pyramidal shape, with the lowest layer having the greatest number and the highest the least

– Using the same number at each layer decreases the number of hyperparameters to worry about (one instead of one per layer)

– A general technique is to start with a large number of layers and nodes and use early stopping to prevent overfitting


ML: ANNs - Fine Tuning (2)

• Activation functions

– ReLU is generally a good choice for hidden layers

∗ ReLU(z) = max(0, z)

∗ Fast to compute and alleviates some issues with gradient descent

∗ But not differentiable at z = 0

– For the output layer, softmax is a good choice for classification

– The activation function can be omitted (i.e., a linear output) when doing regression
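A minimal NumPy sketch of these activation functions (not from the notes; subtracting the maximum in softmax is a standard numerical-stability detail):

import numpy as np

def relu(z):
    """ReLU(z) = max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

def softmax(z):
    """Convert a vector of scores into a probability distribution over classes."""
    e = np.exp(z - np.max(z))            # subtract the max for numerical stability
    return e / e.sum()

print(relu(np.array([-1.0, 0.5])), softmax(np.array([1.0, 2.0, 3.0])))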
