
Supervised Learning: Artificial Neural Networks

Some slides adapted from Dan Klein et al. (Berkeley) and Percy Liang (Stanford)

Plan

• Project 4: http://www.mathcs.emory.edu/~eugene/cs325/p4/

• Review: Perceptron, MIRA

• New: Neural Networks (Supervised version)

– +Hidden Layer

– +Training algorithms (forward-, backward-propagation)


Single Neuron as Linear Classifier

• Inputs are feature values

• Each feature has a weight

• Sum is the activation

• If the activation is:

– Positive, output +1

– Negative, output -1

[Figure: a single neuron; inputs f1, f2, f3 are multiplied by weights w1, w2, w3, summed, and the sum is compared to 0.]

Weights

• Binary case: compare features to a weight vector

• Learning: figure out the weight vector from examples

Example email feature vector f(x):

# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...

Weight vector w:

# free : 4
YOUR_NAME : -1
MISSPELLED : 1
FROM_FRIEND : -3
...

Another example feature vector f(x'):

# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
...

A positive dot product w · f(x) means the positive class.
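A quick sketch of this decision rule in JavaScript (the feature names and values are taken from the slide; the activation helper is illustrative, not part of any library):

// Activation = dot product of the weight vector and the feature counts.
function activation(weights, features) {
  var sum = 0;
  for (var name in features) {
    sum += (weights[name] || 0) * features[name];
  }
  return sum;
}

var w  = {'# free': 4, 'YOUR_NAME': -1, 'MISSPELLED': 1, 'FROM_FRIEND': -3};
var fx = {'# free': 2, 'YOUR_NAME': 0,  'MISSPELLED': 2, 'FROM_FRIEND': 0};

// Positive activation => positive class (+1); negative => -1.
var label = activation(w, fx) > 0 ? +1 : -1;   // 4*2 + 1*2 = 10 > 0, so +1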

Binary Decision Rule

• In the space of feature vectors

– Examples are points

– Any weight vector is a hyperplane

– One side corresponds to Y=+1

– Other corresponds to Y=-1

[Figure: an example weight vector (BIAS: -3, free: 4, money: 2, ...) drawn as a separating line in the (free, money) feature space; the +1 side is SPAM, the -1 side is HAM.]


Learning: Binary Perceptron

• Start with weights = 0

• For each training instance:

– Classify with current weights

– If correct (i.e., y=y*), no change!

– If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
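A minimal sketch of this update in JavaScript (plain arrays as feature vectors; the function name and conventions are mine, not from the slides):

// One binary perceptron step for a training example (x, yStar), yStar in {+1, -1}.
function binaryPerceptronStep(w, x, yStar) {
  var activation = 0;
  for (var i = 0; i < w.length; i++) activation += w[i] * x[i];
  var y = activation >= 0 ? +1 : -1;            // classify with current weights
  if (y !== yStar) {                            // wrong: add or subtract the feature vector
    for (var j = 0; j < w.length; j++) w[j] += yStar * x[j];
  }
}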

Multiclass Decision Rule

• If we have multiple classes:

– A weight vector w_y for each class y

– Score (activation) of a class y: w_y · f(x)

– Prediction: the class with the highest score wins

• Binary = multiclass where the negative class has weight vector zero

Learning: Multiclass Perceptron

• Start with all weights = 0

• Pick up training examples one by one

• Predict with current weights

• If correct, no change!

• If wrong: lower score of wrong answer, raise score of right answer
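And a matching sketch of the multiclass update (weights is an object mapping each class label to its weight vector; again illustrative, not from the slides):

// One multiclass perceptron step for a training example (x, yStar).
function multiclassPerceptronStep(weights, x, yStar) {
  var best = null, bestScore = -Infinity;
  for (var y in weights) {                      // predict: highest score wins
    var score = 0;
    for (var i = 0; i < x.length; i++) score += weights[y][i] * x[i];
    if (score > bestScore) { bestScore = score; best = y; }
  }
  if (best !== yStar) {                         // wrong: lower the wrong answer, raise the right one
    for (var j = 0; j < x.length; j++) {
      weights[best][j]  -= x[j];
      weights[yStar][j] += x[j];
    }
  }
}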

MIRA*: Fixing the Perceptron

• What about “outliers” – correct or incorrect data points with abnormal feature values?

• MIRA: choose an update size that fixes the current mistake…

• … but, minimizes the change to w

• The +1 margin in the update constraint helps it to generalize

* Margin Infused Relaxed Algorithm


Minimum Correcting Update

The minimum update is not τ = 0 (otherwise no error would have been made), so the minimum is attained where the constraint holds with equality.
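The equations for this slide did not survive the transcript; a standard statement of the minimum correcting update (consistent with the Berkeley slides this deck adapts, reproduced here as an assumption) is: after a mistake where class $y$ was predicted instead of the true class $y^*$, solve

$\min_{w'} \ \tfrac{1}{2}\sum_{y'} \|w'_{y'} - w_{y'}\|^2 \quad \text{s.t.} \quad w'_{y^*}\cdot f(x) \;\ge\; w'_{y}\cdot f(x) + 1$

whose solution changes only the two weight vectors involved:

$w_{y^*} \leftarrow w_{y^*} + \tau f(x), \qquad w_{y} \leftarrow w_{y} - \tau f(x), \qquad \tau = \dfrac{(w_{y} - w_{y^*})\cdot f(x) + 1}{2\, f(x)\cdot f(x)}$

The +1 on the right-hand side is the margin referred to above.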

Maximum Step Size

• In practice, it’s also bad to make updates that are too large

– Example may be labeled incorrectly

– You may not have enough features

– Solution: cap the maximum possible value of the update size τ with some constant C

– Corresponds to an optimization that assumes non-separable data

– Usually converges faster than perceptron

– Usually better, especially on noisy data

Linear Separators

• Which of these linear separators is optimal?

Support Vector Machines

• Maximizing the margin: good according to intuition, theory, practice

• Only support vectors matter; other training examples are ignorable

• Support vector machines (SVMs) find the separator with max margin

• Basically, SVMs are MIRA where you optimize over all examples at once

[Figure: MIRA and SVM compared side by side.]

Classification: Comparison

• Naïve Bayes:

– Builds a model of the training data

– Gives prediction probabilities

– Strong assumptions about feature independence

– One pass through the data (counting)

• Perceptrons / MIRA:

– Make fewer assumptions about the data

– Mistake-driven learning

– Multiple passes through the data (prediction)

– Often more accurate


Different Non-Linearly Separable Problems

Network structure vs. types of decision regions (the remaining columns of the slide's table showed example decision regions for two classes A and B on the exclusive-OR problem, on classes with meshed regions, and for the most general region shapes):

– Single-layer: half plane bounded by a hyperplane

– Two-layer: convex open or closed regions

– Three-layer: arbitrary regions (complexity limited by the number of nodes)

Non-Linearly Separable Data

• Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html


// ConvNetJS network from the demo: a single sigmoid neuron feeding a 2-class softmax.
layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:2});   // 2-D input points
layer_defs.push({type:'fc', num_neurons:1, activation:'sigmoid'});  // one sigmoid unit
layer_defs.push({type:'softmax', num_classes:2});                   // 2-class output

net = new convnetjs.Net();
net.makeLayers(layer_defs);

// Stochastic gradient descent trainer.
trainer = new convnetjs.SGDTrainer(net, {learning_rate:0.01, momentum:0.1,
                                         batch_size:10, l2_decay:0.001});

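To drive the demo's training loop, the ConvNetJS calls are roughly as follows (a sketch assuming the Vol constructor, trainer.train, and net.forward behave as in the classify2d demo; the sample point and label are made up):

// Train on one labeled 2-D point, then read off class probabilities.
var x = new convnetjs.Vol([0.5, -1.3]);   // a 1x1x2 volume holding the point
trainer.train(x, 0);                      // 0 or 1 = class label
var probs = net.forward(x).w;             // softmax probabilities for the two classes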

Multilayer Perceptron (MLP)

[Figure: a multilayer perceptron. Input signals (external stimuli) enter the input layer, pass through adjustable weights, and produce output values at the output layer.]

The Key Elements of Neural Networks

• Neural computing requires a number of neurons connected together into a neural network. Neurons are arranged in layers.

• Each neuron within the network is usually a simple processing unit which takes one or more inputs and produces an output. Every input to a neuron has an associated weight which modifies the strength of that input. The neuron adds together all of the weighted inputs and computes an output to be passed on.

[Figure: a single neuron with inputs p1, p2, p3, weights w1, w2, w3, and a bias input of 1 with weight b.]

$a = f(w_1 p_1 + w_2 p_2 + w_3 p_3 + b) = f\left(\sum_i w_i p_i + b\right)$


Activation functions


Sigmoid Unit

[Figure: a sigmoid unit with inputs x1, ..., xn, weights w1, ..., wn, and a bias input x0 = 1 with weight w0.]

$net = \sum_{i=0}^{n} w_i x_i$

$o = \sigma(net) = \dfrac{1}{1 + e^{-net}}$

$\sigma(x)$ is the sigmoid function $\dfrac{1}{1 + e^{-x}}$, with derivative $\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$

Derive gradient descent rules to train:

• one sigmoid unit:

$\dfrac{\partial E}{\partial w_i} = -\sum_d (t_d - o_d)\, o_d\, (1 - o_d)\, x_{i,d}$

• multilayer networks of sigmoid units: backpropagation (see the rules below)
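The backpropagation equations for the multilayer case did not survive the transcript; the standard rules for a network of sigmoid units (in the same $t_d$, $o_d$ style of notation) are:

$\delta_k = o_k (1 - o_k)(t_k - o_k)$ for each output unit $k$

$\delta_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\, \delta_k$ for each hidden unit $h$

$w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_{ji}$ for every weight, with learning rate $\eta$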

Online Demo: Add 2nd Layer

• Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html


// Same demo network, but the hidden fully-connected layer now has 5 sigmoid neurons.
layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:2});   // 2-D input points
layer_defs.push({type:'fc', num_neurons:5, activation:'sigmoid'});  // 5 sigmoid hidden units
layer_defs.push({type:'softmax', num_classes:2});                   // 2-class output

net = new convnetjs.Net();
net.makeLayers(layer_defs);

trainer = new convnetjs.SGDTrainer(net, {learning_rate:0.01, momentum:0.1,
                                         batch_size:10, l2_decay:0.001});

2-Layer Neural Network


Hidden Layers

• Intermediate hidden units are learned features:

– the hidden activation vector h behaves like our feature vector f(x) for a linear classifier, but h is “learned” automatically.

• What kind of features can be learned? Each hidden feature is a (non-linear) function of the input x.
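A tiny JavaScript sketch of the idea (V, w, and the helper names are illustrative assumptions, not from the slides): the hidden vector h is computed from x by the network, and a linear classifier then scores h instead of x.

function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// h = sigmoid(V x): each hidden unit is a learned, non-linear feature of the input x.
function hiddenFeatures(V, x) {
  return V.map(function(row) {
    var z = 0;
    for (var i = 0; i < x.length; i++) z += row[i] * x[i];
    return sigmoid(z);
  });
}

// A linear classifier applied to the learned features h.
function score(w, h) {
  var s = 0;
  for (var j = 0; j < h.length; j++) s += w[j] * h[j];
  return s;
}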


Feedforward NNs

• The basic structure of a feedforward neural network

• When the desired outputs are known, we have supervised learning (learning with a teacher).

(Feed-forward) Neural Network

[Figure: a single unit with inputs x1, x2, x3 and a bias input +1; the weights w1, w2, w3 and bias b feed a summation Σ that produces the output.]

(Feed-forward) Neural Network

• Stack many non-linear units (neurons)

[Figure: inputs x1, x2, x3 and a bias +1 (Layer 1, input) feed hidden units a1, a2 and a bias +1 (Layer 2, hidden), which feed the output unit (Layer 3, output). w_ij is the weight between the i-th input and the j-th hidden unit.]

Neural networks

• Very expressive:

– Able to learn highly non-linear functions

• Supervised training:

– Binary classification

– Multi-class classification

– Regression / composite loss

• Unsupervised training:

– Dimensionality reduction

– Complex sequence modeling

– High level feature learning

Backpropagation Algorithm – Main Idea: Error in Hidden Layers

The idea of the algorithm can be summarized as follows:

1. Compute the error term for the output units using the observed error.

2. Starting from the output layer, repeat

– propagating the error term back to the previous layer, and

– updating the weights between the two layers

until the earliest hidden layer is reached.

Backpropagation Algorithm

• Initialize weights (typically random!)

• Keep doing epochs

– For each example e in training set do

• forward pass to compute:

– O = neural-net-output(network, e)

– miss = (T - O) at each output unit

• backward pass to calculate deltas to weights

• update all weights

– end

• until tuning-set error stops improving

(The forward pass was explained earlier; the backward pass is explained on the next slide.)
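A compact sketch of this loop for a 1-hidden-layer sigmoid network with squared-error loss and a single output (plain JavaScript; the net object layout and helper name are mine, not the slides'):

function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// net = {W1: array of rows (one per hidden unit), b1: hidden biases, W2: output weights, b2: output bias}
function trainExample(net, x, t, eta) {
  // Forward pass: hidden activations h, then output o.
  var h = net.W1.map(function(row, j) {
    var z = net.b1[j];
    for (var i = 0; i < x.length; i++) z += row[i] * x[i];
    return sigmoid(z);
  });
  var zOut = net.b2;
  for (var j = 0; j < h.length; j++) zOut += net.W2[j] * h[j];
  var o = sigmoid(zOut);

  // Backward pass: delta at the output ("miss" times the sigmoid derivative), then hidden deltas.
  var deltaOut = (t - o) * o * (1 - o);
  var deltaHidden = h.map(function(hj, j) { return hj * (1 - hj) * net.W2[j] * deltaOut; });

  // Update all weights (one gradient step with learning rate eta).
  for (var j = 0; j < h.length; j++) {
    net.W2[j] += eta * deltaOut * h[j];
    net.b1[j] += eta * deltaHidden[j];
    for (var i = 0; i < x.length; i++) net.W1[j][i] += eta * deltaHidden[j] * x[i];
  }
  net.b2 += eta * deltaOut;
  return Math.pow(t - o, 2);   // squared error for this example
}

An epoch calls trainExample on every training example; as the pseudocode says, training stops when error on a held-out tuning set stops improving.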

Backward Pass

• Compute deltas to weights

– from hidden layer

– to output layer

• Without changing any weights (yet), compute the actual contributions

– within the hidden layer(s)

– and compute deltas

Training / Learning Details

• Method for learning weights in feed-forward (FF) nets

• Can't use the Perceptron Learning Rule:

– no teacher values are available for hidden units

• Use gradient descent to minimize the error:

– propagate deltas backward from the outputs, to the hidden layers, to the inputs, to adjust for errors

[Figure: arrows indicating the forward and backward directions through the network.]

Error Surface

[Figure: the error as a function of the weights, plotted as a surface in multidimensional space (error on the vertical axis, weights on the horizontal axes).]

Gradient

• Trying to make error decrease the fastest

• Compute the gradient: $\nabla E = \left[\dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \ldots, \dfrac{\partial E}{\partial w_n}\right]$

• Change the i-th weight by $\Delta w_i = -\alpha \dfrac{\partial E}{\partial w_i}$

• We need a derivative!

• The activation function must be continuous, differentiable, non-decreasing, and easy to compute

Derivatives of Error for Weights

[Figure: the derivatives of the error with respect to the weights are used to compute the deltas.]

Can’t use LTU

• To effectively assign credit / blame to units in hidden layers, we want to look at the first derivative of the activation function

• The sigmoid function is easy to differentiate and easy to compute forward

[Figure: a step-shaped linear threshold unit compared with the smooth sigmoid function.]

Training: Loss Function

• Stack many non-linear units (neurons)

[Figure: inputs x1, x2, x3 and a bias +1 (Layer 1, input) feed hidden units a1, a2 and a bias +1 (Layer 2, hidden), which feed the output unit (Layer 3, output); the output is compared to the target by a loss function.]

Loss functions

• Classification

– Cross entropy (binary and multiclass)

• Regression

– Squared deviation (mean squared error)
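For reference, the usual forms of these losses for a prediction $f(x)$ and target $y$ (standard definitions, not transcribed from the slides):

Binary cross entropy: $L(y, f(x)) = -\big[\, y \log f(x) + (1-y)\log(1 - f(x)) \,\big]$

Multiclass cross entropy: $L(y, f(x)) = -\sum_{c} y_c \log f_c(x)$

Squared error: $L(y, f(x)) = \|y - f(x)\|^2$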

Training: Loss Function

[Figure: the same network with the hidden activations labelled $a_1^{(2)}$, $a_2^{(2)}$ and the output labelled $a_1^{(3)}$; the output feeds the loss function.]

Training: Forward Pass

• Stack many non-linear units (neurons)

[Figure: the forward pass through the same 3-layer network; each layer k computes its activations from the activations of layer k-1, ending at the output and the loss function.]
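In the $W^{(l)}, b^{(l)}$ notation used on the back-propagation slides below, the forward pass for this 3-layer network computes (a reconstruction consistent with the $f(x)$ formula that follows):

$a^{(2)} = \sigma\big(W^{(1)} x + b^{(1)}\big), \qquad f(x) = a^{(3)} = \sigma\big(W^{(2)} a^{(2)} + b^{(2)}\big)$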

Backpropagation Network training

• 1. Initialize network with random weights

• 2. For all training cases (called examples):

– a. Present the training inputs to the network and calculate the output

– b. For all layers (starting with the output layer, back to the input layer):

• i. Compare the network output with the correct output (error function)

• ii. Adapt the weights in the current layer

The Learning Rule: Backpropagation

• The delta rule is often utilized by the most common class of ANNs, called backpropagation neural networks.

[Figure: training pairs of inputs and desired outputs.]

Training: Back-propagation

• Use gradient descent optimization

$f(x) = \sigma\!\left(W^{(2)} \sigma\!\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}\right)$

$\mathrm{Loss}(x, y; W, b) = \| y - f(x) \|^2$

What is back-propagation?

• An algorithm to compute the gradients of the loss with respect to the parameters (W, b)

Chain Rule for Derivatives

[Figure: x flows through g, then f, producing f(g(x)).]

[Figure: x flows through g1 and g2 in parallel, and f combines them, producing f(g1(x) + g2(x)); in general, f(g1(x) + ... + gk(x)).]
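The corresponding derivative rules (standard calculus, added because the slide equations did not survive the transcript):

$\dfrac{d}{dx} f(g(x)) = f'(g(x))\, g'(x)$

$\dfrac{d}{dx} f\big(g_1(x) + \cdots + g_k(x)\big) = f'\big(g_1(x) + \cdots + g_k(x)\big)\,\big(g_1'(x) + \cdots + g_k'(x)\big)$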


Gradients of W’s and b’s
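The gradient formulas on this slide did not survive the transcript; for the 2-layer network above with squared-error loss (using the $\tfrac{1}{2}\|y - f(x)\|^2$ convention so constants cancel), the standard back-propagated gradients (a reconstruction; $z^{(l)}$ denotes the pre-activation of layer $l$ and $\odot$ the element-wise product) are:

$\delta^{(3)} = -\big(y - f(x)\big) \odot \sigma'\big(z^{(3)}\big), \qquad \delta^{(2)} = \big(W^{(2)}\big)^{\top}\delta^{(3)} \odot \sigma'\big(z^{(2)}\big)$

$\dfrac{\partial \mathrm{Loss}}{\partial W^{(2)}} = \delta^{(3)}\big(a^{(2)}\big)^{\top}, \quad \dfrac{\partial \mathrm{Loss}}{\partial b^{(2)}} = \delta^{(3)}, \quad \dfrac{\partial \mathrm{Loss}}{\partial W^{(1)}} = \delta^{(2)} x^{\top}, \quad \dfrac{\partial \mathrm{Loss}}{\partial b^{(1)}} = \delta^{(2)}$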

How do we pick the learning rate α?

1. Tuning set, or

2. Cross validation, or

3. Keep it small for slow, conservative learning

How many hidden layers?

• Usually just one (i.e., a 2-layer net)

• How many hidden units in the layer?

– Too few ==> can’t learn

– Too many ==> poor generalization

How big a training set?

• Determine your target error rate, e

• Success rate is 1- e

• Typical training set approx. n/e, where n is the number of weights in the net

• Example:

– e = 0.1, n = 80 weights

– training set size 800

• A net trained until it reaches 95% correct classification on the training set should (typically) produce about 90% correct classification on the test set.

Back-propagation: Summary

• For each input (x,y)

• Compute forward pass (values of hidden units at all layers) using training data (x,y)

• For each layer

– Compute gradients of outputs w.r.t. inputs

• Compute the gradients of the loss with respect to the weights (chain rule)

• Update the weights (gradient step)

Neural Networks: Summary

[Figure: the running example network, from inputs x1, x2, x3 and bias +1 through weights w1, w2, w3 and bias b to the summation Σ and the output, across Layer 1 and Layer 2.]

Neural Networks: Applications

• Image Modeling / Classification

• Feature Learning (Deep Learning; many layers)

• Compositional Text Representation

• Sequence Modeling
