
Supervised Learning: Artificial Neural Networks

Some slides adapted from Dan Klein et al. (Berkeley) and Percy Liang (Stanford)

Plan

• Project 4: http://www.mathcs.emory.edu/~eugene/cs325/p4/

• Review: Perceptron, MIRA

• New: Neural Networks (Supervised version)

– +Hidden Layer

– +Training algorithms (forward-, backward-propagation)


Single Neuron as Linear Classifier

• Inputs are feature values

• Each feature has a weight

• Sum is the activation

• If the activation is:

– Positive, output +1

– Negative, output -1

[Figure: a single neuron; inputs f1, f2, f3 are multiplied by weights w1, w2, w3, summed, and the sum is compared to 0.]

Weights

• Binary case: compare features to a weight vector

• Learning: figure out the weight vector from examples

Example email feature vector f(x):

# free : 2
YOUR_NAME : 0
MISSPELLED : 2
FROM_FRIEND : 0
...

Weight vector w:

# free : 4
YOUR_NAME : -1
MISSPELLED : 1
FROM_FRIEND : -3
...

Another example feature vector f(x'):

# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
...

A positive dot product w · f(x) means the positive class.
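A quick sketch of this decision rule in JavaScript (the feature names and values are taken from the slide; the activation helper is illustrative, not part of any library):

// Activation = dot product of the weight vector and the feature counts.
function activation(weights, features) {
  var sum = 0;
  for (var name in features) {
    sum += (weights[name] || 0) * features[name];
  }
  return sum;
}

var w  = {'# free': 4, 'YOUR_NAME': -1, 'MISSPELLED': 1, 'FROM_FRIEND': -3};
var fx = {'# free': 2, 'YOUR_NAME': 0,  'MISSPELLED': 2, 'FROM_FRIEND': 0};

// Positive activation => positive class (+1); negative => -1.
var label = activation(w, fx) > 0 ? +1 : -1;   // 4*2 + 1*2 = 10 > 0, so +1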

Binary Decision Rule

• In the space of feature vectors

– Examples are points

– Any weight vector is a hyperplane

– One side corresponds to Y=+1

– Other corresponds to Y=-1

[Figure: an example weight vector (BIAS: -3, free: 4, money: 2, ...) drawn as a separating line in the (free, money) feature space; the +1 side is SPAM, the -1 side is HAM.]


Learning: Binary Perceptron

• Start with weights = 0

• For each training instance:

– Classify with current weights

– If correct (i.e., y=y*), no change!

– If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
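A minimal sketch of this update in JavaScript (plain arrays as feature vectors; the function name and conventions are mine, not from the slides):

// One binary perceptron step for a training example (x, yStar), yStar in {+1, -1}.
function binaryPerceptronStep(w, x, yStar) {
  var activation = 0;
  for (var i = 0; i < w.length; i++) activation += w[i] * x[i];
  var y = activation >= 0 ? +1 : -1;            // classify with current weights
  if (y !== yStar) {                            // wrong: add or subtract the feature vector
    for (var j = 0; j < w.length; j++) w[j] += yStar * x[j];
  }
}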

Multiclass Decision Rule

• If we have multiple classes:

– A weight vector w_y for each class y

– Score (activation) of a class y: w_y · f(x)

– Prediction: the class with the highest score wins

• Binary = multiclass where the negative class has weight vector zero

Learning: Multiclass Perceptron

• Start with all weights = 0

• Pick up training examples one by one

• Predict with current weights

• If correct, no change!

• If wrong: lower score of wrong answer, raise score of right answer
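And a matching sketch of the multiclass update (weights is an object mapping each class label to its weight vector; again illustrative, not from the slides):

// One multiclass perceptron step for a training example (x, yStar).
function multiclassPerceptronStep(weights, x, yStar) {
  var best = null, bestScore = -Infinity;
  for (var y in weights) {                      // predict: highest score wins
    var score = 0;
    for (var i = 0; i < x.length; i++) score += weights[y][i] * x[i];
    if (score > bestScore) { bestScore = score; best = y; }
  }
  if (best !== yStar) {                         // wrong: lower the wrong answer, raise the right one
    for (var j = 0; j < x.length; j++) {
      weights[best][j]  -= x[j];
      weights[yStar][j] += x[j];
    }
  }
}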

MIRA*: Fixing the Perceptron

• What about “outliers” – correct or incorrect data points with abnormal feature values?

• MIRA: choose an update size that fixes the current mistake…

• … but, minimizes the change to w

• The +1 margin in the update constraint helps it to generalize

* Margin Infused Relaxed Algorithm


Minimum Correcting Update

The minimum update is not τ = 0 (otherwise no error would have been made), so the minimum is attained where the constraint holds with equality.
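The equations for this slide did not survive the transcript; a standard statement of the minimum correcting update (consistent with the Berkeley slides this deck adapts, reproduced here as an assumption) is: after a mistake where class $y$ was predicted instead of the true class $y^*$, solve

$\min_{w'} \ \tfrac{1}{2}\sum_{y'} \|w'_{y'} - w_{y'}\|^2 \quad \text{s.t.} \quad w'_{y^*}\cdot f(x) \;\ge\; w'_{y}\cdot f(x) + 1$

whose solution changes only the two weight vectors involved:

$w_{y^*} \leftarrow w_{y^*} + \tau f(x), \qquad w_{y} \leftarrow w_{y} - \tau f(x), \qquad \tau = \dfrac{(w_{y} - w_{y^*})\cdot f(x) + 1}{2\, f(x)\cdot f(x)}$

The +1 on the right-hand side is the margin referred to above.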

Maximum Step Size

• In practice, it’s also bad to make updates that are too large

– Example may be labeled incorrectly

– You may not have enough features

– Solution: cap the maximum possible value of the update size τ with some constant C

– Corresponds to an optimization that assumes non-separable data

– Usually converges faster than perceptron

– Usually better, especially on noisy data

Linear Separators

• Which of these linear separators is optimal?

Support Vector Machines

• Maximizing the margin: good according to intuition, theory, practice

• Only support vectors matter; other training examples are ignorable

• Support vector machines (SVMs) find the separator with max margin

• Basically, SVMs are MIRA where you optimize over all examples at once

[Figure: MIRA and SVM compared side by side.]

Classification: Comparison

• Naïve Bayes:

– Builds a model of the training data

– Gives prediction probabilities

– Strong assumptions about feature independence

– One pass through the data (counting)

• Perceptrons / MIRA:

– Make fewer assumptions about the data

– Mistake-driven learning

– Multiple passes through the data (prediction)

– Often more accurate


Different Non-Linearly Separable Problems

Network structure vs. types of decision regions (the remaining columns of the slide's table showed example decision regions for two classes A and B on the exclusive-OR problem, on classes with meshed regions, and for the most general region shapes):

– Single-layer: half plane bounded by a hyperplane

– Two-layer: convex open or closed regions

– Three-layer: arbitrary regions (complexity limited by the number of nodes)

Non-Linearly Separable Data

• Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html


// ConvNetJS network from the demo: a single sigmoid neuron feeding a 2-class softmax.
layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:2});   // 2-D input points
layer_defs.push({type:'fc', num_neurons:1, activation:'sigmoid'});  // one sigmoid unit
layer_defs.push({type:'softmax', num_classes:2});                   // 2-class output

net = new convnetjs.Net();
net.makeLayers(layer_defs);

// Stochastic gradient descent trainer.
trainer = new convnetjs.SGDTrainer(net, {learning_rate:0.01, momentum:0.1,
                                         batch_size:10, l2_decay:0.001});

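To drive the demo's training loop, the ConvNetJS calls are roughly as follows (a sketch assuming the Vol constructor, trainer.train, and net.forward behave as in the classify2d demo; the sample point and label are made up):

// Train on one labeled 2-D point, then read off class probabilities.
var x = new convnetjs.Vol([0.5, -1.3]);   // a 1x1x2 volume holding the point
trainer.train(x, 0);                      // 0 or 1 = class label
var probs = net.forward(x).w;             // softmax probabilities for the two classes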

Multilayer Perceptron (MLP)

[Figure: a multilayer perceptron. Input signals (external stimuli) enter the input layer, pass through adjustable weights, and produce output values at the output layer.]

The Key Elements of Neural Networks

• Neural computing requires a number of neurons connected together into a neural network. Neurons are arranged in layers.

• Each neuron within the network is usually a simple processing unit which takes one or more inputs and produces an output. Every input to a neuron has an associated weight which modifies the strength of that input. The neuron adds together all of the weighted inputs and computes an output to be passed on.

[Figure: a single neuron with inputs p1, p2, p3, weights w1, w2, w3, and a bias input of 1 with weight b.]

$a = f(w_1 p_1 + w_2 p_2 + w_3 p_3 + b) = f\left(\sum_i w_i p_i + b\right)$


Activation functions


Sigmoid Unit

[Figure: a sigmoid unit with inputs x1, ..., xn, weights w1, ..., wn, and a bias input x0 = 1 with weight w0.]

$net = \sum_{i=0}^{n} w_i x_i$

$o = \sigma(net) = \dfrac{1}{1 + e^{-net}}$

$\sigma(x)$ is the sigmoid function $\dfrac{1}{1 + e^{-x}}$, with derivative $\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$

Derive gradient descent rules to train:

• one sigmoid unit:

$\dfrac{\partial E}{\partial w_i} = -\sum_d (t_d - o_d)\, o_d\, (1 - o_d)\, x_{i,d}$

• multilayer networks of sigmoid units: backpropagation (see the rules below)
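The backpropagation equations for the multilayer case did not survive the transcript; the standard rules for a network of sigmoid units (in the same $t_d$, $o_d$ style of notation) are:

$\delta_k = o_k (1 - o_k)(t_k - o_k)$ for each output unit $k$

$\delta_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\, \delta_k$ for each hidden unit $h$

$w_{ji} \leftarrow w_{ji} + \eta\, \delta_j\, x_{ji}$ for every weight, with learning rate $\eta$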

Online Demo: Add 2nd Layer

• Demo: http://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html


// Same demo network, but the hidden fully-connected layer now has 5 sigmoid neurons.
layer_defs = [];
layer_defs.push({type:'input', out_sx:1, out_sy:1, out_depth:2});   // 2-D input points
layer_defs.push({type:'fc', num_neurons:5, activation:'sigmoid'});  // 5 sigmoid hidden units
layer_defs.push({type:'softmax', num_classes:2});                   // 2-class output

net = new convnetjs.Net();
net.makeLayers(layer_defs);

trainer = new convnetjs.SGDTrainer(net, {learning_rate:0.01, momentum:0.1,
                                         batch_size:10, l2_decay:0.001});

2-Layer Neural Network


Hidden Layers

• Intermediate hidden units are learned features:

– the hidden activation vector h behaves like our feature vector f(x) for a linear classifier, but h is “learned” automatically.

• What kind of features can be learned? Each hidden feature is a (non-linear) function of the input x.
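A tiny JavaScript sketch of the idea (V, w, and the helper names are illustrative assumptions, not from the slides): the hidden vector h is computed from x by the network, and a linear classifier then scores h instead of x.

function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// h = sigmoid(V x): each hidden unit is a learned, non-linear feature of the input x.
function hiddenFeatures(V, x) {
  return V.map(function(row) {
    var z = 0;
    for (var i = 0; i < x.length; i++) z += row[i] * x[i];
    return sigmoid(z);
  });
}

// A linear classifier applied to the learned features h.
function score(w, h) {
  var s = 0;
  for (var j = 0; j < h.length; j++) s += w[j] * h[j];
  return s;
}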


Feedforward NNs

• The basic structure of a feedforward neural network

• When the desired outputs are known, we have supervised learning (learning with a teacher).

(Feed-forward) Neural Network

[Figure: a single unit with inputs x1, x2, x3 and a bias input +1; the weights w1, w2, w3 and bias b feed a summation Σ that produces the output.]

(Feed-forward) Neural Network

• Stack many non-linear units (neurons)

[Figure: inputs x1, x2, x3 and a bias +1 (Layer 1, input) feed hidden units a1, a2 and a bias +1 (Layer 2, hidden), which feed the output unit (Layer 3, output). w_ij is the weight between the i-th input and the j-th hidden unit.]

Neural networks

• Very expressive:

– Able to learn highly non-linear functions

• Supervised training:

– Binary classification

– Multi-class classification

– Regression / composite loss

• Unsupervised training:

– Dimensionality reduction

– Complex sequence modeling

– High level feature learning

Backpropagation Algorithm – Main Idea: Error in Hidden Layers

The idea of the algorithm can be summarized as follows:

1. Compute the error term for the output units using the observed error.

2. Starting from the output layer, repeat

– propagating the error term back to the previous layer, and

– updating the weights between the two layers

until the earliest hidden layer is reached.

Backpropagation Algorithm

• Initialize weights (typically random!)

• Keep doing epochs

– For each example e in training set do

• forward pass to compute:

– O = neural-net-output(network, e)

– miss = (T - O) at each output unit

• backward pass to calculate deltas to weights

• update all weights

– end

• until tuning-set error stops improving

(The forward pass was explained earlier; the backward pass is explained on the next slide.)
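A compact sketch of this loop for a 1-hidden-layer sigmoid network with squared-error loss and a single output (plain JavaScript; the net object layout and helper name are mine, not the slides'):

function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// net = {W1: array of rows (one per hidden unit), b1: hidden biases, W2: output weights, b2: output bias}
function trainExample(net, x, t, eta) {
  // Forward pass: hidden activations h, then output o.
  var h = net.W1.map(function(row, j) {
    var z = net.b1[j];
    for (var i = 0; i < x.length; i++) z += row[i] * x[i];
    return sigmoid(z);
  });
  var zOut = net.b2;
  for (var j = 0; j < h.length; j++) zOut += net.W2[j] * h[j];
  var o = sigmoid(zOut);

  // Backward pass: delta at the output ("miss" times the sigmoid derivative), then hidden deltas.
  var deltaOut = (t - o) * o * (1 - o);
  var deltaHidden = h.map(function(hj, j) { return hj * (1 - hj) * net.W2[j] * deltaOut; });

  // Update all weights (one gradient step with learning rate eta).
  for (var j = 0; j < h.length; j++) {
    net.W2[j] += eta * deltaOut * h[j];
    net.b1[j] += eta * deltaHidden[j];
    for (var i = 0; i < x.length; i++) net.W1[j][i] += eta * deltaHidden[j] * x[i];
  }
  net.b2 += eta * deltaOut;
  return Math.pow(t - o, 2);   // squared error for this example
}

An epoch calls trainExample on every training example; as the pseudocode says, training stops when error on a held-out tuning set stops improving.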

Backward Pass

• Compute deltas to weights

– from hidden layer

– to output layer

• Without changing any weights (yet), compute the actual contributions

– within the hidden layer(s)

– and compute deltas

Training / Learning Details

• Method for learning weights in feed-forward (FF) nets

• Can't use the Perceptron Learning Rule:

– no teacher values are available for hidden units

• Use gradient descent to minimize the error:

– propagate deltas backward from the outputs, to the hidden layers, to the inputs, to adjust for errors

[Figure: arrows indicating the forward and backward directions through the network.]

Error Surface

[Figure: the error as a function of the weights, plotted as a surface in multidimensional space (error on the vertical axis, weights on the horizontal axes).]

Gradient

• Trying to make error decrease the fastest

• Compute the gradient: $\nabla E = \left[\dfrac{\partial E}{\partial w_1}, \dfrac{\partial E}{\partial w_2}, \ldots, \dfrac{\partial E}{\partial w_n}\right]$

• Change the i-th weight by $\Delta w_i = -\alpha \dfrac{\partial E}{\partial w_i}$

• We need a derivative!

• The activation function must be continuous, differentiable, non-decreasing, and easy to compute

Derivatives of Error for Weights

[Figure: the derivatives of the error with respect to the weights are used to compute the deltas.]

Can’t use LTU

• To effectively assign credit / blame to units in hidden layers, we want to look at the first derivative of the activation function

• The sigmoid function is easy to differentiate and easy to compute forward

[Figure: a step-shaped linear threshold unit compared with the smooth sigmoid function.]

Training: Loss Function

• Stack many non-linear units (neurons)

[Figure: inputs x1, x2, x3 and a bias +1 (Layer 1, input) feed hidden units a1, a2 and a bias +1 (Layer 2, hidden), which feed the output unit (Layer 3, output); the output is compared to the target by a loss function.]

Loss functions

• Classification

– Cross entropy (binary and multiclass)

• Regression

– Squared deviation (mean squared error)
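For reference, the usual forms of these losses for a prediction $f(x)$ and target $y$ (standard definitions, not transcribed from the slides):

Binary cross entropy: $L(y, f(x)) = -\big[\, y \log f(x) + (1-y)\log(1 - f(x)) \,\big]$

Multiclass cross entropy: $L(y, f(x)) = -\sum_{c} y_c \log f_c(x)$

Squared error: $L(y, f(x)) = \|y - f(x)\|^2$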

Training: Loss Function

[Figure: the same network with the hidden activations labelled $a_1^{(2)}$, $a_2^{(2)}$ and the output labelled $a_1^{(3)}$; the output feeds the loss function.]

Training: Forward Pass

• Stack many non-linear units (neurons)

[Figure: the forward pass through the same 3-layer network; each layer k computes its activations from the activations of layer k-1, ending at the output and the loss function.]
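In the $W^{(l)}, b^{(l)}$ notation used on the back-propagation slides below, the forward pass for this 3-layer network computes (a reconstruction consistent with the $f(x)$ formula that follows):

$a^{(2)} = \sigma\big(W^{(1)} x + b^{(1)}\big), \qquad f(x) = a^{(3)} = \sigma\big(W^{(2)} a^{(2)} + b^{(2)}\big)$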

Backpropagation Network training

• 1. Initialize network with random weights

• 2. For all training cases (called examples):

– a. Present the training inputs to the network and calculate the output

– b. For all layers (starting with the output layer, back to the input layer):

• i. Compare the network output with the correct output (error function)

• ii. Adapt the weights in the current layer

The Learning Rule: Backpropagation

• The delta rule is often utilized by the most common class of ANNs, called backpropagation neural networks.

[Figure: training pairs of inputs and desired outputs.]

Training: Back-propagation

• Use gradient descent optimization

$f(x) = \sigma\!\left(W^{(2)} \sigma\!\left(W^{(1)} x + b^{(1)}\right) + b^{(2)}\right)$

$\mathrm{Loss}(x, y; W, b) = \| y - f(x) \|^2$

What is back-propagation?

• An algorithm to compute the gradients of the loss with respect to the parameters (W, b)

Chain Rule for Derivatives

[Figure: x flows through g, then f, producing f(g(x)).]

[Figure: x flows through g1 and g2 in parallel, and f combines them, producing f(g1(x) + g2(x)); in general, f(g1(x) + ... + gk(x)).]
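The corresponding derivative rules (standard calculus, added because the slide equations did not survive the transcript):

$\dfrac{d}{dx} f(g(x)) = f'(g(x))\, g'(x)$

$\dfrac{d}{dx} f\big(g_1(x) + \cdots + g_k(x)\big) = f'\big(g_1(x) + \cdots + g_k(x)\big)\,\big(g_1'(x) + \cdots + g_k'(x)\big)$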


Gradients of W’s and b’s
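The gradient formulas on this slide did not survive the transcript; for the 2-layer network above with squared-error loss (using the $\tfrac{1}{2}\|y - f(x)\|^2$ convention so constants cancel), the standard back-propagated gradients (a reconstruction; $z^{(l)}$ denotes the pre-activation of layer $l$ and $\odot$ the element-wise product) are:

$\delta^{(3)} = -\big(y - f(x)\big) \odot \sigma'\big(z^{(3)}\big), \qquad \delta^{(2)} = \big(W^{(2)}\big)^{\top}\delta^{(3)} \odot \sigma'\big(z^{(2)}\big)$

$\dfrac{\partial \mathrm{Loss}}{\partial W^{(2)}} = \delta^{(3)}\big(a^{(2)}\big)^{\top}, \quad \dfrac{\partial \mathrm{Loss}}{\partial b^{(2)}} = \delta^{(3)}, \quad \dfrac{\partial \mathrm{Loss}}{\partial W^{(1)}} = \delta^{(2)} x^{\top}, \quad \dfrac{\partial \mathrm{Loss}}{\partial b^{(1)}} = \delta^{(2)}$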

How do we pick the learning rate α?

1. Tuning set, or

2. Cross validation, or

3. Keep it small for slow, conservative learning

How many hidden layers?

• Usually just one (i.e., a 2-layer net)

• How many hidden units in the layer?

– Too few ==> can’t learn

– Too many ==> poor generalization

How big a training set?

• Determine your target error rate, e

• Success rate is 1- e

• Typical training set approx. n/e, where n is the number of weights in the net

• Example:

– e = 0.1, n = 80 weights

– training set size 800

• A net trained until it reaches 95% correct classification on the training set should (typically) produce about 90% correct classification on the test set.

Back-propagation: Summary

• For each input (x,y)

• Compute forward pass (values of hidden units at all layers) using training data (x,y)

• For each layer

– Compute gradients of outputs w.r.t. inputs

• Compute the gradients of the loss with respect to the weights (chain rule)

• Update the weights (gradient step)

Neural Networks: Summary

[Figure: the running example network, from inputs x1, x2, x3 and bias +1 through weights w1, w2, w3 and bias b to the summation Σ and the output, across Layer 1 and Layer 2.]

Neural Networks: Applications

• Image Modeling / Classification

• Feature Learning (Deep Learning; many layers)

• Compositional Text Representation

• Sequence Modeling
