
Neural Networks

CAP5610 Machine Learning

Instructor: Guo-Jun Qi

Recap: linear classifier

• Logistic regression
  • Maximize the posterior distribution of class Y conditional on the input vector X

• Support vector machines
  • Maximize the margin
  • Hard margin: subject to the constraint that no training error is made
  • Soft margin: minimize slack variables that measure how much each training example violates the classification rule

• Extended to nonlinear classifiers with the kernel trick
  • Map input vectors to a high-dimensional space
  • A linear classifier in the high-dimensional space is nonlinear in the original space

Building nonlinear classifier

• With a network of logistic units?

• A single logistic unit is a linear classifier f: X → Y

  f(X) = 1 / (1 + exp(−W_0 − Σ_{n=1}^{N} W_n X_n))

Graph representation of a logistic unit

• Input layer: an input vector X = (X_1, …, X_N)

• Output: the logistic function of the input features

A logistic unit as a neuron:

• Input layer: an input vector X = (X_1, …, X_N)

• Activation: the weighted sum of the input features, a = W_0 + Σ_{n=1}^{N} W_n X_n

• Activation function: the logistic function h applied to the weighted sum

• Output: z = ℎ(𝑎)
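As a minimal illustration (not from the slides), a single logistic unit can be written in a few lines of Python/NumPy; the weight values below are arbitrary placeholders:

    import numpy as np

    def logistic(a):
        # logistic (sigmoid) activation function h
        return 1.0 / (1.0 + np.exp(-a))

    def logistic_unit(x, w, w0):
        # activation: weighted sum of the input features plus bias
        a = w0 + np.dot(w, x)
        # output: activation function applied to the weighted sum
        return logistic(a)

    # example call with arbitrary weights
    x = np.array([0.5, -1.0])
    w = np.array([2.0, 1.0])
    print(logistic_unit(x, w, w0=-0.5))   # a number in (0, 1)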

Neural Network: Multiple layers of neurons

• The output of each layer serves as the input to the layer above

An example

• A three-layer neural network with inputs x_1, x_2, hidden units z_1, z_2, and output y_1 (network diagram omitted; weights w_{ji}^{(1)} connect the inputs to the hidden units, weights w_{1j}^{(2)} connect the hidden units to the output, and w_{j0}^{(1)}, w_{10}^{(2)} are the biases):

  a_1^{(1)} = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{10}^{(1)},   z_1 = h(a_1^{(1)})

  a_2^{(1)} = w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{20}^{(1)},   z_2 = h(a_2^{(1)})

  a_1^{(2)} = w_{11}^{(2)} z_1 + w_{12}^{(2)} z_2 + w_{10}^{(2)},   y_1 = f(a_1^{(2)})
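A minimal NumPy sketch of this forward pass, assuming logistic activations for both h and f; the weight values are placeholders for illustration, not taken from the slides:

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, W1, b1, W2, b2):
        # first layer: a_j^(1) = sum_n w_jn^(1) x_n + w_j0^(1),  z_j = h(a_j^(1))
        a1 = W1 @ x + b1
        z = logistic(a1)
        # second layer: a_1^(2) = sum_j w_1j^(2) z_j + w_10^(2),  y_1 = f(a_1^(2))
        a2 = W2 @ z + b2
        y = logistic(a2)
        return y, z

    # placeholder weights for a 2-2-1 network
    W1 = np.array([[1.0, -1.0], [0.5, 2.0]]); b1 = np.array([0.1, -0.2])
    W2 = np.array([[1.5, -0.5]]);             b2 = np.array([0.3])
    y, z = forward(np.array([1.0, 0.0]), W1, b1, W2, b2)
    print(y)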

XOR Problem

• It is impossible to linearly separate these two classes

[Figure: the four XOR points in the (x_1, x_2) plane; (0,0) and (1,1) belong to one class, (1,0) and (0,1) to the other, and no single line separates them.]

XOR Problem

• The two classes become separable by thresholding the output y_1 at 0.5

[Figure: a 2-2-1 network with inputs x_1, x_2, hidden units z_1, z_2, and output y_1; the trained weights and biases shown on the edges are −11.62, 12.88, 10.99, −13.13, 13.34, 13.13, −6.06, −6.56, −7.19.]

Input  Output
(0,0)  0.057
(1,0)  0.949
(0,1)  0.946
(1,1)  0.052
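A small sketch of the idea. The weights below are hand-picked illustrative values that solve XOR, not the trained weights from the slide figure, whose exact assignment to edges is not recoverable from the extraction:

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    # hand-picked weights: z1 acts like OR, z2 like AND, output like "OR and not AND"
    W1 = np.array([[20.0, 20.0],     # z1 ≈ OR(x1, x2)
                   [20.0, 20.0]])    # z2 ≈ AND(x1, x2)
    b1 = np.array([-10.0, -30.0])
    W2 = np.array([20.0, -20.0])     # y1 is high when z1 is on and z2 is off
    b2 = -10.0

    for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
        z = logistic(W1 @ np.array(x, dtype=float) + b1)
        y = logistic(W2 @ z + b2)
        print(x, round(float(y), 3), "-> class", int(y > 0.5))  # threshold at 0.5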

Application: Drive a car

• Input: a real-time video stream captured by a camera

• Output: signals that steer the car
  • Ranging from sharp left, through straight ahead, to sharp right

Training Neural Network

• Given a training set of M examples {(x^{(i)}, t^{(i)}) | i = 1, …, M}

• Training the neural network amounts to minimizing the squared error between the network outputs and the target values:

  min_w L(w) = (1/2) Σ_{i=1}^{M} (y^{(i)} − t^{(i)})^2

where y^{(i)} is the network output for x^{(i)}, which depends on the network parameters w.

Recap: Gradient Descent Method

• Gradient descent is an iterative algorithm
  • The counterpart of hill climbing, which finds the peak of a "mountain" by always walking uphill

• At each point, compute the gradient

• The gradient is the vector that points in the direction of steepest ascent up the mountain

• At each point, w is updated by moving a step of size λ against the gradient (downhill), since we are minimizing L:

  ∇L = (∂L/∂w_0, ∂L/∂w_1, …, ∂L/∂w_N)

  w ← w − λ ∇L(w)
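A minimal sketch of this update rule on a toy one-dimensional least-squares problem; the toy data and step size are arbitrary choices for illustration:

    import numpy as np

    # toy 1-D least-squares problem: fit y ≈ w * x
    x = np.array([0.0, 1.0, 2.0, 3.0])
    t = np.array([0.1, 0.9, 2.1, 2.9])

    w = 0.0
    lam = 0.05  # step size λ
    for _ in range(200):
        y = w * x
        grad = np.sum((y - t) * x)   # dL/dw for L = 1/2 * sum (y - t)^2
        w = w - lam * grad           # w ← w − λ ∇L(w)
    print(w)  # close to 1.0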

Stochastic Gradient Descent Method

• Making the learning algorithm scalable to big data

• Compute the gradient of the squared error for a single example at a time

  L(w) = (1/2) Σ_{i=1}^{M} (y^{(i)} − t^{(i)})^2 = Σ_{i=1}^{M} L^{(i)}(w),   where L^{(i)}(w) = (1/2) (y^{(i)} − t^{(i)})^2

  Full gradient:  ∇L = (∂L/∂w_0, ∂L/∂w_1, …, ∂L/∂w_N)

  Per-example gradient:  ∇L^{(i)} = (∂L^{(i)}/∂w_0, ∂L^{(i)}/∂w_1, …, ∂L^{(i)}/∂w_N)

• At each step, pick one example i and update w ← w − λ ∇L^{(i)}(w)
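A minimal sketch of the per-example (stochastic) update on the same toy problem as above; the shuffling scheme and step size are illustrative choices:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    t = np.array([0.1, 0.9, 2.1, 2.9])

    rng = np.random.default_rng(0)
    w, lam = 0.0, 0.05
    for epoch in range(100):
        for i in rng.permutation(len(x)):
            y_i = w * x[i]
            grad_i = (y_i - t[i]) * x[i]   # ∇L^(i) for L^(i) = 1/2 (y^(i) − t^(i))^2
            w = w - lam * grad_i           # update using a single example
    print(w)  # again close to 1.0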

Boiling down to computing the gradient

[Figure: the 2-2-1 network with inputs x_1, x_2, hidden units z_1, z_2, and output y_1; the second-layer weights w_{11}^{(2)}, w_{12}^{(2)} and bias w_{10}^{(2)} are highlighted.]

Square loss:

  L = (1/2) Σ_k (y_k − t_k)^2,   where y_k = f(a_k^{(2)}) and a_k^{(2)} = Σ_j w_{kj}^{(2)} z_j

Derivative with respect to a parameter in the second layer:

  ∂L/∂w_{ki}^{(2)} = (∂L/∂a_k^{(2)}) (∂a_k^{(2)}/∂w_{ki}^{(2)}) = (∂L/∂a_k^{(2)}) z_i

Derivative with respect to the activation in the second layer:

  ∂L/∂a_k^{(2)} = (y_k − t_k) ∂y_k/∂a_k^{(2)} = (y_k − t_k) f'(a_k^{(2)})

Boiling down to computing the gradient

• Computing the derivatives with respect to the parameters in the first layer

[Figure: the same 2-2-1 network, with the first-layer weights w_{11}^{(1)}, w_{21}^{(1)}, w_{12}^{(1)}, w_{22}^{(1)} and the biases highlighted.]

First-layer activation:

  a_j^{(1)} = Σ_n w_{jn}^{(1)} x_n

Relation between the activations of the first and second layers:

  a_k^{(2)} = Σ_j w_{kj}^{(2)} h(a_j^{(1)})

By the chain rule:

  ∂L/∂a_j^{(1)} = Σ_k (∂L/∂a_k^{(2)}) (∂a_k^{(2)}/∂a_j^{(1)}) = h'(a_j^{(1)}) Σ_k (∂L/∂a_k^{(2)}) w_{kj}^{(2)}

The derivative with respect to a parameter in the first layer:

  ∂L/∂w_{jn}^{(1)} = (∂L/∂a_j^{(1)}) (∂a_j^{(1)}/∂w_{jn}^{(1)}) = (∂L/∂a_j^{(1)}) x_n
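A quick numerical sanity check of these derivatives (not in the slides): compare the chain-rule gradient of one first-layer weight against a finite-difference estimate, assuming logistic activations for both h and f and omitting the biases for brevity:

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def loss(W1, W2, x, t):
        z = logistic(W1 @ x)              # a_j^(1) = Σ_n w_jn^(1) x_n (biases omitted)
        y = logistic(W2 @ z)
        return 0.5 * np.sum((y - t) ** 2)

    rng = np.random.default_rng(0)
    x = np.array([0.3, -0.7]); t = np.array([1.0])
    W1 = rng.normal(size=(2, 2)); W2 = rng.normal(size=(1, 2))

    # analytic gradient of w_11^(1) via the chain rule
    a1 = W1 @ x; z = logistic(a1)
    a2 = W2 @ z; y = logistic(a2)
    d2 = (y - t) * y * (1 - y)            # ∂L/∂a_k^(2)
    d1 = z * (1 - z) * (W2.T @ d2)        # ∂L/∂a_j^(1)
    analytic = d1[0] * x[0]               # ∂L/∂w_11^(1) = (∂L/∂a_1^(1)) x_1

    # finite-difference estimate
    eps = 1e-6
    W1p = W1.copy(); W1p[0, 0] += eps
    numeric = (loss(W1p, W2, x, t) - loss(W1, W2, x, t)) / eps
    print(analytic, numeric)              # the two numbers should agree closely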

Summary: Back propagation

• For each training example (x, t):

  • For each output unit k:  δ_k^{(2)} = (y_k − t_k) f'(a_k^{(2)})

  • For each hidden unit j:  δ_j^{(1)} = h'(a_j^{(1)}) Σ_k δ_k^{(2)} w_{kj}^{(2)}

[Figure: δ_1^{(2)} attached to the output unit, δ_1^{(1)} and δ_2^{(1)} to the hidden units.]

Summary: Back propagation (2)

• For each training example (x, t):

  • For each weight w_{ki}^{(2)}:  ∂L/∂w_{ki}^{(2)} = δ_k^{(2)} z_i

    • Update: w_{ki}^{(2)} ← w_{ki}^{(2)} − λ δ_k^{(2)} z_i

  • For each weight w_{jn}^{(1)}:  ∂L/∂w_{jn}^{(1)} = δ_j^{(1)} x_n

    • Update: w_{jn}^{(1)} ← w_{jn}^{(1)} − λ δ_j^{(1)} x_n

(λ is the gradient-descent step size.)
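Putting the two summary slides together, a minimal NumPy sketch of stochastic backprop updates for the 2-2-1 network; using the logistic function for both h and f is an assumption here (the slides leave f generic):

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_step(x, t, W1, b1, W2, b2, lam=0.5):
        # forward pass
        a1 = W1 @ x + b1;  z = logistic(a1)        # hidden layer
        a2 = W2 @ z + b2;  y = logistic(a2)        # output layer
        # deltas (for the logistic function, f'(a) = f(a)(1 - f(a)))
        d2 = (y - t) * y * (1 - y)                  # δ_k^(2)
        d1 = z * (1 - z) * (W2.T @ d2)              # δ_j^(1)
        # gradient-descent updates with step size λ
        W2 -= lam * np.outer(d2, z);  b2 -= lam * d2
        W1 -= lam * np.outer(d1, x);  b1 -= lam * d1
        return W1, b1, W2, b2

    # train on XOR with single-example updates
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
    W2 = rng.normal(size=(1, 2)); b2 = np.zeros(1)
    X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
    T = np.array([0, 1, 1, 0], dtype=float)
    for _ in range(50000):
        i = rng.integers(len(X))
        W1, b1, W2, b2 = backprop_step(X[i], T[i], W1, b1, W2, b2)
    # should be close to [[0, 1, 1, 0]]; 2-2-1 XOR training can occasionally stall in a local minimum
    print(logistic(W2 @ logistic(W1 @ X.T + b1[:, None]) + b2))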

Regularized Square Error

• Add a zero-mean Gaussian prior on the weights, w_{ij}^{(l)} ~ N(0, σ²)

• The MAP estimate of w minimizes

  L^{(i)}(w) = (1/2) (y^{(i)} − t^{(i)})^2 + (γ/2) Σ_{l,i,j} (w_{ij}^{(l)})^2

Summary: Back propagation with regularization

• For each training example (x, t):

  • For each weight w_{ki}^{(2)}:  ∂L/∂w_{ki}^{(2)} = δ_k^{(2)} z_i + γ w_{ki}^{(2)}

    • Update: w_{ki}^{(2)} ← w_{ki}^{(2)} − λ (δ_k^{(2)} z_i + γ w_{ki}^{(2)})

  • For each weight w_{jn}^{(1)}:  ∂L/∂w_{jn}^{(1)} = δ_j^{(1)} x_n + γ w_{jn}^{(1)}

    • Update: w_{jn}^{(1)} ← w_{jn}^{(1)} − λ (δ_j^{(1)} x_n + γ w_{jn}^{(1)})
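A minimal sketch of the regularized update as a standalone helper; applying the decay only to the weights (not the biases) and the value of γ are assumptions here, not stated in the slides:

    import numpy as np

    def regularized_update(W, grad, lam=0.1, gamma=0.01):
        # one gradient-descent step on the regularized loss:
        # w ← w − λ (∂L/∂w + γ w)
        return W - lam * (grad + gamma * W)

    # usage with a second-layer gradient δ_k^(2) z_i (made-up numbers)
    d2 = np.array([0.2])
    z = np.array([0.7, 0.1])
    W2 = np.array([[1.5, -0.5]])
    print(regularized_update(W2, np.outer(d2, z)))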

Multiple outputs encoding multiple classes

• MNIST: ten classes of digits

• Encoding multiple classes as multiple outputs (one-hot targets):
  • An output variable is set to 1 if the corresponding class is the correct one for the example
  • Otherwise, the output is set to 0

• The posterior probability of an example belonging to class k:

  P(Class_k | x) = y_k / Σ_{k'=1}^{K} y_{k'}
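A minimal sketch of this normalization over the K outputs; the ten-class output vector here is a made-up example:

    import numpy as np

    # made-up outputs y_k of a 10-output network for one MNIST-style example
    y = np.array([0.02, 0.01, 0.90, 0.05, 0.01, 0.03, 0.02, 0.01, 0.04, 0.01])

    posterior = y / y.sum()        # P(Class_k | x) = y_k / Σ_k' y_k'
    predicted_class = int(np.argmax(posterior))
    print(posterior.round(3), predicted_class)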

Overfitting

• Tune the number of update iterations on a validation set (early stopping), as sketched below
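A sketch of that recipe: keep the parameters that give the lowest validation error. The helpers train_step and validation_error below are hypothetical stand-ins (dummies so the loop runs on its own), not functions from the slides:

    import copy

    def train_step(w):          # dummy: stands in for one backprop update
        return w + 1

    def validation_error(w):    # dummy: error falls, then rises again (overfitting)
        return (w - 50) ** 2

    w, best_w, best_err = 0, 0, float("inf")
    for iteration in range(200):
        w = train_step(w)
        err = validation_error(w)
        if err < best_err:                       # keep the best model seen so far
            best_err, best_w = err, copy.deepcopy(w)
    print(best_w, best_err)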

How expressive is NN?

• Boolean functions:
  • Every Boolean function can be represented by a network with a single hidden layer
  • But it might require an exponential number of hidden units

• Continuous functions:
  • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
  • Any function can be approximated to arbitrary accuracy by a network with two hidden layers

Learning feature representation by neural networks

• A compact representation for high-dimensional input vectors
  • e.g., a large image with thousands of pixels

• High-dimensional input vectors
  • may cause the curse of dimensionality
  • need more examples for training (see Lecture 1)
  • do not capture the intrinsic variations well: an arbitrary point in a high-dimensional space probably does not represent a valid real object

• A meaningful low-dimensional space is preferred!

Autoencoder

• Set the target output equal to the input

• The hidden layer serves as the feature representation, since it contains sufficient information to reconstruct the input at the output layer
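A minimal NumPy sketch of the idea, assuming a single hidden layer, logistic activations, and squared reconstruction error; everything beyond "output = input" (dimensions, data, step size) is an illustrative assumption:

    import numpy as np

    def logistic(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    X = rng.random((100, 8))                             # 100 made-up 8-dimensional inputs
    W1 = rng.normal(0, 0.1, (3, 8)); b1 = np.zeros(3)    # encoder: 8 -> 3
    W2 = rng.normal(0, 0.1, (8, 3)); b2 = np.zeros(8)    # decoder: 3 -> 8

    lam = 0.1
    for _ in range(500):
        for x in X:
            z = logistic(W1 @ x + b1)           # hidden code (the learned features)
            y = logistic(W2 @ z + b2)           # reconstruction of the input
            d2 = (y - x) * y * (1 - y)          # output deltas for squared error
            d1 = z * (1 - z) * (W2.T @ d2)      # hidden deltas (backprop)
            W2 -= lam * np.outer(d2, z); b2 -= lam * d2
            W1 -= lam * np.outer(d1, x); b1 -= lam * d1
    # average reconstruction error over the training inputs
    print(np.mean((logistic(W2 @ logistic(W1 @ X.T + b1[:, None]) + b2[:, None]) - X.T) ** 2))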

An example

Deep Learning: A Deep Feature Representation

• If you stack multiple hidden layers that are trained to reconstruct the input at the output layer, the upper hidden layers provide a deep feature representation

Summary

• Neural networks: multiple layers of neurons
  • Each upper-layer neuron applies an activation function to a weighted sum of the outputs of the neurons in the layer below

• BP (backpropagation) training: a stochastic gradient descent method
  • Errors are propagated from the output layer down to the hidden and input layers

• Autoencoder: learning a feature representation
