
Page 1:

Deep neural networks: the basics
Master’s Computer Vision

Sergey Nikolenko, Alex Davydow

Harbour Space University, Barcelona
May 19, 2020

Random facts:

• On May 19, 1655, the Anglo-Spanish War began with an invasion of Jamaica; soon after taking the island the English were starving and dying of disease; still, they persevered, Jamaica was ceded at the Treaty of Madrid and soon became a hugely profitable sugar colony

• On May 19, 1910, Earth passed through the tail of Halley’s Comet; anti-comet pills and anti-comet umbrellas were widely sold to the public; Mark Twain was born two weeks after the previous perihelion and died the day after the perihelion in 1910

• On May 19, 1980, Apple announced the Apple III, the first failure of Apple; the hardware often failed, there were problems with cooling and board design; Steve Wozniak estimated that Apple “put $100 million in advertising, promotion, and research and development into a product that was 3 percent of our revenues”

• On May 19, 2018, Prince Harry and Meghan Markle married at St George’s Chapel, Windsor; it was estimated that 1.9 billion were watching the ceremony worldwide

Page 2:

The brain and ANN history

Page 3:

Human brain as a computer

• The human brain is a pretty slow computer, but it can do some pretty amazing things.

• How does it do it?
• Lots of neurons: about 10^11 neurons with about 7000 synapses each, so in total about 10^14–10^15 connections (synapses).

Page 4:

Human brain as a computer

• Each neuron:
  • gets a signal through the dendrites from other neurons;
  • sends signals from time to time through the axon;
  • synapses are connections between one neuron’s axon and other neurons’ dendrites attached to it.

Page 5:

Human brain as a computer

• A neuron produces spikes stochastically.
• The firing rate is only 10 to 200 Hz.
• But we can recognize a face in a couple hundred milliseconds.
• This means that any sequential chain of computation must be very short.
• But the brain is heavily parallelized.

Page 6:

Human brain as a computer

• Here is how we process visual input:

Page 7:

Feature learning

• Another part of the story: feature learning.
• The brain can train very well on a very small data sample.
• And it can adapt to new data sources.
• How does it do it? Can we do the same?

Page 8:

Feature learning

• Systems for processing unstructured data look like

input → features → classifier

• For a long time good features had to be handmade:
  • MFCC for speech recognition;
  • SIFT for image processing;
  • ...

• Instead of manual feature engineering: can we learn good features automatically?

Page 9:

Plasticity

• Third point: neuroplasticity.

• It appears that there is a single learning algorithm (“The Master Algorithm”) that can be applied to very different situations.

Page 10:

The mathematical essence

• Generally speaking, all we do in machine learning is approximation and optimization of various functions.

• For example:
  • a function from a picture to what it shows;
  • even simpler: a binary function from an image’s pixels to “is it a cat or not”;
  • it’s a rather complicated function, and it is defined by a dataset (its values at a few points);

• how do we approximate it?

Page 11:

The mathematical essence

• Common approach in machine learning:
  • construct a class of models, usually parametric;
  • tune the parameters so that the model fits the data well;
  • i.e., we optimize a certain error function or quality metric.

• Neural networks are basically universal approximators, a very wide and flexible parametric class of functions.

• The problem is how to train them well.

Page 12:

ANN history

• Let’s try to mimic nature: get a large, highly connected structure made of small simple things.

• The idea of modeling neurons with simple functions is very, very old (McCulloch, Pitts, 1943).

• Artificial neural networks (even with several layers) were proposed, among others, by Turing (1948).

• Rosenblatt, 1958: the perceptron (one artificial neuron, we will see it in a minute): a linear separating surface.

• Training by gradient descent also appeared at the same time.

Page 13:

ANN history

• One neuron is modeled as follows:

• A linear combination of the inputs followed by a nonlinear activation function (a small code sketch follows below):

  y = h(w⊤x) = h( ∑_i w_i x_i ).

• How can you train a perceptron and what will be the result?
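
To make the formula concrete, here is a minimal NumPy sketch of a single artificial neuron; the weights, the inputs, and the choice of the sigmoid as h are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, x, h=sigmoid):
    """One artificial neuron: a nonlinearity h applied to the linear combination w^T x."""
    return h(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # made-up weights
x = np.array([1.0, 0.0, 0.5])    # made-up inputs
print(neuron(w, x))              # sigma(1.5) ≈ 0.82
```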

Page 14:

ANN history

• Training by gradient descent.
• Rosenblatt’s original perceptron (h = id), with the error summed over the set M of misclassified examples:

  E_P(w) = − ∑_{x∈M} y(x) (w⊤x),

  w^(τ+1) = w^(τ) − η ∇_w E_P(w) = w^(τ) + η t_n x_n.

• Or, e.g., we can do binary classification by plugging in the logistic sigmoid as the activation function, h(x) = σ(x) = 1/(1 + e^(−x)):

  E(w) = − (1/N) ∑_{i=1}^{N} ( y_i log σ(w⊤x_i) + (1 − y_i) log(1 − σ(w⊤x_i)) ).

• Basically logistic regression.
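
A minimal NumPy sketch of exactly this: a single sigmoid neuron (logistic regression) trained by gradient descent on the cross-entropy E(w). The toy data, learning rate, and number of steps are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy binary classification data (made up): label depends noisily on a linear score
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
eta = 0.5
for step in range(500):
    p = sigmoid(X @ w)               # sigma(w^T x_i) for every example
    grad = X.T @ (p - y) / len(y)    # gradient of the cross-entropy E(w)
    w -= eta * grad                  # gradient descent step

p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w, loss)
```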

Page 15:

ANN history

• The first perceptron (Rosenblatt, 1958) learned to recognize letters.

Page 16:

ANN history

• Hailed as a huge success:

Page 17:

ANN history

• Minsky’s critique: even with a nonlinearity, one perceptron can only have a linear separating surface.

Page 18:

ANN history

• Well, sure! You need to make a network of perceptrons:

• More about that later (actually, the whole course is about that).
• But first let’s see what people do with a single perceptron.

Page 19:

ANN history

• 1960s: studies of the perceptron.
• (Minsky, Papert, 1969): XOR cannot be modeled by a perceptron.
• For some reason, this was thought to be a big problem.
• (Bryson, Ho, 1969): backpropagation (in optimal control).
• (Werbos, 1974): rediscovered backpropagation for neural networks; (Rumelhart, Hinton, Williams, 1986) popularized it.
• Second half of the 1970s – multilayer ANNs, basically in the modern form.

• Deep models appeared in the early 1980s! So the idea itself is not new.

Page 20:

ANN history

• But by the early 1990s neural networks did not live up to the hype (in part for technical reasons).

• John Denker, 1994: “neural networks are the second best way of doing just about anything”.

• So in the 1990s, neural networks were all but forgotten again.
• Except for image processing (Yann LeCun) and a few more groups (Geoffrey Hinton, Yoshua Bengio).

Page 21:

ANN history

• The deep learning revolution began in 2006: Geoffrey Hinton’s group learned how to train Deep Belief Networks (DBNs).

• Main idea: unsupervised pretraining.
• Next component: more powerful computers, computations on GPUs.

• And much larger datasets.
• Then we got new regularization and optimization techniques (we’ll talk about those), and pretraining by now is not really necessary.

• These, together with the architectures, are the main components of the deep learning revolution.

Page 22:

Activation functions

• There are many different nonlinearities; let’s do a brief survey.
• Logistic sigmoid:

  σ(x) = 1 / (1 + e^(−x)).

Page 23:

Activation functions

• Hyperbolic tangent:

  tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).

• Very similar to the sigmoid in shape, but it is zero-centered, so zero is not a saturation point.

Page 24:

Activation functions

• Heaviside step function:

  step(x) = { 0, if x < 0;  1, if x > 0 }.

Page 25:

Activation functions

• Rectified linear unit (ReLU):

  ReLU(x) = { 0, if x < 0;  x, if x ≥ 0 }.

Page 26:

Activation functions

• Mathematical motivation for ReLU: a rectifier behaves approximately like a sum of sigmoid units with shifted biases,

  σ(x + 1/2) + σ(x − 1/2) + σ(x − 3/2) + σ(x − 5/2) + … .

Page 27:

Activation functions

• Biological motivation for ReLU: the firing rate of a (leaky) integrate-and-fire neuron as a function of the input current I,

  f(I) = ( τ log( (E + RI − V_reset) / (E + RI − V_th) ) )^(−1),  if E + RI > V_th,
  f(I) = 0,  if E + RI ≤ V_th.

Page 28:

Activation functions

• There are several ReLU variations. Leaky ReLU and Parametric ReLU:

  LReLU(x) = { ax, if x < 0;  x, if x ≥ 0 },

  where a is a small fixed slope in Leaky ReLU and a learned parameter in Parametric ReLU.

Page 29:

Activation functions

• Exponential linear unit:

  ELU(x) = { α(e^x − 1), if x < 0;  x, if x ≥ 0 }.

• The jury is still out, and there are probably more to come, but usually ReLU is just fine (a quick sketch of these activations follows below).
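
A quick NumPy sketch of the activation functions surveyed above; the slope a = 0.01 and α = 1.0 are common default choices, not values prescribed in the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x < 0, a * x, x)

def elu(x, alpha=1.0):
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

xs = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(xs), 3))
```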

Page 30:

Real neurons

• Do actual neurons do gradient descent?
• Probably not (we’ll see that backprop would be hard to do).
• Although Geoffrey Hinton had an interesting talk about it. But still, no.

• Hebbian learning – increase the weight between neurons that fire together (a minimal sketch follows below):

  Δw_ij = η x_i x_j.

• Led to Hopfield networks that kinda do associative memory.
• Current research: spike-timing-dependent plasticity; i.e., the increase depends on the actual temporal properties of the signals.
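
A minimal NumPy sketch of one Hebbian step, under the usual reading that Δw_ij couples the activities of units i and j; the activities, learning rate, and no-self-connection convention are illustrative assumptions.

```python
import numpy as np

eta = 0.1                        # learning rate (made up)
x = np.array([1.0, 0.0, 1.0])    # activities of three units (made up)
W = np.zeros((3, 3))             # weights between all pairs of units

W += eta * np.outer(x, x)        # Hebbian step: delta w_ij = eta * x_i * x_j
np.fill_diagonal(W, 0.0)         # Hopfield-style convention: no self-connections
print(W)
```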

Page 31:

Combining perceptrons into a network

• A network of perceptrons; outputs of one are inputs of another.
• Hornik, 1990: a two-layer ANN can approximate any continuous function (on a compact domain, to arbitrary precision).
• This is just theory, but in practice deep networks are indeed more expressive – distributed representations:

Page 32:

Combining perceptrons into a network

• Note also how NNs are usually organized in layers.
• This is natural and suggests easy parallelization.

Page 33:

Combining perceptrons into a network

• So we approximate a very complicated function with a large composition of simple functions.

• How do we train it?
• Simple answer: gradient descent. But there are complications here.

Page 34:

Gradient descent and computational graphs

Page 35:

Gradient descent

• Gradient descent: take the gradient of the error w.r.t. the weights and take a step in the opposite direction.

• Formally: for an error function E, targets y, and model f with parameters θ,

  E(θ) = ∑_{(x,y)∈D} E(f(x, θ), y),

  θ_t = θ_{t−1} − η ∇E(θ_{t−1}) = θ_{t−1} − η ∑_{(x,y)∈D} ∇E(f(x, θ_{t−1}), y).

• So we need to sum over the entire dataset for every step?!..

Page 36:

Gradient descent

• Hence, stochastic gradient descent: after every training sample, update

  θ_t = θ_{t−1} − η ∇E(f(x_t, θ_{t−1}), y_t).

• In practice people usually use mini-batches; they are easy to parallelize and smooth out excessive “stochasticity” (a sketch follows below).

• So far the only parameter is the learning rate η.
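
A minimal NumPy sketch of mini-batch SGD on a made-up linear regression problem; the data, learning rate η, and batch size are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy linear-regression data (made up): y = 3*x0 - 2*x1 + noise
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=1000)

theta = np.zeros(2)
eta, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(y))                 # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)   # gradient of the MSE on the mini-batch
        theta -= eta * grad                                 # SGD step

print(theta)   # should end up close to [3, -2]
```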

Page 37:

Gradient descent

• Lots of problems with η:

• We will get to them later; for now let’s concentrate on the step we certainly need: computing the derivatives.

Page 38:

Computational graph, fprop and bprop

• Let us represent a function as a composition of simple functions (“simple” meaning that we can take their derivatives).

• Example: f(x, y) = x^2 + xy + (x + y)^2:

Page 39:

Computational graph, fprop and bprop

• This way we can take the gradient with the chain rule:

(f ◦ g)′(x) = (f(g(x)))′ = f′(g(x))g′(x).

• This simply means that an increment δx results in

δf = f′(g(x))δg = f′(g(x))g′(x)δx.

• We only need to be able to take gradients, i.e., derivatives w.r.t. vectors:

  ∇_x f = ( ∂f/∂x_1, …, ∂f/∂x_n )⊤,

  ∇_x (f ∘ g) = ( ∂(f∘g)/∂x_1, …, ∂(f∘g)/∂x_n )⊤ = ( (∂f/∂g)(∂g/∂x_1), …, (∂f/∂g)(∂g/∂x_n) )⊤ = (∂f/∂g) ∇_x g.

Page 40:

Computational graph, fprop and bprop

• Or, if f depends on x in several different ways, f = f(g_1(x), g_2(x), …, g_k(x)), the increment δx now comes into play several times:

  ∂f/∂x = (∂f/∂g_1)(∂g_1/∂x) + … + (∂f/∂g_k)(∂g_k/∂x) = ∑_{i=1}^{k} (∂f/∂g_i)(∂g_i/∂x),

  ∇_x f = (∂f/∂g_1) ∇_x g_1 + … + (∂f/∂g_k) ∇_x g_k = ∑_{i=1}^{k} (∂f/∂g_i) ∇_x g_i.

• Note that we get multiplication by the Jacobian matrix:

  ∇_x f = ∇_x g ∇_g f,  where  ∇_x g = [ ∂g_j/∂x_i ]  is the n×k matrix with rows indexed by x_1, …, x_n and columns by g_1, …, g_k.

Page 41:

Computational graph, fprop and bprop

• Let’s now go back to the example:

Page 42:

Computational graph, fprop and bprop

• Forward propagation of derivatives (fprop): we compute ∂f/∂x by pushing the chain rule forward through the graph, one input variable at a time.

Page 43:

Computational graph, fprop and bprop

• Backpropagation (bprop): starting from the end node, go back through the graph as

  ∂f/∂g = ∑_{g′ ∈ Children(g)} (∂f/∂g′)(∂g′/∂g).
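
To make this rule concrete, here is a hand-written sketch of fprop and bprop on the running example f(x, y) = x^2 + xy + (x + y)^2; the intermediate node names a, b, s, c are our own labels for the graph nodes.

```python
# forward pass: compute the value of every node
x, y = 2.0, 3.0
a = x * x          # x^2
b = x * y          # x*y
s = x + y          # x + y
c = s * s          # (x + y)^2
f = a + b + c      # f(x, y) = 35.0 here

# backward pass: df/dg = sum over nodes g' that use g of (df/dg') * (dg'/dg)
df_df = 1.0
df_da = df_df * 1.0            # f = a + b + c
df_db = df_df * 1.0
df_dc = df_df * 1.0
df_ds = df_dc * 2 * s          # c = s^2
df_dx = df_da * 2 * x + df_db * y + df_ds * 1.0   # x feeds into a, b, and s
df_dy = df_db * x + df_ds * 1.0                   # y feeds into b and s

print(f, df_dx, df_dy)   # 35.0 17.0 12.0, matching 2x + y + 2(x+y) and x + 2(x+y)
```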

Page 44:

Computational graph, fprop and bprop

• Backprop is much better: we get all derivatives in a single pass through the graph.

• Aaaand... that’s it! We can now take the gradients of any complicated composition of simple functions.

• Which is all we need to apply gradient descent!
• The libraries – PyTorch, TensorFlow, Theano – are at their core automatic differentiation libraries. This is their main function (a short sketch follows below).

• So you can implement lots of “classical” models in TensorFlow and train them by gradient descent.

• And real neurons probably can’t do that: each unit would need two different outputs, one for the value and one for the derivative.
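
And a small sketch of the same computation done by an automatic differentiation library (PyTorch here), which builds the graph and runs backprop for us; the input values are the same made-up ones as above.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x**2 + x*y + (x + y)**2   # the computational graph is recorded automatically
f.backward()                  # backpropagation through that graph

print(f.item(), x.grad.item(), y.grad.item())   # 35.0, 17.0, 12.0
```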

Page 45:

Regularization in neural networks

Page 46:

Regularization in neural networks

• NNs have lots of parameters.
• Regularization is necessary.
• L2 or L1 regularization (adding λ ∑_w w^2 or λ ∑_w |w| to the objective) is called weight decay.
• Very easy to add, just another term in the objective function (see the sketch below).
• Sometimes still useful.
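
A minimal PyTorch sketch of weight decay, either written out as an extra λ∑w² term in the objective or delegated to the optimizer; the model, data, and λ are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # placeholder model
criterion = nn.MSELoss()
lam = 1e-4                                        # regularization coefficient (made up)
x, y = torch.randn(32, 10), torch.randn(32, 1)    # made-up batch

# option 1: explicit penalty term lambda * sum of w^2 added to the objective
loss = criterion(model(x), y) + lam * sum((p ** 2).sum() for p in model.parameters())

# option 2: roughly the same effect via the optimizer's weight_decay argument
# (in practice you would pick one of the two, not both)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)

loss.backward()
optimizer.step()
```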

Page 47:

Regularization in neural networks

• Regularization is very easy to add in all libraries. E.g., in Keras:
  • W_regularizer adds a regularizer on the weight matrix of a layer;
  • b_regularizer — on the bias vector;
  • activity_regularizer — on the output vector.
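
The argument names above are from an older Keras version; in current tf.keras they are called kernel_regularizer, bias_regularizer, and activity_regularizer. A short sketch (layer sizes and coefficients are made up):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(100,)),
    keras.layers.Dense(
        64,
        activation="relu",
        kernel_regularizer=keras.regularizers.l2(1e-4),    # on the weight matrix
        bias_regularizer=keras.regularizers.l2(1e-4),      # on the bias vector
        activity_regularizer=keras.regularizers.l1(1e-5),  # on the layer's output
    ),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
```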

Page 48:

Regularization in neural networks

• Second way: early stopping.
• Let’s just stop when the error on the validation set begins to increase!

• Also implemented out of the box in Keras, via callbacks.
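
A minimal tf.keras sketch of early stopping via a callback; the model, the toy data, and the patience value are placeholders.

```python
import numpy as np
from tensorflow import keras

X = np.random.randn(1000, 20).astype("float32")   # made-up data
y = (X[:, 0] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stop], verbose=0)
```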

Page 49:

Regularization in neural networks

• Third way – max-norm constraint.
• Let’s bound the norm of the vector of weights for every neuron artificially:

  ‖w‖_2 ≤ c.

• We can do it during optimization as well: when w leaves the ball of radius c, project it back (see the sketch below).
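
A minimal NumPy sketch of the projection step; in tf.keras roughly the same effect is available as a kernel constraint (e.g., keras.constraints.MaxNorm). The example vector and the radius c are made up.

```python
import numpy as np

def max_norm_project(w, c):
    """If w has left the L2 ball of radius c, rescale it back onto the boundary."""
    norm = np.linalg.norm(w)
    return w if norm <= c else w * (c / norm)

w = np.array([3.0, 4.0])            # ||w||_2 = 5
w = max_norm_project(w, c=3.0)      # projected back: norm becomes 3
print(w, np.linalg.norm(w))
```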

Page 50:

Dropout

• But there are NN-specific ways too.
• Dropout: let’s just remove some units at random with probability p! (Srivastava et al., 2014)

Page 51:

Dropout

• It turns out we are sampling a huge number of networks, and each neuron gets an “average” activation from a lot of different architectures.

• How do we use a network trained like that? Should we sample a lot of networks and average them out?

Page 52:

Dropout

• To apply, simply multiply the result by 1/p (preserving the average output):

• And you can usually take p = 1/2; there’s seldom any big difference (a sketch follows below).
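
A minimal NumPy sketch of dropout in its “inverted” form, where the surviving activations are rescaled during training and the network is used as-is at test time; keep_prob here is the probability of keeping a unit, so with the slides’ p = 1/2 the conventions coincide.

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(h, keep_prob=0.5, train=True):
    """Inverted dropout: zero units at random and rescale by 1/keep_prob while training."""
    if not train:
        return h                      # at test time the layer is unchanged
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

h = np.ones(10)                       # made-up activations of one layer
print(dropout(h))                     # about half the units zeroed, survivors doubled
print(dropout(h, train=False))        # identical to h
```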

Page 53:

Dropout

• Dropout improves everything (large NNs, convolutional NNs) a lot... but why? WTF?

Page 54:

Dropout

• Idea 1: we are making the units learn features by themselves, without relying on the others.

Page 55:

Dropout

• Similarly, this leads to sparsity:

Page 56:

Dropout

• Idea 2: we are kind of averaging a huge number of networks with shared weights, training each for one step.

• It’s like bootstrapping or boosting taken to the extreme, and we know xgboost rules Kaggle.

Page 57:

Dropout

• Idea 3: this is just like sex!
• How does sexual reproduction work?
• It’s important to get not just a good combination of genes but a stable good combination.

Page 58:

Dropout

• Idea 4: a neuron sends out activation a with probability 0.5.
• Let’s do it the other way around: send out 0.5 (or 1) with probability a (a/2).

• Same expectation, larger variance for small p (which is not so bad).

• And thus we get stochastic neurons that send out signals at random intervals – just like the brain!

• Also brings improvements but needs less communication between the neurons (one bit instead of a float).

• Maybe stochastic neurons work as a dropout regularizer! And maybe that’s why we are able to actually think of something.

Page 59:

Dropout

• Idea 5: dropout is a special form of prior.
• The most rigorous and useful idea; it helped fix dropout in recurrent networks.

• But this requires a Bayesian view of neural networks, so probably not now.

Page 60:

Conclusion

• Thus, dropout is a very cool method.

• But it is now losing popularity (almost lost it altogether) due to batch normalization and new versions of gradient descent.

• We’ll talk about them later, but first let’s tackle weight initialization.

Page 61:

Thank you!

Thank you for your attention!
