
Page 1:

Deep neural networks: the basics
Master’s Computer Vision

Sergey Nikolenko, Alex Davydow

Harbour Space University, Barcelona
May 19, 2020

Random facts:

• On May 19, 1655, the Anglo-Spanish War began with an invasion of Jamaica; soon after taking the island the English were starving and dying of disease; still, they persevered, Jamaica was ceded at the Treaty of Madrid and soon became a hugely profitable sugar colony

• On May 19, 1910, Earth passed through the tail of Halley’s Comet; anti-comet pills and anti-comet umbrellas were widely sold to the public; Mark Twain was born two weeks after the previous perihelion and died the day after the perihelion in 1910

• On May 19, 1980, Apple announced the Apple III, the first failure of Apple; the hardware often failed, there were problems with cooling and board design; Steve Wozniak estimated that Apple “put $100 million in advertising, promotion, and research and development into a product that was 3 percent of our revenues”

• On May 19, 2018, Prince Harry and Meghan Markle married at St George’s Chapel, Windsor; it was estimated that 1.9 billion were watching the ceremony worldwide

Page 2:

The brain and ANN history

Page 3:

Human brain as a computer

• The human brain is a pretty slow computer, but it can do some pretty amazing things.

• How does it do it?
• Lots of neurons: about 10^11 neurons with about 7000 synapses each, so in total about 10^14–10^15 connections (synapses).

Page 4:

Human brain as a computer

• Each neuron:
  • gets a signal through the dendrites from other neurons;
  • sends signals from time to time through the axon;
  • synapses are connections between one neuron’s axon and other neurons’ dendrites attached to it.

Page 5:

Human brain as a computer

• A neuron produces spikes stochastically.
• The firing rate is only 10 to 200 Hz.
• But we can recognize a face in a couple hundred milliseconds.
• This means that any sequential chain of computation must be very short.
• But the brain is heavily parallelized.

Page 6:

Human brain as a computer

• Here is how we process visual input:

Page 7:

Feature learning

• Another part of the story: feature learning.
• The brain can train very well on a very small data sample.
• And it can adapt to new data sources.
• How does it do it? Can we do the same?

Page 8:

Feature learning

• Systems for processing unstructured data look like

input → features → classifier

• For a long time good features had to be handmade:
  • MFCC for speech recognition;
  • SIFT for image processing;
  • ...

• Instead of manual feature engineering: can we learn good features automatically?

Page 9:

Plasticity

• Third point: neuroplasticity.

• It appears that there is a single learning algorithm (“The Master Algorithm”) that can be applied to very different situations.

Page 10:

The mathematical essence

• Generally speaking, all we do in machine learning is approximation and optimization of various functions.

• For example:
  • a function from a picture to what it shows;
  • even simpler: a binary function from an image’s pixels to “is it a cat or not”;
  • it’s a rather complicated function, and it is defined by a dataset (its values at a few points);

• how do we approximate it?

Page 11:

The mathematical essence

• Common approach in machine learning:
  • construct a class of models, usually parametric;
  • tune the parameters so that the model fits the data well;
  • i.e., we optimize a certain error function or quality metric.

• Neural networks are basically universal approximators, a very wide and flexible parametric class of functions.

• The problem is how to train them well.

Page 12:

ANN history

• Let’s try to mimic nature: get a large, highly connected structure made of small simple things.

• The idea of modeling neurons with simple functions is very, very old (McCulloch, Pitts, 1943).

• Artificial neural networks (even with several layers) were proposed, among others, by Turing (1948).

• Rosenblatt, 1958: the perceptron (one artificial neuron, we will see it in a minute): a linear separating surface.

• Training by gradient descent also appeared at the same time.

Page 13:

ANN history

• One neuron is modeled as follows:

• A linear combination of the inputs followed by a nonlinear activation function (a small code sketch follows below):

  y = h(w⊤x) = h( ∑_i w_i x_i ).

• How can you train a perceptron and what will be the result?
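
To make the formula concrete, here is a minimal NumPy sketch of a single artificial neuron; the weights, the inputs, and the choice of the sigmoid as h are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, x, h=sigmoid):
    """One artificial neuron: a nonlinearity h applied to the linear combination w^T x."""
    return h(np.dot(w, x))

w = np.array([0.5, -1.0, 2.0])   # made-up weights
x = np.array([1.0, 0.0, 0.5])    # made-up inputs
print(neuron(w, x))              # sigma(1.5) ≈ 0.82
```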

Page 14:

ANN history

• Training by gradient descent.
• Rosenblatt’s original perceptron (h = id), with the error summed over the set M of misclassified examples:

  E_P(w) = − ∑_{x∈M} y(x) (w⊤x),

  w^(τ+1) = w^(τ) − η ∇_w E_P(w) = w^(τ) + η t_n x_n.

• Or, e.g., we can do binary classification by plugging in the logistic sigmoid as the activation function, h(x) = σ(x) = 1/(1 + e^(−x)):

  E(w) = − (1/N) ∑_{i=1}^{N} ( y_i log σ(w⊤x_i) + (1 − y_i) log(1 − σ(w⊤x_i)) ).

• Basically logistic regression.
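
A minimal NumPy sketch of exactly this: a single sigmoid neuron (logistic regression) trained by gradient descent on the cross-entropy E(w). The toy data, learning rate, and number of steps are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy binary classification data (made up): label depends noisily on a linear score
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
eta = 0.5
for step in range(500):
    p = sigmoid(X @ w)               # sigma(w^T x_i) for every example
    grad = X.T @ (p - y) / len(y)    # gradient of the cross-entropy E(w)
    w -= eta * grad                  # gradient descent step

p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(w, loss)
```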

Page 15:

ANN history

• The first perceptron (Rosenblatt, 1958) learned to recognize letters.

Page 16:

ANN history

• Hailed as a huge success:

Page 17:

ANN history

• Minsky’s critique: even with a nonlinearity, one perceptron can only have a linear separating surface.

Page 18:

ANN history

• Well, sure! You need to make a network of perceptrons:

• More about that later (actually, the whole course is about that).
• But first let’s see what people do with a single perceptron.

Page 19:

ANN history

• 1960s: studies of the perceptron.
• (Minsky, Papert, 1969): XOR cannot be modeled by a perceptron.
• For some reason, this was thought to be a big problem.
• (Bryson, Ho, 1969): backpropagation (in optimal control).
• (Werbos, 1974): rediscovered backpropagation for neural networks; (Rumelhart, Hinton, Williams, 1986) popularized it.
• Second half of the 1970s – multilayer ANNs, basically in the modern form.

• Deep models appeared in the early 1980s! So the idea itself is not new.

Page 20:

ANN history

• But by the early 1990s neural networks did not live up to the hype (in part for technical reasons).

• John Denker, 1994: “neural networks are the second best way of doing just about anything”.

• So in the 1990s, neural networks were all but forgotten again.
• Except for image processing (Yann LeCun) and a few more groups (Geoffrey Hinton, Yoshua Bengio).

Page 21:

ANN history

• The deep learning revolution began in 2006: Geoffrey Hinton’s group learned how to train Deep Belief Networks (DBNs).

• Main idea: unsupervised pretraining.
• Next component: more powerful computers, computations on GPUs.

• And much larger datasets.
• Then we got new regularization and optimization techniques (we’ll talk about those), and pretraining by now is not really necessary.

• These, together with the architectures, are the main components of the deep learning revolution.

Page 22:

Activation functions

• There are many different nonlinearities; let’s do a brief survey.
• Logistic sigmoid:

  σ(x) = 1 / (1 + e^(−x)).

Page 23:

Activation functions

• Hyperbolic tangent:

  tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)).

• Very similar to the sigmoid in shape, but it is zero-centered, so zero is not a saturation point.

Page 24:

Activation functions

• Heaviside step function:

  step(x) = { 0, if x < 0;  1, if x > 0 }.

Page 25:

Activation functions

• Rectified linear unit (ReLU):

  ReLU(x) = { 0, if x < 0;  x, if x ≥ 0 }.

Page 26:

Activation functions

• Mathematical motivation for ReLU: a rectifier behaves approximately like a sum of sigmoid units with shifted biases,

  σ(x + 1/2) + σ(x − 1/2) + σ(x − 3/2) + σ(x − 5/2) + … .

Page 27:

Activation functions

• Biological motivation for ReLU: the firing rate of a (leaky) integrate-and-fire neuron as a function of the input current I,

  f(I) = ( τ log( (E + RI − V_reset) / (E + RI − V_th) ) )^(−1),  if E + RI > V_th,
  f(I) = 0,  if E + RI ≤ V_th.

Page 28:

Activation functions

• There are several ReLU variations. Leaky ReLU and Parametric ReLU:

  LReLU(x) = { ax, if x < 0;  x, if x ≥ 0 },

  where a is a small fixed slope in Leaky ReLU and a learned parameter in Parametric ReLU.

Page 29:

Activation functions

• Exponential linear unit:

  ELU(x) = { α(e^x − 1), if x < 0;  x, if x ≥ 0 }.

• The jury is still out, and there are probably more to come, but usually ReLU is just fine (a quick sketch of these activations follows below).
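
A quick NumPy sketch of the activation functions surveyed above; the slope a = 0.01 and α = 1.0 are common default choices, not values prescribed in the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    return np.where(x < 0, a * x, x)

def elu(x, alpha=1.0):
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

xs = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(xs), 3))
```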

Page 30:

Real neurons

• Do actual neurons do gradient descent?
• Probably not (we’ll see that backprop would be hard to do).
• Although Geoffrey Hinton had an interesting talk about it. But still, no.

• Hebbian learning – increase the weight between neurons that fire together (a minimal sketch follows below):

  Δw_ij = η x_i x_j.

• Led to Hopfield networks that kinda do associative memory.
• Current research: spike-timing-dependent plasticity; i.e., the increase depends on the actual temporal properties of the signals.
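
A minimal NumPy sketch of one Hebbian step, under the usual reading that Δw_ij couples the activities of units i and j; the activities, learning rate, and no-self-connection convention are illustrative assumptions.

```python
import numpy as np

eta = 0.1                        # learning rate (made up)
x = np.array([1.0, 0.0, 1.0])    # activities of three units (made up)
W = np.zeros((3, 3))             # weights between all pairs of units

W += eta * np.outer(x, x)        # Hebbian step: delta w_ij = eta * x_i * x_j
np.fill_diagonal(W, 0.0)         # Hopfield-style convention: no self-connections
print(W)
```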

Page 31:

Combining perceptrons into a network

• A network of perceptrons; outputs of one are inputs of another.
• Hornik, 1990: a two-layer ANN can approximate any continuous function (on a compact domain, to arbitrary precision).
• This is just theory, but in practice deep networks are indeed more expressive – distributed representations:

Page 32:

Combining perceptrons into a network

• Note also how NNs are usually organized in layers.
• This is natural and suggests easy parallelization.

Page 33:

Combining perceptrons into a network

• So we approximate a very complicated function with a large composition of simple functions.

• How do we train it?
• Simple answer: gradient descent. But there are complications here.

Page 34:

Gradient descent and computational graphs

Page 35:

Gradient descent

• Gradient descent: take the gradient of the error w.r.t. the weights and take a step in the opposite direction.

• Formally: for an error function E, targets y, and model f with parameters θ,

  E(θ) = ∑_{(x,y)∈D} E(f(x, θ), y),

  θ_t = θ_{t−1} − η ∇E(θ_{t−1}) = θ_{t−1} − η ∑_{(x,y)∈D} ∇E(f(x, θ_{t−1}), y).

• So we need to sum over the entire dataset for every step?!..

Page 36:

Gradient descent

• Hence, stochastic gradient descent: after every training sample, update

  θ_t = θ_{t−1} − η ∇E(f(x_t, θ_{t−1}), y_t).

• In practice people usually use mini-batches; they are easy to parallelize and smooth out excessive “stochasticity” (a sketch follows below).

• So far the only parameter is the learning rate η.
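
A minimal NumPy sketch of mini-batch SGD on a made-up linear regression problem; the data, learning rate η, and batch size are illustrative, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy linear-regression data (made up): y = 3*x0 - 2*x1 + noise
X = rng.normal(size=(1000, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=1000)

theta = np.zeros(2)
eta, batch_size = 0.05, 32

for epoch in range(20):
    idx = rng.permutation(len(y))                 # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)   # gradient of the MSE on the mini-batch
        theta -= eta * grad                                 # SGD step

print(theta)   # should end up close to [3, -2]
```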

Page 37:

Gradient descent

• Lots of problems with η:

• We will get to them later; for now let’s concentrate on the step we certainly need: computing the derivatives.

Page 38:

Computational graph, fprop and bprop

• Let us represent a function as a composition of simple functions (“simple” meaning that we can take their derivatives).

• Example: f(x, y) = x^2 + xy + (x + y)^2:

Page 39:

Computational graph, fprop and bprop

• This way we can take the gradient with the chain rule:

(f ◦ g)′(x) = (f(g(x)))′ = f′(g(x))g′(x).

• This simply means that an increment δx results in

δf = f′(g(x))δg = f′(g(x))g′(x)δx.

• We only need to be able to take gradients, i.e., derivatives w.r.t. vectors:

  ∇_x f = ( ∂f/∂x_1, …, ∂f/∂x_n )⊤,

  ∇_x (f ∘ g) = ( ∂(f∘g)/∂x_1, …, ∂(f∘g)/∂x_n )⊤ = ( (∂f/∂g)(∂g/∂x_1), …, (∂f/∂g)(∂g/∂x_n) )⊤ = (∂f/∂g) ∇_x g.

Page 40:

Computational graph, fprop and bprop

• Or, if f depends on x in several different ways, f = f(g_1(x), g_2(x), …, g_k(x)), the increment δx now comes into play several times:

  ∂f/∂x = (∂f/∂g_1)(∂g_1/∂x) + … + (∂f/∂g_k)(∂g_k/∂x) = ∑_{i=1}^{k} (∂f/∂g_i)(∂g_i/∂x),

  ∇_x f = (∂f/∂g_1) ∇_x g_1 + … + (∂f/∂g_k) ∇_x g_k = ∑_{i=1}^{k} (∂f/∂g_i) ∇_x g_i.

• Note that we get multiplication by the Jacobian matrix:

  ∇_x f = ∇_x g ∇_g f,  where  ∇_x g = [ ∂g_j/∂x_i ]  is the n×k matrix with rows indexed by x_1, …, x_n and columns by g_1, …, g_k.

Page 41:

Computational graph, fprop and bprop

• Let’s now go back to the example:

Page 42:

Computational graph, fprop and bprop

• Forward propagation of derivatives (fprop): we compute ∂f/∂x by pushing the chain rule forward through the graph, one input variable at a time.

Page 43:

Computational graph, fprop and bprop

• Backpropagation (bprop): starting from the end node, go back through the graph as

  ∂f/∂g = ∑_{g′ ∈ Children(g)} (∂f/∂g′)(∂g′/∂g).
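
To make this rule concrete, here is a hand-written sketch of fprop and bprop on the running example f(x, y) = x^2 + xy + (x + y)^2; the intermediate node names a, b, s, c are our own labels for the graph nodes.

```python
# forward pass: compute the value of every node
x, y = 2.0, 3.0
a = x * x          # x^2
b = x * y          # x*y
s = x + y          # x + y
c = s * s          # (x + y)^2
f = a + b + c      # f(x, y) = 35.0 here

# backward pass: df/dg = sum over nodes g' that use g of (df/dg') * (dg'/dg)
df_df = 1.0
df_da = df_df * 1.0            # f = a + b + c
df_db = df_df * 1.0
df_dc = df_df * 1.0
df_ds = df_dc * 2 * s          # c = s^2
df_dx = df_da * 2 * x + df_db * y + df_ds * 1.0   # x feeds into a, b, and s
df_dy = df_db * x + df_ds * 1.0                   # y feeds into b and s

print(f, df_dx, df_dy)   # 35.0 17.0 12.0, matching 2x + y + 2(x+y) and x + 2(x+y)
```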

Page 44:

Computational graph, fprop and bprop

• Backprop is much better: we get all derivatives in a single pass through the graph.

• Aaaand... that’s it! We can now take the gradients of any complicated composition of simple functions.

• Which is all we need to apply gradient descent!
• The libraries – PyTorch, TensorFlow, Theano – are at their core automatic differentiation libraries. This is their main function (a short sketch follows below).

• So you can implement lots of “classical” models in TensorFlow and train them by gradient descent.

• And real neurons probably can’t do that: each unit would need two different outputs, one for the value and one for the derivative.
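
And a small sketch of the same computation done by an automatic differentiation library (PyTorch here), which builds the graph and runs backprop for us; the input values are the same made-up ones as above.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x**2 + x*y + (x + y)**2   # the computational graph is recorded automatically
f.backward()                  # backpropagation through that graph

print(f.item(), x.grad.item(), y.grad.item())   # 35.0, 17.0, 12.0
```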

Page 45:

Regularization in neural networks

Page 46:

Regularization in neural networks

• NNs have lots of parameters.
• Regularization is necessary.
• L2 or L1 regularization (adding λ ∑_w w^2 or λ ∑_w |w| to the objective) is called weight decay.
• Very easy to add, just another term in the objective function (see the sketch below).
• Sometimes still useful.
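
A minimal PyTorch sketch of weight decay, either written out as an extra λ∑w² term in the objective or delegated to the optimizer; the model, data, and λ are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # placeholder model
criterion = nn.MSELoss()
lam = 1e-4                                        # regularization coefficient (made up)
x, y = torch.randn(32, 10), torch.randn(32, 1)    # made-up batch

# option 1: explicit penalty term lambda * sum of w^2 added to the objective
loss = criterion(model(x), y) + lam * sum((p ** 2).sum() for p in model.parameters())

# option 2: roughly the same effect via the optimizer's weight_decay argument
# (in practice you would pick one of the two, not both)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=lam)

loss.backward()
optimizer.step()
```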

Page 47:

Regularization in neural networks

• Regularization is very easy to add in all libraries. E.g., in Keras:
  • W_regularizer adds a regularizer on the weight matrix of a layer;
  • b_regularizer — on the bias vector;
  • activity_regularizer — on the output vector.
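
The argument names above are from an older Keras version; in current tf.keras they are called kernel_regularizer, bias_regularizer, and activity_regularizer. A short sketch (layer sizes and coefficients are made up):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(100,)),
    keras.layers.Dense(
        64,
        activation="relu",
        kernel_regularizer=keras.regularizers.l2(1e-4),    # on the weight matrix
        bias_regularizer=keras.regularizers.l2(1e-4),      # on the bias vector
        activity_regularizer=keras.regularizers.l1(1e-5),  # on the layer's output
    ),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
```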

Page 48:

Regularization in neural networks

• Second way: early stopping.
• Let’s just stop when the error on the validation set begins to increase!

• Also implemented out of the box in Keras, via callbacks.
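
A minimal tf.keras sketch of early stopping via a callback; the model, the toy data, and the patience value are placeholders.

```python
import numpy as np
from tensorflow import keras

X = np.random.randn(1000, 20).astype("float32")   # made-up data
y = (X[:, 0] > 0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[stop], verbose=0)
```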

Page 49:

Regularization in neural networks

• Third way – max-norm constraint.
• Let’s bound the norm of the vector of weights for every neuron artificially:

  ‖w‖_2 ≤ c.

• We can do it during optimization as well: when w leaves the ball of radius c, project it back (see the sketch below).
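
A minimal NumPy sketch of the projection step; in tf.keras roughly the same effect is available as a kernel constraint (e.g., keras.constraints.MaxNorm). The example vector and the radius c are made up.

```python
import numpy as np

def max_norm_project(w, c):
    """If w has left the L2 ball of radius c, rescale it back onto the boundary."""
    norm = np.linalg.norm(w)
    return w if norm <= c else w * (c / norm)

w = np.array([3.0, 4.0])            # ||w||_2 = 5
w = max_norm_project(w, c=3.0)      # projected back: norm becomes 3
print(w, np.linalg.norm(w))
```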

Page 50:

Dropout

• But there are NN-specific ways too.
• Dropout: let’s just remove some units at random with probability p! (Srivastava et al., 2014)

Page 51:

Dropout

• It turns out we are sampling a huge number of networks, and each neuron gets an “average” activation from a lot of different architectures.

• How do we use a network trained like that? Should we sample a lot of networks and average them out?

Page 52:

Dropout

• To apply, simply multiply the result by 1/p (preserving the average output):

• And you can usually take p = 1/2; there’s seldom any big difference (a sketch follows below).
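
A minimal NumPy sketch of dropout in its “inverted” form, where the surviving activations are rescaled during training and the network is used as-is at test time; keep_prob here is the probability of keeping a unit, so with the slides’ p = 1/2 the conventions coincide.

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(h, keep_prob=0.5, train=True):
    """Inverted dropout: zero units at random and rescale by 1/keep_prob while training."""
    if not train:
        return h                      # at test time the layer is unchanged
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

h = np.ones(10)                       # made-up activations of one layer
print(dropout(h))                     # about half the units zeroed, survivors doubled
print(dropout(h, train=False))        # identical to h
```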

Page 53:

Dropout

• Dropout improves everything (large NNs, convolutional NNs) a lot... but why? WTF?

Page 54:

Dropout

• Idea 1: we are making the units learn features by themselves, without relying on the others.

Page 55:

Dropout

• Similarly, this leads to sparsity:

Page 56:

Dropout

• Idea 2: we are kind of averaging a huge number of networks with shared weights, training each for one step.

• It’s like bootstrapping or boosting taken to the extreme, and we know xgboost rules Kaggle.

Page 57:

Dropout

• Idea 3: this is just like sex!
• How does sexual reproduction work?
• It’s important to get not just a good combination of genes but a stable good combination.

Page 58:

Dropout

• Idea 4: a neuron sends out activation a with probability 0.5.
• Let’s do it the other way around: send out 0.5 (or 1) with probability a (a/2).

• Same expectation, larger variance for small p (which is not so bad).

• And thus we get stochastic neurons that send out signals at random intervals – just like the brain!

• Also brings improvements but needs less communication between the neurons (one bit instead of a float).

• Maybe stochastic neurons work as a dropout regularizer! And maybe that’s why we are able to actually think of something.

Page 59:

Dropout

• Idea 5: dropout is a special form of prior.
• The most rigorous and useful idea; it helped fix dropout in recurrent networks.

• But this requires a Bayesian view of neural networks, so probably not now.

Page 60:

Conclusion

• Thus, dropout is a very cool method.

• But it is now losing popularity (almost lost it altogether) due to batch normalization and new versions of gradient descent.

• We’ll talk about them later, but first let’s tackle weight initialization.

Page 61:

Thank you!

Thank you for your attention!
