
Page 1:

Lecture 5 recap
Prof. Leal-Taixé and Prof. Niessner

Page 2:

Neural Network

[Figure: a neural network, annotated with its depth and width]

Page 3:

Gradient Descent for Neural Networks

[Figure: two-layer network with inputs x_0, x_1, x_2, hidden units h_0, …, h_3, outputs y_0, y_1 and targets t_0, t_1]

y_i = A(b_{1,i} + Σ_j h_j · w_{1,i,j}),    h_j = A(b_{0,j} + Σ_k x_k · w_{0,j,k})

L_i = (y_i − t_i)²

∇_{w,b} f_{x,t}(w) = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n}, …, ∂f/∂b_{l,m}]

The activation can be kept simple, e.g. A(x) = max(0, x) (ReLU)

Page 4:

Stochastic Gradient Descent (SGD)

θ^{k+1} = θ^k − α · ∇_θ L(θ^k, x^{1..m}, y^{1..m})

∇_θ L = (1/m) · Σ_{i=1}^{m} ∇_θ L_i

Note the terminology: iteration vs. epoch
– k now refers to the k-th iteration
– m is the number of training samples in the current minibatch
– ∇_θ L is the gradient for the k-th batch
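To make the update concrete, here is a minimal Python sketch of one SGD iteration (not the lecture's code; grad_fn is an assumed helper that returns the averaged minibatch gradient):

def sgd_step(theta, grad_fn, x_batch, y_batch, alpha=1e-2):
    # grad_fn returns (1/m) * sum_i dL_i/dtheta over the current minibatch
    grad = grad_fn(theta, x_batch, y_batch)
    return theta - alpha * grad          # theta^{k+1} = theta^k - alpha * grad

One epoch then corresponds to running this step once over every minibatch of the training set.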

Page 5:

Gradient Descent with Momentum

v^{k+1} = β · v^k + ∇_θ L(θ^k)
θ^{k+1} = θ^k − α · v^{k+1}

Exponentially-weighted average of the gradients.

Important: the velocity v^k is vector-valued!

Here ∇_θ L(θ^k) is the gradient of the current minibatch, v the velocity, β the accumulation rate ('friction', momentum), α the learning rate, and θ the model parameters.

Page 6:

Gradient Descent with Momentum

θ^{k+1} = θ^k − α · v^{k+1}

The step is largest when a sequence of gradients all point in the same direction.

Fig. credit: I. Goodfellow

Hyperparameters are α and β; β is often set to 0.9.
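A matching sketch of the momentum update, under the same assumptions (illustrative grad_fn, not the lecture's code):

def momentum_step(theta, v, grad_fn, x_batch, y_batch, alpha=1e-2, beta=0.9):
    grad = grad_fn(theta, x_batch, y_batch)
    v = beta * v + grad              # v^{k+1} = beta * v^k + gradient of current minibatch
    theta = theta - alpha * v        # theta^{k+1} = theta^k - alpha * v^{k+1}
    return theta, v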

Page 7:

RMSProp

s^{k+1} = β · s^k + (1 − β) · [∇_θ L ∘ ∇_θ L]
θ^{k+1} = θ^k − α · ∇_θ L / (√(s^{k+1}) + ε)

Hyperparameters: α, β, ε
– β is often 0.9, ε is typically 10^−8
– ∘ denotes element-wise multiplication
– the learning rate α needs tuning!
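A sketch of the per-parameter RMSProp update (assuming the square root in the denominator, as in the standard formulation):

import numpy as np

def rmsprop_step(theta, s, grad, alpha=1e-3, beta=0.9, eps=1e-8):
    s = beta * s + (1.0 - beta) * grad * grad          # running average of squared gradients
    theta = theta - alpha * grad / (np.sqrt(s) + eps)  # per-component step size
    return theta, s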

Page 8:

RMSProp

[Figure: loss landscape with small gradients in the x-direction and large gradients in the y-direction. Fig. credit: A. Ng]

s^{k+1} = β · s^k + (1 − β) · [∇_θ L ∘ ∇_θ L]
θ^{k+1} = θ^k − α · ∇_θ L / (√(s^{k+1}) + ε)

We are dividing by the accumulated squared gradients:
– the division in the y-direction will be large
– the division in the x-direction will be small

s is the (uncentered) variance of the gradients → second moment.

→ we can increase the learning rate!

Page 9:

Adaptive Moment Estimation (Adam)

Combines Momentum and RMSProp:

m^{k+1} = β1 · m^k + (1 − β1) · ∇_θ L(θ^k)
v^{k+1} = β2 · v^k + (1 − β2) · [∇_θ L(θ^k) ∘ ∇_θ L(θ^k)]
θ^{k+1} = θ^k − α · m^{k+1} / (√(v^{k+1}) + ε)

First moment m: mean of the gradients
Second moment v: (uncentered) variance of the gradients

Page 10:

Adam

Combines Momentum and RMSProp:

m^{k+1} = β1 · m^k + (1 − β1) · ∇_θ L(θ^k)
v^{k+1} = β2 · v^k + (1 − β2) · [∇_θ L(θ^k) ∘ ∇_θ L(θ^k)]
θ^{k+1} = θ^k − α · m̂^{k+1} / (√(v̂^{k+1}) + ε)

m^{k+1} and v^{k+1} are initialized with zero → biased towards zero.

Typically, bias-corrected moment updates are used:
m̂^{k+1} = m^{k+1} / (1 − β1^{k+1})
v̂^{k+1} = v^{k+1} / (1 − β2^{k+1})
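Putting the pieces together, a minimal Adam step could look as follows (k counts iterations starting at 1; the default values are the commonly used ones, not prescribed by the slide):

import numpy as np

def adam_step(theta, m, v, grad, k, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad      # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** k)                   # bias correction
    v_hat = v / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v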

Page 11:

Convergence

Page 12:

Convergence

Page 13:

Importance of Learning Rate

Page 14:

Jacobian and Hessian

• Derivative
• Gradient
• Jacobian
• Hessian ← second derivative

Page 15:

Newton's method

• Approximate our function by a second-order Taylor series expansion:

f(x) ≈ f(x^k) + ∇f(x^k)^T (x − x^k) + ½ (x − x^k)^T H_f(x^k) (x − x^k)

(the ∇f term is the first derivative, the Hessian term H_f the second derivative / curvature)

https://en.wikipedia.org/wiki/Taylor_series

Page 16:

Newton's method

• Differentiate and equate to zero

Update step:  x^{k+1} = x^k − H_f(x^k)^{-1} · ∇f(x^k)
SGD:          x^{k+1} = x^k − α · ∇f(x^k)

We got rid of the learning rate!
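As a minimal sketch (grad_fn and hess_fn are assumed helpers returning ∇f and H_f at x), one Newton iteration solves a linear system with the Hessian instead of taking a learning-rate-scaled step:

import numpy as np

def newton_step(x, grad_fn, hess_fn):
    g = grad_fn(x)
    H = hess_fn(x)
    delta = np.linalg.solve(H, g)    # solve H * delta = g rather than forming H^{-1}
    return x - delta                 # x^{k+1} = x^k - H^{-1} * g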

Page 17:

Newton's method

• Differentiate and equate to zero

Update step:  x^{k+1} = x^k − H_f(x^k)^{-1} · ∇f(x^k)

Problem of scale: a network has millions of parameters n, so the Hessian has n² elements, and the computational complexity of the 'inversion' is roughly cubic in n per iteration.

Page 18:

Newton's method

• SGD (green)
• Newton's method exploits the curvature to take a more direct route

Image from Wikipedia

Page 19:

Newton's method

Can you apply Newton's method for linear regression? What do you get as a result?

Page 20:

BFGS and L-BFGS

• Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm
• Belongs to the family of quasi-Newton methods
• Maintains an approximation of the inverse of the Hessian
• Limited-memory variant: L-BFGS

Page 21:

Gauss-Newton

• Newton: x^{k+1} = x^k − H_f(x^k)^{-1} · ∇f(x^k)
  – 'true' 2nd derivatives are often hard to obtain (e.g., numerics)
  – for a least-squares objective f(x) = ||F(x)||², approximate H_f ≈ 2 · J_F^T J_F

• Gauss-Newton (GN): x^{k+1} = x^k − [2 · J_F(x^k)^T J_F(x^k)]^{-1} · ∇f(x^k)

• Solve a linear system (again, explicitly inverting a matrix is unstable):
  2 · J_F(x^k)^T J_F(x^k) · (x^k − x^{k+1}) = ∇f(x^k)
  → solve for the delta vector x^k − x^{k+1}
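A sketch of one Gauss-Newton step for f(x) = ||F(x)||² (F and J are assumed callables returning the residual vector and its Jacobian):

import numpy as np

def gauss_newton_step(x, F, J):
    r = F(x)                              # residual F(x^k)
    Jx = J(x)                             # Jacobian of F at x^k
    grad = 2.0 * Jx.T @ r                 # gradient of ||F(x)||^2
    A = 2.0 * Jx.T @ Jx                   # Gauss-Newton approximation of the Hessian
    delta = np.linalg.solve(A, grad)      # A * (x^k - x^{k+1}) = grad
    return x - delta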

Page 22:

Levenberg

• 'Damped' version of Gauss-Newton:
  [J_F(x^k)^T J_F(x^k) + λ · I] · (x^k − x^{k+1}) = ∇f(x^k)
  (the λ · I term acts as a Tikhonov regularization)

• The damping factor λ is adjusted in each iteration to ensure f(x^k) > f(x^{k+1})
  – if the inequality is not fulfilled, increase λ
  – trust-region behavior

• 'Interpolation' between Gauss-Newton (small λ) and Gradient Descent (large λ)

Page 23:

Levenberg-Marquardt

• Levenberg-Marquardt (LM):
  [J_F(x^k)^T J_F(x^k) + λ · diag(J_F(x^k)^T J_F(x^k))] · (x^k − x^{k+1}) = ∇f(x^k)

  – Instead of a plain Gradient Descent for large λ, scale each component of the gradient according to the curvature
    • Avoids slow convergence in components with a small gradient
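A sketch of one LM step in this notation (the λ adjustment loop is omitted; F, J and lam are assumed inputs as above):

import numpy as np

def lm_step(x, F, J, lam):
    r = F(x)
    Jx = J(x)
    JTJ = Jx.T @ Jx
    grad = Jx.T @ r                          # gradient of 0.5 * ||F(x)||^2
    A = JTJ + lam * np.diag(np.diag(JTJ))    # damping scaled by the diagonal curvature
    delta = np.linalg.solve(A, grad)
    return x - delta

In a full implementation one would accept the step only if f decreases and otherwise increase λ, as described on the previous slide.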

Page 24:

Which, what and when?

• Standard choice: Adam
• Fallback option: SGD with momentum
• Newton, L-BFGS, GN, LM only if you can do full-batch updates (they don't work well with minibatches!)
  – This practically never happens in DL; theoretically it would be nice though, due to the fast convergence.

Page 25:

General Optimization

• Linear systems (Ax = b)
  – LU, QR, Cholesky, Jacobi, Gauss-Seidel, CG, PCG, etc.

• Non-linear (gradient-based)
  – Newton, Gauss-Newton, LM, (L-)BFGS ← second order
  – Gradient Descent, SGD ← first order

• Others:
  – Genetic algorithms, MCMC, Metropolis-Hastings, etc.
  – Constrained and convex solvers (Lagrange, ADMM, Primal-Dual, etc.)

Page 26:

Please Remember!

• Think about your problem and the optimization at hand
• SGD is specifically designed for minibatches
• When you can, use a 2nd-order method → it's just faster
• GD or SGD is not a way to solve a linear system!

Page 27:

Importance of Learning Rate

Page 28:

Learning Rate

Need a high learning rate when far away from the optimum; need a low learning rate when close to it.

Page 29:

Learning Rate Decay

• α = α_0 / (1 + decay_rate · epoch)
  – E.g., α_0 = 0.1, decay_rate = 1.0
    → Epoch 0: 0.1
    → Epoch 1: 0.05
    → Epoch 2: 0.033
    → Epoch 3: 0.025
    ...

[Plot: learning rate over epochs for this schedule, decaying from 0.1 towards 0]
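A short sketch reproducing the example values above:

alpha_0, decay_rate = 0.1, 1.0
for epoch in range(4):
    alpha = alpha_0 / (1.0 + decay_rate * epoch)
    print(epoch, round(alpha, 3))    # 0.1, 0.05, 0.033, 0.025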

Page 30:

Learning Rate Decay

Many options:
• Step decay: α = α − t · α (applied only every n steps/epochs)
  – t is the decay rate (often 0.5)
• Exponential decay: α = t^epoch · α_0
  – t is the decay rate (t < 1.0)
• α = (t / √epoch) · α_0
  – t is the decay rate
• Etc.
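The same schedules as small helper functions (a sketch; the √epoch form of the third schedule is an assumption about the original slide's formula):

def step_decay(alpha, t=0.5):                 # apply only every n epochs
    return alpha - t * alpha

def exp_decay(alpha_0, t, epoch):             # t < 1.0
    return (t ** epoch) * alpha_0

def inv_sqrt_decay(alpha_0, t, epoch):        # assumed sqrt form
    return alpha_0 if epoch == 0 else t / (epoch ** 0.5) * alpha_0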

Page 31:

Training Schedule

Manually specify the learning rate for the entire training process:
• Manually set the learning rate every n epochs
• How?
  – Trial and error (the hard way)
  – Some experience (only generalizes to some degree)

Consider: #epochs, training set size, network size, etc.

Page 32:

Learning Rate: Implications

• What if too high?

• What if too low?

Page 33:

Training

• Given a dataset with ground-truth labels
  – {x_i, y_i}
    • For instance, x_i is the i-th training image, with label y_i
    • Often dim(x) ≫ dim(y) (e.g., for classification)
    • i is often in the hundreds of thousands or millions
  – Take the network f and its parameters w, b
  – Use SGD (or a variation) to find the optimal parameters w, b
    • Gradients come from backprop

Page 34:

Learning

• Learning means generalization to an unknown dataset
  – (so far, no 'real' learning)
  – I.e., train on a known dataset → test with the optimized parameters on an unknown dataset

• Basically, we hope that, based on the train set, the optimized parameters will give similar results on different data (i.e., test data)

Page 35:

Learning

• Training set ('train'):
  – Use for training your neural network

• Validation set ('val'):
  – Hyperparameter optimization
  – Check generalization progress

• Test set ('test'):
  – Only for the very end
  – NEVER TOUCH DURING DEVELOPMENT OR TRAINING

Page 36:

Learning

• Typical splits
  – Train (60%), Val (20%), Test (20%)
  – Train (80%), Val (10%), Test (10%)

• During training:
  – The train error comes from the average mini-batch error
  – Typically, evaluate on a subset of the validation set every n iterations
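A minimal sketch of the 60/20/20 split on shuffled indices (the function name and seed handling are illustrative):

import numpy as np

def split_indices(n, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]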


Page 37:

Learning

• Training graph
  – Accuracy
  – Loss
  (curves plotted with EMA smoothing)

Page 38:

Learning

• Validation graph

Page 39:

Over- and Underfitting

[Figure panels: Underfitted | Appropriate | Overfitted]

Figure extracted from Deep Learning by Adam Gibson, Josh Patterson, O'Reilly Media Inc., 2017

Page 40:

Over- and Underfitting

Source: http://srdas.github.io/DLBook/ImprovingModelGeneralization.html

Page 41:

Hyperparameters

• Network architecture (e.g., number of layers, #weights)
• Number of iterations
• Learning rate(s) (i.e., solver parameters, decay, etc.)
• Regularization (more in the next lecture)
• Batch size
• …
• Overall: learning setup + optimization = hyperparameters

Page 42:

Hyperparameter Tuning

• Methods:
  – Manual search: most common
  – Grid search (structured, for 'real' applications):
    Define ranges for all parameter spaces and select points (usually pseudo-uniformly distributed). Iterate over all possible configurations.
  – Random search:
    Like grid search, but pick points at random within the predefined ranges.

Page 43:

Simple Grid Search Example

learning_rates = [1e-2, 1e-3, 1e-4, 1e-5]
regularization_strengths = [1e2, 1e3, 1e4, 1e5]
num_iters = [500, 1000, 1500]
best_val = 0

for learning_rate in learning_rates:
    for reg in regularization_strengths:
        for iterations in num_iters:
            model = train_model(learning_rate, reg, iterations)
            validation_accuracy = evaluate(model)
            if validation_accuracy > best_val:
                best_val = validation_accuracy
                best_model = model
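For comparison, a random-search variant of the same loop (a sketch; the number of trials and the sampling ranges are illustrative, and train_model/evaluate are the same placeholders as above):

import random

best_val, best_model = 0, None
for _ in range(20):                               # number of random trials
    learning_rate = 10 ** random.uniform(-5, -2)  # sample exponents, not raw values
    reg = 10 ** random.uniform(2, 5)
    iterations = random.choice([500, 1000, 1500])
    model = train_model(learning_rate, reg, iterations)
    validation_accuracy = evaluate(model)
    if validation_accuracy > best_val:
        best_val, best_model = validation_accuracy, model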

Page 44:

Cross Validation

• Example: k = 5

Figure extracted from cs231n

Page 45:

Cross Validation

• Used when the data set is extremely small and/or our method of choice has low training times

• Partition the data into k subsets, train on k−1 of them, and evaluate performance on the remaining subset

• To reduce variability: perform this over the different partitions and average the results
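A minimal k-fold sketch of this procedure (train_model and evaluate are illustrative placeholders):

import numpy as np

def cross_validate(X, y, train_model, evaluate, k=5):
    folds_X = np.array_split(X, k)
    folds_y = np.array_split(y, k)
    scores = []
    for i in range(k):
        X_val, y_val = folds_X[i], folds_y[i]
        X_train = np.concatenate(folds_X[:i] + folds_X[i + 1:])
        y_train = np.concatenate(folds_y[:i] + folds_y[i + 1:])
        model = train_model(X_train, y_train)          # train on k-1 subsets
        scores.append(evaluate(model, X_val, y_val))   # evaluate on the held-out subset
    return np.mean(scores)                             # average over the k partitions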

Page 46:

Cross Validation

[Plot: cross-validation results for k = 5 – accuracy vs. hyperparameter value]

Figure extracted from cs231n

Page 47:

Basic recipe for machine learning

Page 48:

Basic recipe for machine learning

• Split your data

[Figure: split bar – train (60%), validation (20%), test (20%); use the validation split to find your hyperparameters]

Page 49:

Basic recipe for machine learning

• Split your data: train (60%), validation (20%), test (20%)

Human-level error ........ 1%
Training set error ....... 5%   ← gap to human level: bias (underfitting)
Val/test set error ....... 8%   ← gap to training error: variance (overfitting)

Page 50:

Basic recipe for machine learning

Credits: A. Ng

Page 51:

Next lecture

• Monday: Deadline Exercise 1!

• Next Tuesday:
  – Discussion of the exercise solution and presentation of Exercise 2

• Next lecture on Dec 6th:
  – Training Neural Networks

Page 52:

See you next week!