
Learning curve in modern ML
The “double descent” behavior

Oren Yuval

Department of Statistics and Operations Research

Tel-Aviv University

December 23, 2019


Outline

1 Background

2 Main findings

3 Real data simulations

4 Theoretical analysis for Least-Squares

5 Summary

Background

The classical learning curve

We all know the Bias-Variance Trade-Off:

The common knowledge: very low training error → very high variance

One may think of some criteria for finding the optimal model


Background

Are NNs immune to variance?

On the other hand, very rich models such as NNs are trained to exactly fit the training data, and yet they obtain high accuracy on test data

How can we reconcile the modern practice with the classical bias-variance trade-off?


Background

A ”jamming transition”

In mid-2018, Spigler et al. argued that in fully-connected networks a phase transition delimits the over- and under-parametrized regimes, where fitting can or cannot be achieved

Background

Reconciling the discrepancy

Just one year ago, Belkin et al. first analyzed the behavior of rich models around the interpolation point

Since then, many authors have published results that justify this innovative approach

Hastie, Rosset, and Tibshirani provided a precise quantitative explanation for the potential benefits of over-parametrization in linear regression


Background

Empirical risk minimization

Given a training sample $(x_1, y_1), \dots, (x_n, y_n)$, where $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, we learn a predictor $h_n : \mathbb{R}^d \to \mathbb{R}$.

In ERM, the predictor is taken to be

$$h_n = \arg\min_{h \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} l(y_i, h(x_i)) \right\}$$

Empirical risk (train error): $\frac{1}{n} \sum_{i=1}^{n} l(y_i, h_n(x_i))$

Interpolation: $l(y_i, h_n(x_i)) = 0 \ \ \forall i$

True risk (test error): $\mathbb{E}_{x,y}\left[l(y, h_n(x))\right]$

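To make these definitions concrete, here is a minimal numerical sketch (not part of the original slides), assuming squared loss and the linear hypothesis class $\mathcal{H} = \{x \mapsto x^T b\}$; the helper names `sq_loss` and `empirical_risk` are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_loss(y, yhat):
    return (y - yhat) ** 2

def empirical_risk(h, X, y):
    # (1/n) * sum_i l(y_i, h(x_i))
    return np.mean(sq_loss(y, h(X)))

# toy data: y = x^T beta + noise
n, d = 50, 5
beta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ beta + 0.5 * rng.normal(size=n)

# ERM over the linear class H = {x -> x^T b}: ordinary least squares
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
h_n = lambda Z: Z @ b_hat

print("empirical risk (train):", empirical_risk(h_n, X, y))

# Monte Carlo stand-in for the true risk E_{x,y}[ l(y, h_n(x)) ]
X_test = rng.normal(size=(10_000, d))
y_test = X_test @ beta + 0.5 * rng.normal(size=10_000)
print("approximate true risk :", empirical_risk(h_n, X_test, y_test))
```

The held-out average is a Monte Carlo stand-in for the true risk $\mathbb{E}_{x,y}[l(y, h_n(x))]$, which is the quantity the risk curves below plot.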

Background

Controlling H

Conventional wisdom in machine learning suggests controlling the capacity of $\mathcal{H}$:

$\mathcal{H}$ too small → under-fitting (large empirical and true risk)

$\mathcal{H}$ too large → over-fitting (small empirical risk but large true risk)

Example (OLS)

$$\hat\beta_n = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left(y_i - x_i^T \beta\right)^2 \right\}$$

True risk $\propto \frac{p}{n-p}$

Yet, best practice in DL: the network should be large enough to permit effortless zero train loss

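As a quick illustration of the OLS example (a simulation sketch, not from the slides, assuming isotropic Gaussian features and a pure-noise target so that all excess risk is variance), the out-of-sample risk indeed blows up like $p/(n-p)$ as $p$ approaches $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 100, 1.0

for p in (10, 50, 80, 90):
    excess = []
    for _ in range(200):
        X = rng.normal(size=(n, p))
        beta = np.zeros(p)                              # pure-noise target: all excess risk is variance
        y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
        b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # out-of-sample excess risk for isotropic x0 equals ||b_hat - beta||^2
        excess.append(np.sum((b_hat - beta) ** 2))
    # for Gaussian designs the exact expectation is sigma^2 * p/(n-p-1), i.e. the p/(n-p) rate
    print(f"p={p:3d}  simulated={np.mean(excess):7.3f}  sigma^2*p/(n-p-1)={sigma2 * p / (n - p - 1):7.3f}")
```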

Main findings

The “double descent” risk curve

The main finding of Belkin’s work is summarized in the “double descent” risk curve:

This is demonstrated on important model classes, including neural networks, and a range of real data sets

The capacity of $\mathcal{H}$ is identified with the number of parameters needed to specify the function $h_n$


Main findings

Fitting beyond the interpolation point

When zero train error can be achieved, we choose $h_n$ as follows:

$$h_n = \arg\min_{h \in \mathcal{H}} \left\{ \|h\| \ \ \text{s.t.} \ \ \frac{1}{n} \sum_{i=1}^{n} l(y_i, h(x_i)) = 0 \right\}$$

Looking for the simplest/smoothest function that explains the data

By increasing the capacity of $\mathcal{H}$, we are able to find interpolating functions that have smaller norm

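A small sketch of the last point, under the illustrative assumption of i.i.d. standard normal features: once $p \ge n$ an interpolating linear predictor exists, and the minimum-norm one shrinks in norm as the feature dimension grows:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
y = rng.normal(size=n)                        # fixed targets to interpolate

for p in (50, 100, 400, 2000):
    X = rng.normal(size=(n, p))               # richer class = more features
    b = np.linalg.pinv(X) @ y                 # minimum l2-norm solution of X b = y
    assert np.allclose(X @ b, y, atol=1e-6)   # exact interpolation (p >= n)
    print(f"p={p:5d}  ||b||_2 = {np.linalg.norm(b):.3f}")
```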

Main findings

Why is the “double descent” important?

Stating that interpolation does not necessarily lead to poor generalization, as long as you are “deep” enough into the interpolation regime

Reconciling the modern practice with a statistical point-of-view

Explicit analysis for Linear Models

The true risk in the over-parameterized regime is typically lower!


Real data simulations

Neural networks

One hidden layer with $N$ random features

Minimizing the squared loss, or $\|a\|_2$ when $N \ge n$

$$\tilde{x}_k = \phi(x; v_k) = \phi(\langle x, v_k \rangle)\,, \quad v_k \sim \mathcal{N}(0, I_p)$$
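Here is a hedged sketch of this random-features setup on synthetic data (the slides use MNIST and leave $\phi$ unspecified; a ReLU nonlinearity and a tanh regression target are assumed here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, n_test, sigma = 20, 100, 2000, 0.3

# synthetic nonlinear target standing in for the real data (an assumption of this sketch)
w_true = rng.normal(size=d)
f = lambda Z: np.tanh(Z @ w_true)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y = f(X) + sigma * rng.normal(size=n)
yt = f(Xt)

def random_features(Z, V):
    # x_tilde_k = phi(<x, v_k>); phi = ReLU here (the slides leave phi generic)
    return np.maximum(Z @ V, 0.0)

for N in (10, 50, 90, 100, 110, 500, 2000):
    V = rng.normal(size=(d, N))                      # v_k ~ N(0, I)
    F, Ft = random_features(X, V), random_features(Xt, V)
    # least-squares output weights; for N >= n this is the minimum-norm interpolant
    a = np.linalg.pinv(F) @ y
    print(f"N={N:5d}  train MSE={np.mean((F @ a - y) ** 2):8.4f}  "
          f"test MSE={np.mean((Ft @ a - yt) ** 2):8.4f}  ||a||={np.linalg.norm(a):8.3f}")
```

With a least-squares / minimum-norm fit of the output weights $a$, the test error in sketches like this typically spikes near $N \approx n$ and falls again as $N$ grows, while $\|a\|$ keeps decreasing, matching the MNIST risk curves discussed next.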

Real data simulations

Neural networks

Risk curve for the RFF model on MNIST

Near interpolation, the parameters are “forced” to fit the training data

Increasing $N$ results in a decreasing $\ell_2$ norm of the predictors

Real data simulations

Neural networks

Fully connected two-layer network with $H$ hidden units

Optimizing the weights using SGD with up to $6 \cdot 10^3$ iterations:

Interpolation is not assured even in the over-parameterized regime

Automatically prefers a minimal-norm solution

Sub-optimal behavior can lead to high variability in both the training and test risks, which masks the double descent curve

Real data simulations

Neural networks

Risk curves for a two-layer fully connected NN on MNIST

Train risk may increase with an increasing number of parameters

Real data simulations

Decision trees

It was shown by Wyner et al. (2017) that AdaBoost and Random Forests perform better with large (interpolating) decision trees and are more robust to noise in the training data

They questioned the conventional wisdom that boosting algorithms for classification require regularization/early stopping/low complexity

The effect of a noise point on a classifier: interpolating vs. non-interpolating

Real data simulations

Decision trees

Risk curves for Random Forests on MNIST

The complexity is controlled by the size of a decision tree and the number of trees

Averaging of interpolating trees yields a substantially smoother function

Real data simulations

Thinking about it...

The peak at the interpolation threshold is observed within a narrow range of parameters; sampling the parameter space outside that range may lead to misleading (but conventional) conclusions

Understanding the “double descent” behavior is important for practitioners choosing between models for optimal performance


Theoretical analysis for Least-Squares

Why Least-Squares?

Easy to analyze

Easy to explain

Easy to simulate

Any non-linear model can be approximated by a linear one with a large number of random features

$$\mathbb{E}[y \mid z] = f(z; \theta) \approx \nabla_\theta f(z; \theta_0)^T \beta$$

There is a well-known connection between gradient descent and the minimum-norm Least-Squares solution


Theoretical analysis for Least-Squares

The Linear model

$$y_i = x_i^T \beta + \varepsilon_i \ ; \quad i = 1, \dots, n$$

$x_i \in \mathbb{R}^p$, $\mathbb{E}[x_i] = 0$, $\mathrm{Cov}(x_i) = \Sigma$

$\mathbb{E}[\varepsilon_i] = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$

We consider an asymptotic setup where $n, p \to \infty$ and $p/n \to \gamma \in (0, \infty)$

We also assume that $\|\beta\|_2^2 = r^2$, a constant “signal”

Some assumptions on the distribution of $x$ may be made:

$x \sim \mathcal{N}(0, \Sigma)$

$x = \Sigma^{1/2} z$, with $z_j$ i.i.d., mean 0 and variance 1

Isotropic features: $\Sigma = I_p$

$x = \phi(Wz)$, where $W \in \mathbb{R}^{p \times d}$ is a random matrix with i.i.d. entries


Theoretical analysis for Least-Squares

The Linear model

Assuming that the model is well-specified, the out-of-sample prediction risk is:

$$R(\hat\beta; \beta) = \mathbb{E}_{X,Y,x_0}\left(x_0^T \hat\beta - x_0^T \beta\right)^2$$

Assuming isotropic features, we can decompose the risk into bias and variance terms:

$$R(\hat\beta; \beta) = \mathbb{E}_X\left[\|\mathbb{E}[\hat\beta \mid X] - \beta\|_2^2\right] + \mathbb{E}_X\left[\mathrm{tr}[\mathrm{Cov}(\hat\beta \mid X)]\right] := B(\hat\beta; \beta) + V(\hat\beta)$$

Taking $\hat\beta = (X^T X)^{-1} X^T Y$, we get:

$$\mathbb{E}[\hat\beta \mid X] = \beta \ ; \quad \mathrm{Cov}(\hat\beta \mid X) = \sigma^2 (X^T X)^{-1} \to \frac{\sigma^2}{n - p} I_p$$

and therefore $R(\hat\beta; \beta) \to \frac{\sigma^2 \gamma}{1 - \gamma}$

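A small Monte Carlo check of this limit (an illustration, not from the slides, assuming Gaussian isotropic features and $\sigma^2 = 1$):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, n = 1.0, 400

for gamma in (0.2, 0.5, 0.8):
    p = int(gamma * n)
    risks = []
    for _ in range(100):
        X = rng.normal(size=(n, p))                   # isotropic features, Sigma = I_p
        beta = rng.normal(size=p)
        beta /= np.linalg.norm(beta)                  # r^2 = 1 (irrelevant here: OLS is unbiased)
        y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
        b_hat = np.linalg.solve(X.T @ X, X.T @ y)
        risks.append(np.sum((b_hat - beta) ** 2))     # = (b_hat-beta)^T Sigma (b_hat-beta) for Sigma = I
    print(f"gamma={gamma:.1f}  simulated={np.mean(risks):.3f}  "
          f"sigma^2*gamma/(1-gamma)={sigma2 * gamma / (1 - gamma):.3f}")
```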

Theoretical analysis for Least-Squares

High dimensional Least-Squares

When $\gamma > 1$, the empirical risk $\|Y - X\beta\|_2^2$ can be driven to zero, and we look for the minimum $\ell_2$-norm estimator:

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|\beta\|_2 \ \ \text{s.t.} \ \ \|Y - X\beta\|_2^2 = 0 \right\}$$

Solving with Lagrange multipliers:

$$\arg\min_{\beta, \lambda} \left\{ \beta^T \beta + \lambda^T (Y - X\beta) \right\}$$

We get:

$$\hat\beta = X^T \hat\lambda \ ; \quad \hat\lambda = (XX^T)^{-1} Y \implies \hat\beta = X^T (XX^T)^{-1} Y$$

One can write: $\hat\beta = (X^T X)^{+} X^T Y$

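A short numerical sketch (again an illustration rather than part of the original analysis) checking that the Lagrange-multiplier form interpolates, agrees with the pseudoinverse form, and has the smallest norm among interpolators:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 40, 120                                  # gamma = p/n = 3 > 1
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# minimum l2-norm interpolator, written as on the slide
beta_min_norm = X.T @ np.linalg.solve(X @ X.T, Y)
# equivalent pseudoinverse form (X^T X)^+ X^T Y
beta_pinv = np.linalg.pinv(X.T @ X) @ X.T @ Y

print("interpolates   :", np.allclose(X @ beta_min_norm, Y))
print("two forms agree:", np.allclose(beta_min_norm, beta_pinv))

# any other interpolator has a larger norm, e.g. min-norm plus a null-space direction
null_dir = np.eye(p)[:, 0] - X.T @ np.linalg.solve(X @ X.T, X[:, 0])
other = beta_min_norm + null_dir
print("still interpolates:", np.allclose(X @ other, Y),
      " larger norm:", np.linalg.norm(other) > np.linalg.norm(beta_min_norm))
```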

Theoretical analysis for Least-Squares

Computing the bias term

Now we have:

$$\mathbb{E}[\hat\beta \mid X] = X^T (XX^T)^{-1} X \beta \neq \beta$$

and therefore the bias term is:

$$B(\hat\beta; \beta) = \mathbb{E}_X\left[\|\mathbb{E}[\hat\beta \mid X] - \beta\|_2^2\right] = \beta^T \left(I_p - \mathbb{E}_X[X^T (XX^T)^{-1} X]\right) \beta$$

We can show that $\mathbb{E}_X[X^T (XX^T)^{-1} X] \to \frac{n}{p} I_p$, and therefore:

$$B(\hat\beta; \beta) = \beta^T \beta - \frac{n}{p} \beta^T \beta = r^2 \left(1 - \frac{1}{\gamma}\right)$$

Note that:

$$\mathbb{E}_X\left[\|\mathbb{E}[\hat\beta \mid X]\|_2^2\right] \to \frac{n}{p} r^2 = \frac{r^2}{\gamma}$$


Theoretical analysis for Least-Squares

Computing the variance term

Recall that $V(\hat\beta) = \mathbb{E}_X\left[\mathrm{tr}[\mathrm{Cov}(\hat\beta \mid X)]\right]$. Now we have:

$$\mathrm{Cov}(\hat\beta \mid X) = \sigma^2 X^T (XX^T)^{-2} X \ , \quad \mathrm{tr}[\mathrm{Cov}(\hat\beta \mid X)] = \sigma^2 \, \mathrm{tr}[(XX^T)^{-1}] \to \sigma^2 \frac{n}{p - n}$$

and therefore:

$$V(\hat\beta) \to \frac{\sigma^2}{\gamma - 1}$$

Theoretical analysis for Least-Squares

Limiting Risk

For the asymptotic setting, combining the bias and variance terms above gives:

$$R(\hat\beta; \beta) \to \begin{cases} \dfrac{\sigma^2 \gamma}{1 - \gamma}, & \gamma < 1 \\[4pt] r^2\left(1 - \dfrac{1}{\gamma}\right) + \dfrac{\sigma^2}{\gamma - 1}, & \gamma > 1 \end{cases}$$

A new bias-variance trade-off in the over-parameterized regime

The behavior can be controlled by the $\mathrm{SNR} = r^2/\sigma^2$

If $\mathrm{SNR} > 1$, there is a local minimum at $\gamma = \frac{\sqrt{\mathrm{SNR}}}{\sqrt{\mathrm{SNR}} - 1}$

As $\gamma \to \infty$, the estimator $\hat\beta$ converges to the null estimator $\tilde\beta = 0$, and the total risk is $r^2$
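A quick sketch (not from the slides) that evaluates this limiting risk and checks the location of the local minimum against $\sqrt{\mathrm{SNR}}/(\sqrt{\mathrm{SNR}} - 1)$:

```python
import numpy as np

def limiting_risk(gamma, r2, sigma2):
    # limiting risk of min-norm least squares under isotropic features (formula above)
    gamma = np.asarray(gamma, dtype=float)
    under = sigma2 * gamma / (1 - gamma)                  # gamma < 1
    over = r2 * (1 - 1 / gamma) + sigma2 / (gamma - 1)    # gamma > 1
    return np.where(gamma < 1, under, over)

r2, sigma2 = 5.0, 1.0                                     # SNR = 5 > 1
snr = r2 / sigma2
gamma_star = np.sqrt(snr) / (np.sqrt(snr) - 1)            # predicted local minimum

gammas = np.linspace(1.05, 20, 2000)
risks = limiting_risk(gammas, r2, sigma2)
print("argmin over gamma > 1:", gammas[np.argmin(risks)])
print("predicted gamma*     :", gamma_star)
print("risk at gamma*:", float(limiting_risk(gamma_star, r2, sigma2)), " null risk r^2:", r2)
```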

Theoretical analysis for Least-Squares

Empirical results

Figure: $\sigma^2 = 1$, $r^2$ varies from 1 to 5

Theoretical analysis for Least-Squares

Misspecified model

$$y_i = x_i^T \beta + \omega_i^T \theta + \varepsilon_i \ ; \quad i = 1, \dots, n$$

$x_i \in \mathbb{R}^p$, $\omega_i \in \mathbb{R}^d$, $\mathbb{E}[(x_i, \omega_i)] = 0$, $\mathbb{E}[\varepsilon_i] = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$

For simplicity, we assume that $\mathrm{Cov}((x_i, \omega_i)) = I_{p+d}$

In this case we can write:

$$y_i = x_i^T \beta + \delta_i \ ; \quad i = 1, \dots, n \ , \qquad \mathbb{E}[\delta_i] = 0, \ \mathrm{Var}(\delta_i) = \sigma^2 + \|\theta\|_2^2$$

We also assume that $\|\beta\|_2^2 + \|\theta\|_2^2 = r^2$, a constant “signal”


Theoretical analysis for Least-Squares

Misspecified model

The risk is:

$$R(\hat\beta; \beta, \theta) = \mathbb{E}_X\left[\|\mathbb{E}[\hat\beta \mid X] - \beta\|_2^2\right] + \mathbb{E}_X\left[\mathrm{tr}[\mathrm{Cov}(\hat\beta \mid X)]\right] + \|\theta\|_2^2 := B(\hat\beta; \beta) + V(\hat\beta; \theta) + M(\beta, \theta)$$

For the variance term we have:

$V(\hat\beta; \theta) = (\sigma^2 + \|\theta\|_2^2)\frac{\gamma}{1 - \gamma}$, for $\gamma < 1$

$V(\hat\beta; \theta) = (\sigma^2 + \|\theta\|_2^2)\frac{1}{\gamma - 1}$, for $\gamma > 1$


Theoretical analysis for Least-Squares

Misspecified model

The total risk for $\gamma < 1$:

$$\|\theta\|_2^2 + (\sigma^2 + \|\theta\|_2^2)\frac{\gamma}{1 - \gamma}$$

The total risk for $\gamma > 1$:

$$\|\theta\|_2^2 + \|\beta\|_2^2\left(1 - \frac{1}{\gamma}\right) + (\sigma^2 + \|\theta\|_2^2)\frac{1}{\gamma - 1}$$

What is a natural connection between $\gamma$ and $\|\theta\|_2^2$?

How is the signal distributed over $\gamma$?


Theoretical analysis for Least-Squares

Polynomial decay of the signal

We now assume that:

$$\|\theta\|_2^2 = r^2(1+\gamma)^{-a} \ , \qquad \|\beta\|_2^2 = r^2\left(1 - (1+\gamma)^{-a}\right)$$

for some $a > 0$

Figure: $\|\theta\|_2^2 = (1+\gamma)^{-a}$, $a \in \{0.5, 1, 2, 5\}$

Theoretical analysis for Least-Squares

Polynomial decay of the signal

Substituting this decay into the total-risk expressions above gives the limiting risk $R_a(\gamma)$:

$$R_a(\gamma) = \begin{cases} r^2(1+\gamma)^{-a} + \left(\sigma^2 + r^2(1+\gamma)^{-a}\right)\dfrac{\gamma}{1 - \gamma}, & \gamma < 1 \\[4pt] r^2(1+\gamma)^{-a} + r^2\left(1 - (1+\gamma)^{-a}\right)\left(1 - \dfrac{1}{\gamma}\right) + \left(\sigma^2 + r^2(1+\gamma)^{-a}\right)\dfrac{1}{\gamma - 1}, & \gamma > 1 \end{cases}$$

We can see that $R_a(\gamma = 0) = R_a(\gamma = \infty) = r^2$ (the null risk)

For $a \le 1 + \frac{1}{\mathrm{SNR}}$, $R_a(\gamma)$ is a monotonically increasing function in the under-parameterized regime

If $\mathrm{SNR} \le 1$, the risk in the over-parameterized regime is always worse than the null risk

If $\mathrm{SNR} > 1$, there is a local minimum in the over-parameterized regime, and it is global for small enough $a$
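A short sketch evaluating $R_a(\gamma)$ as written above (an illustration; the endpoints should approach the null risk $r^2$):

```python
import numpy as np

def risk_poly_decay(gamma, a, r2, sigma2):
    # limiting risk R_a(gamma) when ||theta||^2 = r^2 (1 + gamma)^(-a)
    gamma = np.asarray(gamma, dtype=float)
    theta2 = r2 * (1 + gamma) ** (-a)          # unobserved signal
    beta2 = r2 - theta2                        # observed signal
    noise = sigma2 + theta2                    # effective noise level
    under = theta2 + noise * gamma / (1 - gamma)
    over = theta2 + beta2 * (1 - 1 / gamma) + noise / (gamma - 1)
    return np.where(gamma < 1, under, over)

r2, sigma2, a = 5.0, 1.0, 2.0                  # SNR = 5
for g in (1e-6, 0.5, 2.0, 5.0, 1e6):
    print(f"gamma={g:>8g}  R_a={float(risk_poly_decay(g, a, r2, sigma2)):.3f}")
# the endpoints approach the null risk r^2 = 5, consistent with R_a(0) = R_a(inf) = r^2
```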

Theoretical analysis for Least-Squares

Empirical results

The “double descent” behavior is achieved...

Theoretical analysis for Least-Squares

Thinking about it...

Yet, we did not see the same behavior as in the NN simulations...

The reason may be the distribution of the signal over the parameter space

What if the majority of the signal is located within some range in the over-parameterized regime?

Figure: $\|\theta\|_2^2 = g(\gamma)$


Theoretical analysis for Least-Squares

Model evaluation

For the task of model evaluation and selection we may use the leave-one-out cross-validation estimator (CV for short):

$$CV_n = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}_n^{-i}(x_i)\right)^2$$

We may also want to use the “shortcut formula”:

$$CV_n = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - \hat{f}_n(x_i)}{1 - S_{ii}}\right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - [SY]_i}{1 - S_{ii}}\right)^2$$

where $S$ is the linear smoother matrix

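A small check of the shortcut formula (an illustrative sketch, assuming a ridge smoother with a fixed penalty $\alpha$, for which the identity is exact):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, alpha = 60, 20, 5.0              # alpha: fixed ridge penalty, reused unchanged in every LOO fit
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# ridge as a linear smoother: y_hat = S y with S = X (X^T X + alpha I)^(-1) X^T
S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
cv_shortcut = np.mean(((y - S @ y) / (1 - np.diag(S))) ** 2)

# explicit leave-one-out: refit n times with the same penalty
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.solve(X[mask].T @ X[mask] + alpha * np.eye(p), X[mask].T @ y[mask])
    errs.append((y[i] - X[i] @ b) ** 2)

print("shortcut CV_n:", cv_shortcut)
print("explicit CV_n:", np.mean(errs))   # the two agree up to numerical error
```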

Theoretical analysis for Least-Squares

Model evaluation

For any linear interpolator, $SY = Y$, so the shortcut gives

$$\frac{y_i - [SY]_i}{1 - S_{ii}} = \frac{0}{0}$$

in particular, for the min-norm interpolator $S = XX^T (XX^T)^{-1} = I_n$

Fortunately, we can solve this problem! Rewrite $S$ to be:

$$S = XX^T (XX^T + \lambda I_n)^{-1} \ , \quad \lambda \to 0^+$$


Theoretical analysis for Least-Squares

Model evaluation

Now we can apply L’Hôpital’s rule, differentiating with respect to $\lambda$ at $\lambda = 0$:

$$\frac{(y_i - [SY]_i)'}{(1 - S_{ii})'} = \frac{\left[XX^T (XX^T + \lambda I_n)^{-2} Y\right]_i}{\left[XX^T (XX^T + \lambda I_n)^{-2}\right]_{ii}} \Bigg|_{\lambda=0} = \frac{\left[(XX^T)^{-1} Y\right]_i}{\left[(XX^T)^{-1}\right]_{ii}}$$

Finally, the CV estimator can be calculated with the following formula:

$$CV_n = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{\left[(XX^T)^{-1} Y\right]_i}{\left[(XX^T)^{-1}\right]_{ii}}\right)^2$$

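An illustrative sketch verifying this closed form against a brute-force leave-one-out loop over min-norm fits in an over-parameterized setting:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 90                              # over-parameterized: gamma = 3
X = rng.normal(size=(n, p))
Y = 0.2 * (X @ rng.normal(size=p)) + rng.normal(size=n)

# closed-form LOO CV for the min-norm interpolator
K_inv = np.linalg.inv(X @ X.T)             # (X X^T)^{-1}
cv_closed = np.mean((K_inv @ Y / np.diag(K_inv)) ** 2)

# brute force: refit the min-norm interpolator n times, leaving one point out each time
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = X[mask].T @ np.linalg.solve(X[mask] @ X[mask].T, Y[mask])
    errs.append((Y[i] - X[i] @ b) ** 2)

print("closed-form CV_n:", cv_closed)
print("brute-force CV_n:", np.mean(errs))  # matches up to numerical error
```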

Theoretical analysis for Least-Squares

Ridge regression

The min-norm estimator is related to the Ridge regression estimator as follows:

$$\hat\beta = \lim_{\lambda \to 0^+} \hat\beta_\lambda$$

where $\hat\beta_\lambda$ is the Ridge regression estimator:

$$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\} = (X^T X + n\lambda I_p)^{-1} X^T Y$$

Thus, an optimally tuned $\hat\beta_\lambda$ should be better than $\hat\beta$

The limiting risk of the optimally tuned $\hat\beta_\lambda$ can also be written explicitly

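A quick numerical sketch of the limit $\hat\beta = \lim_{\lambda \to 0^+} \hat\beta_\lambda$, using the closed form above (an illustration; for very small $\lambda$ the remaining gap is eventually dominated by numerical conditioning):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 100
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

def ridge(lam):
    # closed form (X^T X + n*lam*I)^(-1) X^T Y from the slide
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

beta_min_norm = X.T @ np.linalg.solve(X @ X.T, Y)        # min-norm interpolator

for lam in (1.0, 1e-2, 1e-4, 1e-8):
    gap = np.linalg.norm(ridge(lam) - beta_min_norm)
    print(f"lambda={lam:8.0e}  ||ridge - min-norm|| = {gap:.2e}")
```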

Theoretical analysis for Least-Squares

Ridge regression

At a high level, we can simplify the optimal risk into:

$$R(\hat\beta_{\lambda^*}; \beta, \theta) \approx \|\theta\|_2^2 + f(\sigma^2; \gamma) + g(\|\beta\|_2^2; \gamma)$$

where $f(z; \gamma) \to 0$, $g(z; \gamma) \to z$ as $\gamma \to \infty$,

and $g(z; 0) = f(z; 0) = 0$

This looks like a trade-off between the observed and unobserved signals

Again, the distribution of the signal over γ may play a role...


Theoretical analysis for Least-Squares

Ridge regression - optimal risk curves


Theoretical analysis for Least-Squares

Optimal risk curves - misspecified model

$$R(\hat\beta_{\lambda^*}; \beta, \theta) \approx \|\theta\|_2^2 + f(\sigma^2; \gamma) + g(\|\beta\|_2^2; \gamma)$$

Optimal risk curves with $\|\theta\|_2^2 = (1+\gamma)^{-a}$, $a = 2$

Why is the minimum risk around $\gamma = 1$?

“... it seems we want the complexity of the feature space to put us as close to the interpolation boundary as possible...”

Theoretical analysis for Least-Squares

Optimal risk curves - misspecified model

$$R(\hat\beta_{\lambda^*}; \beta, \theta) \approx \|\theta\|_2^2 + f(\sigma^2; \gamma) + g(\|\beta\|_2^2; \gamma)$$

Optimal risk curves with:

$\sigma^2 = 1$, $r^2 = 5$

$\|\theta\|_2^2 = (1+\gamma)^{-a}$, $a \in \{1, 2, 4\}$

Curves get steeper as $r^2$ grows

What conclusions can we draw regarding Neural Networks?

Theoretical analysis for Least-Squares

Additional results - nonlinear features

Asymptotic variance in a nonlinear feature model, $x = \phi(Wz)$

Theoretical analysis for Least-Squares

Additional results - correlated features

Asymptotic variance and bias for an auto-regressive structure, $\Sigma_{ij} = \rho^{|i-j|}$

Reminder: $B(\hat\beta; \beta) = \beta^T \left(I_p - \mathbb{E}_X[X^T (XX^T)^{-1} X]\right) \beta$

Theoretical analysis for Least-Squares

Additional results - correlated features

Asymptotic risk for an auto-regressive structure, $\Sigma_{ij} = \rho^{|i-j|}$

Theoretical analysis for Least-Squares

CV-tuned Ridge regression

Finite-sample risks for the CV-tuned Ridge regression estimator compared to the asymptotic risk (20 independent training samples)

Summary

Summary

There is a growing interest in interpolators in ML

The double descent phenomenon must be well understood and taken into account for model optimization

The linear model analysis explains the bias-variance trade-off in the interpolation regime

The real-life trade-off:

A balance between the observed signal, bias, and variance

Controlled by complexity, regularization/early stopping
