Surrogates
Shai Shalev-Shwartz
Mobileye, an Intel Company / The Hebrew University of Jerusalem
Deep Learning: Science or Alchemy. Princeton 2019
Based on joint works with Amnon Shashua, Shaked Shammah, Ohad Shamir, Jonathan Fiat,
Eran Malach, and Yonatan Wexler
Learning theory pre deep learning
PAC model: expressivity, generalization, optimization
The expressivity-generalization tradeoff is well understood by VC theory
Optimization is easy (only?) for linear models
Linear models work well in practice (AdaBoost, SVM)
Machine learning theoreticians at the time were “spoiled” by the success of AdaBoost and SVM
Engineering is mainly about constructing features based on expert knowledge
Outline
1 Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory
2 Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks
3 Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals
Back in 2014: Deep Learning is Amazing ...
What’s Great about Deep Learning
Deep learning is essentially a “differentiable programming language”
Fast development
Accelerated computation via dedicated hardware
A differentiable programming language
[Diagram: a spectrum from expert systems through shallow learning to deep networks: less prior knowledge, more data; labeled “No Free Lunch”]
Back in 2015: Toward self driving beyond highways ...
Driving Policy - Challenges
Defensive/Aggressive Tradeoff — balance between cautiousness facing unexpected behavior of other drivers and not being too defensive (paranoia)
Negotiation/Communication — in dense traffic we must negotiatewith other drivers/pedestrians
“The rules of breaking the rules ...”
Dealing with Uncertainty — is there a kid behind this car? Is this taxi parking or just picking up a passenger? Does this pedestrian intend to cross the street?
Deep Reinforcement Learning is Amazing ...
Phase II: Hmmm ...
Sometimes things do not look good ...
You may think it works, but it doesn’t ...
Eric Schmidt: “Well, computer vision is a solved problem compared to human vision”
Typical vs. Rare Cases ...
Phase III: What?
What’s Wrong with Deep Learning?
Sometimes we think it works, but it actually doesn’t ...
Sometimes it fails... What to do when it fails?
It doesn’t find the “shortest” explanation
It is hard to interpret what exactly the solution is
... and many more problems (e.g. Marcus 2017, Yuille & Liu 2019, Jia & Liang 2017)
“tried DL, it doesn’t work for my problem”
The most uninformative sentence a Professor/CTO can hear: “I’ve tried X but it didn’t work”
Usually, if you understand “X” you can find out what went wrong
Very difficult to understand what went wrong if you don’t have a clue how “X” works ...
What to do when things do not work...
Piecewise-linear Curves
Problem: Train a piecewise-linear curve detector
Input: $f = (f(0), f(1), \dots, f(n-1))$ where
$$f(x) = \sum_{r=1}^{k} a_r \, [x - \theta_r]_+ \,, \qquad \theta_r \in \{0, \dots, n-1\}$$
Output: curve parameters $\{a_r, \theta_r\}_{r=1}^{k}$
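For concreteness, a minimal numpy sketch of this data-generation step (the number of pieces, grid size, and slope distribution are my assumptions):

```python
import numpy as np

def sample_curve(n=100, k=3, seed=0):
    """Sample f = (f(0), ..., f(n-1)) with f(x) = sum_r a_r * [x - theta_r]_+."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(k)             # slopes a_r (distribution is an assumption)
    theta = rng.integers(0, n, size=k)     # knots theta_r in {0, ..., n-1}
    x = np.arange(n)
    f = np.sum(a[:, None] * np.maximum(x[None, :] - theta[:, None], 0), axis=0)
    return f, a, theta

f, a, theta = sample_curve()
print(f.shape, theta)                      # (100,) and the k knot locations
```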
First try: Deep AutoEncoder
Encoding network, $E_{w_1}$: Dense(500, relu)-Dense(100, relu)-Dense(2k)
Decoding network, $D_{w_2}$: Dense(100, relu)-Dense(100, relu)-Dense(n)
Squared loss: $\|D_{w_2}(E_{w_1}(f)) - f\|^2$
Doesn’t work well ...
[Figure: reconstructions after 500, 10,000, and 50,000 iterations]
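A Keras-style sketch of the architecture above (the layer sizes follow the slide; the optimizer and other training details are my assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

n, k = 100, 3
encoder = keras.Sequential([
    layers.Dense(500, activation="relu", input_shape=(n,)),
    layers.Dense(100, activation="relu"),
    layers.Dense(2 * k),                      # a candidate code for {a_r, theta_r}
])
decoder = keras.Sequential([
    layers.Dense(100, activation="relu", input_shape=(2 * k,)),
    layers.Dense(100, activation="relu"),
    layers.Dense(n),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")  # squared loss (D(E(f)) - f)^2
# autoencoder.fit(F, F, epochs=...) with F a matrix of sampled curves
```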
It doesn’t find the “simplest” explanation
Can DL solve simple word math problems like: “There are 3 eggs in each box. How many eggs are in 2 boxes?”
Let’s start with a much simpler problem: how much is 3 times 2?
Learning to Multiply
Training data: only products of odd numbers are observed (blank cells are held out)

×    1    2    3    4    5
1    1         3         5
2
3    3         9        15
4
5    5        15        25
Learning to Multiply
True products, with the network’s predictions in parentheses:

×     1        2        3        4        5
1   1 (1)    2 (2)    3 (3)    4 (7)    5 (5)
2   2 (3)    4 (21)   6 (11)   8 (34)  10 (21)
3   3 (3)    6 (12)   9 (9)   12 (22)  15 (15)
4   4 (10)   8 (35)  12 (23)  16 (53)  20 (37)
5   5 (5)   10 (23)  15 (15)  20 (36)  25 (25)

The network fits the training cells (odd × odd) exactly but is far off on the held-out cells.
Learning to Multiply
[Plots: the network’s predictions of 3·i and 4·i as a function of i, for the ReLU network]
Learning to Multiply: Details
Represented the input as two sequences of bits (the binary representations of the two numbers), so their product is
$$\Bigl(\sum_i a_i 2^i\Bigr)\Bigl(\sum_j b_j 2^j\Bigr) = \sum_{i,j} a_i b_j \, 2^{i+j} = \sum_{i,j} [a_i + b_j - 1]_+ \, 2^{i+j}$$
where the last step uses the fact that $a_i b_j = [a_i + b_j - 1]_+$ for bits
So the product can be represented by a one-hidden-layer ReLU network (a sketch follows below)
Even though this simple solution exists, SGD learns a completely different solution
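A sketch of that construction (the bit width is an assumption): hidden unit $(i,j)$ reads bit $a_i$ of one number and bit $b_j$ of the other, and the output layer supplies the powers of two:

```python
import numpy as np

def bits(v, B=8):
    return np.array([(v >> i) & 1 for i in range(B)], dtype=float)

def relu_product(a, b, B=8):
    """Product of two B-bit numbers via a one-hidden-layer ReLU network."""
    x = np.concatenate([bits(a, B), bits(b, B)])      # input: the 2B bits
    total = 0.0
    for i in range(B):
        for j in range(B):
            w = np.zeros(2 * B)
            w[i], w[B + j] = 1.0, 1.0                 # unit (i,j) reads a_i and b_j
            h = max(w @ x - 1.0, 0.0)                 # ReLU with bias -1: [a_i+b_j-1]_+
            total += (2.0 ** (i + j)) * h             # output weight 2^(i+j)
    return total

print(relu_product(3, 2), relu_product(13, 11))       # 6.0 and 143.0
```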
The Role of Theory
We lack a theoretical understanding of what’s really going on:
Inductive bias: What is the inductive bias? How does it depend onthe architecture, the initialization, and the optimization algorithm?
Generalization: does “understanding deep learning require re-thinking generalization” (Zhang et al.)?
Optimization: when and how does gradient-based optimization work?
Example: Inductive bias of Initialization and Optimization
Theorem (Informal version of S., Shamir, Shammah 2016)
Fix some architecture. Any initialization scheme of a neural network induces an order over the set of functions expressible by the architecture, and gradient-based optimization is blind to all functions that are far in the induced order.
Plan of Attack
Plan of attack — Surrogates:
Build simpler models that behave similarly to DL on specific tasks, analyze them theoretically, and reflect on the behavior of DL, thereby gaining insight into inductive bias, generalization, and potential failure points
Build simpler distributions for which we understand the behavior of DL, and which give us insight into when and how DL works
Gated Linear Units (GaLU)
Rectified Linear Unit (ReLU):
$$f_w(x) = \max\{w^\top x,\, 0\} = \mathbb{1}_{[w^\top x > 0]} \cdot (w^\top x)$$
Gated Linear Unit (GaLU):
$$g_{w,u}(x) = \mathbb{1}_{[u^\top x > 0]} \cdot (w^\top x)$$
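In code, the difference is a single argument; a minimal numpy sketch:

```python
import numpy as np

def relu_unit(w, x):
    # f_w(x) = 1[w'x > 0] * (w'x): the gate and the linear part share w
    return float(w @ x > 0) * (w @ x)

def galu_unit(w, u, x):
    # g_{w,u}(x) = 1[u'x > 0] * (w'x): the gate u is decoupled from w
    return float(u @ x > 0) * (w @ x)

rng = np.random.default_rng(0)
w, u, x = rng.standard_normal((3, 5))
print(relu_unit(w, x), galu_unit(w, u, x))
print(galu_unit(w, w, x) == relu_unit(w, x))   # GaLU with u = w recovers ReLU
```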
ReLU and GaLU on multiplication table
[Plots: predictions of 3·i and 4·i as a function of i, for ReLU and GaLU networks]
ReLU, GaLU, and Linear networks on simple datasets
One hidden layer GaLU networks, single output
Optimization:
The gates are untrainable by SGD (the gradient with respect to them is zero almost everywhere)
The weights are provably trainable (convex problem)
$$N(x) = \sum_j \alpha_j \, g_{w_j,u_j}(x) = \sum_j g_{\alpha_j w_j,\, u_j}(x) = \sum_j g_{\tilde w_j, u_j}(x) = \sum_j \mathbb{1}_{[u_j^\top x > 0]} \cdot (\tilde w_j^\top x)$$
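With the gates frozen, $N(x)$ is linear in the stacked weights $\tilde w_j$, so squared-loss training reduces to ordinary least squares. A numpy sketch (sizes and data are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 200, 10, 8
X = rng.standard_normal((m, d))                 # data
y = rng.standard_normal(m)                      # targets
U = rng.standard_normal((k, d))                 # frozen random gates u_j

gates = (X @ U.T > 0).astype(float)             # (m, k): 1[u_j'x > 0]
# Column block j of the design matrix holds the inputs gated by neuron j
Xbar = (gates[:, :, None] * X[:, None, :]).reshape(m, k * d)

w_tilde, *_ = np.linalg.lstsq(Xbar, y, rcond=None)   # global optimum, no SGD needed
print("train MSE:", np.mean((Xbar @ w_tilde - y) ** 2))
```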
One hidden layer GaLU networks, multiple outputs
Optimization:
A convex problem with low-rank constraints
Approximations are well known, but the constraints do not seem to matter
One hidden layer: Expressivity
Every ReLU network can be expressed by a GaLU network (simply set $u_j = w_j$)
But we do not optimize over the $u_j$
So the question is: what can be expressed by a GaLU network, assuming the $u_j$ are fixed at their initialization?
GaLU Networks Can Memorize
A simple measure of model capacity: can we memorize a random sample?
Theorem
Assume $m$ random examples, consider a GaLU network with $k$ neurons, and let $d$ be the input dimension. Then
$$\mathbb{E}\left[\min_w L_S(w)\right] = 1 - \frac{\operatorname{rank}(\bar X)}{m}$$
where $\bar X \in \mathbb{R}^{m \times kd}$ is composed of $k$ repetitions of the data matrix, gated by the random GaLU filters.

The rank is approximately full, meaning that we can memorize a dataset whenever $kd \gtrsim m$
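The theorem is easy to probe numerically with the gated design matrix $\bar X$ from the previous sketch (sizes are assumptions; the loss is normalized so that predicting zero gives loss 1):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, k = 100, 20, 6                        # kd = 120 >= m = 100: full rank expected
X = rng.standard_normal((m, d))
y = rng.standard_normal(m)
U = rng.standard_normal((k, d))

gates = (X @ U.T > 0).astype(float)
Xbar = (gates[:, :, None] * X[:, None, :]).reshape(m, k * d)

r = np.linalg.matrix_rank(Xbar)
w, *_ = np.linalg.lstsq(Xbar, y, rcond=None)
residual = np.sum((Xbar @ w - y) ** 2) / np.sum(y ** 2)
print(f"rank = {r}, 1 - rank/m = {1 - r/m:.3f}, normalized residual = {residual:.3f}")
```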
GaLU vs. ReLU for memorization
Clustered Piece-wise Linear Models
GaLU networks can be shown to be optimal for highly clustered piecewise-linear distributions
But even if the data comes exactly from such a distribution, the learned weights are a random linear combination of the “real” weights
A hint as to why DNNs are uninterpretable
Rethinking Generalization
Zhang et al.: “Understanding deep learning requires re-thinking generalization”
Learning theory bounds usually take the form
$$L_{\mathcal{D}}(h) \le L_S(h) + \left(\frac{c(\mathcal{H})}{m}\right)^{p}$$
where $c(\mathcal{H})$ is some capacity measure of the hypothesis class
Zhang et al. showed that SGD on the same network architecture overfits random labels yet generalizes on “true” labels
Rethinking Generalization
Not surprising for linear regression: we need only $d$ examples to fit a linear model if there is no noise at all, and $d/\epsilon$ examples to get $\epsilon$ error in the noisy case
Rosset and Tibshirani 2018: roughly $\sigma^2 d/\epsilon$ examples are needed to get $\epsilon$ error for a noisy linear model
Since GaLU networks are almost linear regression models, it may be possible to derive similar bounds for them
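A quick simulation of this linear-regression baseline (dimension, noise level, and test-set size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20, 0.5
w_star = rng.standard_normal(d)

def excess_error(m, n_test=5000):
    X = rng.standard_normal((m, d))
    y = X @ w_star + sigma * rng.standard_normal(m)   # noisy linear model
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    X_test = rng.standard_normal((n_test, d))
    return np.mean((X_test @ (w_hat - w_star)) ** 2)

for m in (2 * d, 10 * d, 100 * d):
    print(m, excess_error(m), sigma**2 * d / m)       # error scales like sigma^2 d / m
```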
Surrogate Distributions
We know that in the worst case DL cannot work
On some real datasets, DL works pretty well
Since modeling real data is hard, let’s search for surrogate distributions
Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 43 / 57
Surrogate Distributions
What is a good surrogate distribution?
Objective 1: “identify the set of distributions for which DL works”
too hard
Objective 2: “identify a set of distributions which contains some real distributions and for which DL provably works”
still too hard
Objective 3: “identify a set of distributions which doesn’t necessarily contain any interesting real distribution but gives us insight on when and how DL works”
SGD finds latent features
Deep Generative Model for (synthetic) images
Given a label, generate a small-scale image, where each “pixel” represents a semantic class (e.g. sky, grass, ...)
Given a small semantic image, generate a larger semantic image by sampling a semantic “patch” from each semantic “pixel”
[Diagram: the label $y$ induces a distribution $D_y$ over a coarse semantic image $x^{(0)}_1, \dots, x^{(0)}_4$; a generator $G_1$ expands each semantic pixel into a patch of the next-level image $x^{(1)}_1, \dots$, and so on level by level]
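A toy sampler for a two-level model of this kind (all sizes, class counts, and patch pools are illustrative assumptions, not the construction from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, n_classes, coarse, pool = 10, 5, 4, 8

# D_y: per-label distribution over semantic classes of the coarse pixels
D = rng.dirichlet(np.ones(n_classes), size=n_labels)
# Each semantic class owns a pool of 2x2 semantic patches to expand with
patches = rng.integers(0, n_classes, size=(n_classes, pool, 2, 2))

def sample(y):
    x0 = rng.choice(n_classes, size=(coarse, coarse), p=D[y])   # coarse image
    x1 = np.zeros((2 * coarse, 2 * coarse), dtype=int)          # next level
    for i in range(coarse):
        for j in range(coarse):
            x1[2*i:2*i+2, 2*j:2*j+2] = patches[x0[i, j], rng.integers(pool)]
    return x0, x1

x0, x1 = sample(y=3)
print(x0.shape, x1.shape)   # (4, 4) -> (8, 8); deeper levels repeat the expansion
```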
SGD finds latent features
Theorem (Informal, Malach and S.)
Training a two-layer conv net on an intermediate image using SGD will implicitly learn an embedding of the observed patches into a space such that patches from the same semantic class are close to each other, while patches from different classes are far apart.

Enables layer-by-layer reconstruction of the latent features
Depth Efficiency
Basic question: for which distributions are deeper networks much better than shallow ones?
Many recent results show depth efficiency: there exist functions which can be expressed by a small deep network but require exponential width to be expressed by a shallow network
Even though such functions exist, it doesn’t mean SGD can find them
Fractal Distributions
Iterated Fractal Distribution:
$$K_0 = [-1, 1]^d, \qquad K_n = F_1(K_{n-1}) \cup \dots \cup F_k(K_{n-1})$$
for a fixed set of maps $F_1, \dots, F_k$
The “depth” of the fractal is $n$
A “fractal distribution” is a distribution in which positive examples are sampled from the set $K_n$ and negative examples are sampled from its complement
[Figure: $K_0$ (the full square), $K_1 = F_1(K_0) \cup \dots \cup F_4(K_0)$, and $K_2$, for four maps $F_1, \dots, F_4$]
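A one-dimensional sketch: with two affine maps this produces the middle-thirds Cantor construction (the choice of maps and the sampling scheme are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
maps = [lambda x: x / 3.0, lambda x: x / 3.0 + 2.0 / 3.0]   # two affine F_i on [0, 1]

def sample_Kn(n, size=1000):
    """Approximate samples from K_n: start in K_0 = [0,1], apply n random maps.

    A composition F_{i_1}(...(F_{i_n}(x))) of n maps applied to K_0 lands in K_n.
    """
    x = rng.uniform(0.0, 1.0, size)                         # K_0
    for _ in range(n):
        pick = rng.integers(0, 2, size).astype(bool)
        x = np.where(pick, maps[0](x), maps[1](x))
    return x

positives = sample_Kn(n=4)
negatives = rng.uniform(0.0, 1.0, 1000)   # crude negatives: some land in K_n and
print(positives.min(), positives.max())   # a careful sampler would reject them
```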
Depth Separation
If the $F_i$ are affine, a network of depth $O(n)$ can express a depth-$n$ fractal, but a shallow network requires exponential width
Approximation curve: how well does a network of depth $t$ express a depth-$n$ fractal?
We show that SGD works only if the approximation curve is “nice”
Approximation Curve: coarse vs. fine
Success of SGD depends on the Approximation Curve
One Dimensional Cantor Fractals
$$N_{c_1,c_2,c_3,b_3}(x) = \bigl|\, \bigl|\, |x - c_1| - c_2 \,\bigr| - c_3 \,\bigr| - b_3$$
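Each absolute value acts as one “folding” layer, and every fold can double the number of linear pieces; this is the depth-efficiency mechanism in one dimension. A minimal sketch (the fold points and threshold are illustrative, not tuned to a particular fractal):

```python
import numpy as np

def fold_net(x, cs, b):
    """N(x) = | ... ||x - c_1| - c_2| ... - c_t| - b as t nested folds."""
    for c in cs:
        x = np.abs(x - c)      # one layer: fold the line around c
    return x - b

# t folds give a piecewise-linear function with up to 2^t pieces using t units;
# matching it with one hidden layer needs on the order of 2^t ReLUs.
x = np.linspace(0.0, 1.0, 11)
print(fold_net(x, cs=[0.5, 0.25, 0.125], b=0.05))
```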
Summary
We lack a good understanding of basic properties of DL:
How to control the inductive bias
How does it depend on the initialization and the optimization algorithm
A pursuit of good surrogates
Surrogate models
Surrogate distributions
Summary
In practice, deep learning is just one important piece of a bigger puzzle.