
Surrogates

Shai Shalev-Shwartz

Mobileye, an Intel Company / The Hebrew University of Jerusalem

Deep Learning: Science or Alchemy. Princeton 2019

Based on joint works with Amnon Shashua, Shaked Shammah, Ohad Shamir, Jonathan Fiat,

Eran Malach, and Yonatan Wexler

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 1 / 57

Learning theory before deep learning

PAC model: expressivity, generalization, optimization

The expressivity-generalization tradeoff is well understood via VC theory

Optimization is easy (only?) for linear models

Linear models work well in practice (AdaBoost, SVM)

Machine learning theoreticians at the time were “spoiled” by the success of AdaBoost and SVM

Engineering is mainly about constructing features based on expert knowledge

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 2 / 57

Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 3 / 57

Back in 2014: Deep Learning is Amazing ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 4 / 57

Back in 2014: Deep Learning is Amazing ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 5 / 57

What’s Great about Deep Learning

Deep learning is essentially a “differentiable programming language”

Fast development

Accelerate computation by dedicated hardware

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 6 / 57

A differentiable programming language

[Figure: a spectrum from expert systems (more prior knowledge, less data) through shallow learning to deep networks (less prior knowledge, more data); there is No Free Lunch]

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 7 / 57

Back in 2015: Toward self driving beyond highways ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 8 / 57

Driving Policy - Challenges

Defensive/Aggressive Tradeoff — balance between cautiousness facing unexpected behavior of other drivers and not being too defensive (paranoia)

Negotiation/Communication — in dense traffic we must negotiate with other drivers/pedestrians

“The rules of breaking the rules ...”

Dealing with Uncertainty — is there a kid behind this car? Is this taxi parking or just picking up a passenger? Does this pedestrian intend to cross the street?

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 9 / 57

Deep Reinforcement Learning is Amazing ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 10 / 57

Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 11 / 57

Sometimes things do not look good ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 12 / 57

Typical vs. Rare Cases ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 14 / 57

Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 15 / 57

What’s Wrong with Deep Learning?

Sometimes we think it works, but it actually doesn’t ...

Sometimes it fails... What to do when it fails?

It doesn’t find the “shortest” explanation

It is hard to interpret what exactly the solution is

... and many more problems (e.g., Marcus 2017, Yuille & Liu 2019, Jia & Liang 2017)

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 16 / 57

“tried DL, it doesn’t work for my problem”

The most uninformative sentence a Professor/CTO can hear: “I’ve tried X but it didn’t work”

Usually, if you understand “X” you can find out what went wrong

Very difficult to understand what went wrong if you don’t have a clue how “X” works ...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 17 / 57

What to do when things do not work...

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 18 / 57

Piecewise-linear Curves

Problem: Train a piecewise-linear curve detector

Input: $f = (f(0), f(1), \ldots, f(n-1))$ where

$$f(x) = \sum_{r=1}^{k} a_r \, [x - \theta_r]_+ \,, \qquad \theta_r \in \{0, \ldots, n-1\}$$

Output: curve parameters $\{a_r, \theta_r\}_{r=1}^{k}$

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 19 / 57
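To make the setup concrete, here is a minimal generator sketch (my own code, not from the talk; the sizes and the sampling distributions for $a_r$, $\theta_r$ are arbitrary choices):

```python
import numpy as np

def sample_curve(n=100, k=3, rng=np.random.default_rng(0)):
    """Sample f(x) = sum_r a_r * [x - theta_r]_+ on the grid x = 0, ..., n-1."""
    a = rng.normal(size=k)                              # slopes a_r
    theta = rng.integers(0, n, size=k)                  # knots theta_r in {0, ..., n-1}
    x = np.arange(n)
    ramps = np.maximum(x[None, :] - theta[:, None], 0)  # [x - theta_r]_+, shape (k, n)
    return ramps.T @ a, a, theta                        # the curve f and its parameters

f, a, theta = sample_curve()
print(f.shape, a.round(2), theta)                       # (100,) plus the targets to recover
```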

First try: Deep AutoEncoder

Encoding network $E_{w_1}$: Dense(500, relu) - Dense(100, relu) - Dense(2k)

Decoding network $D_{w_2}$: Dense(100, relu) - Dense(100, relu) - Dense(n)

Squared loss: $\|D_{w_2}(E_{w_1}(f)) - f\|^2$

Doesn’t work well ...

[Figure: reconstructed curves after 500, 10,000, and 50,000 iterations]

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 20 / 57
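A minimal sketch of the autoencoder just described (the layer sizes follow the slide; the data generator, optimizer, and training lengths are my own guesses, so treat this as illustrative rather than the talk's actual code):

```python
import numpy as np
from tensorflow import keras

n, k = 100, 3
rng = np.random.default_rng(0)

def sample_curve():
    """A piecewise-linear curve f(x) = sum_r a_r [x - theta_r]_+ on x = 0, ..., n-1."""
    a, theta = rng.normal(size=k), rng.integers(0, n, size=k)
    return np.maximum(np.arange(n)[None, :] - theta[:, None], 0).T @ a

F = np.stack([sample_curve() for _ in range(10_000)])   # training set of curves

# Encoder Dense(500, relu)-Dense(100, relu)-Dense(2k),
# decoder Dense(100, relu)-Dense(100, relu)-Dense(n), squared reconstruction loss.
autoencoder = keras.Sequential([
    keras.Input(shape=(n,)),
    keras.layers.Dense(500, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(2 * k),                          # bottleneck: 2k numbers, like {a_r, theta_r}
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(n),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(F, F, epochs=10, batch_size=128, verbose=0)
```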


It doesn’t find the “simplest” explanation

Can DL solve simple word math problems like: “There are 3 eggs in each box. How many eggs are in 2 boxes?”

Let’s start with a much simpler problem: how much is 3 × 2?

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 21 / 57


Learning to Multiply

        1    2    3    4    5
   1    1         3         5
   2
   3    3         9        15
   4
   5    5        15        25

(Only the entries where both factors are odd are filled in.)

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 22 / 57

Learning to Multiply

        true product (i · j)        learned prediction
   1    1   2   3   4   5           1   2   3   7   5
   2    2   4   6   8   10          3   21  11  34  21
   3    3   6   9   12  15          3   12  9   22  15
   4    4   8   12  16  20          10  35  23  53  37
   5    5   10  15  20  25          5   23  15  36  25

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 22 / 57

Learning to Multiply

[Figure: the learned values of 3·i and 4·i as a function of i, for the ReLU network]

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 23 / 57

Learning to Multiply: Details

Represent the input as two sequences of bits (the binary representations of the two numbers), so their product is

$$\left(\sum_i a_i 2^i\right)\left(\sum_j b_j 2^j\right) = \sum_{i,j} a_i b_j \, 2^{i+j} = \sum_{i,j} [a_i + b_j - 1]_+ \, 2^{i+j},$$

where the last step uses the fact that for bits $a_i, b_j \in \{0, 1\}$ we have $a_i b_j = [a_i + b_j - 1]_+$.

Can represent product with one hidden layer ReLU network

Even though a simple solution exists, SGD learns a completely different solution

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 24 / 57
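A quick numerical check of the identity on the slide above (my own sketch): every bit product $a_i b_j$ is one ReLU unit, so the full product is a single hidden layer of ReLUs followed by a fixed linear readout with weights $2^{i+j}$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def product_via_relu(a_bits, b_bits):
    """(sum_i a_i 2^i) * (sum_j b_j 2^j) computed with one hidden layer of ReLUs."""
    i = np.arange(len(a_bits))
    j = np.arange(len(b_bits))
    hidden = relu(a_bits[:, None] + b_bits[None, :] - 1.0)  # equals a_i * b_j for bits
    readout = 2.0 ** (i[:, None] + j[None, :])              # fixed output weights 2^(i+j)
    return float((hidden * readout).sum())

a, b = 13, 7
a_bits = np.array([(a >> i) & 1 for i in range(8)], dtype=float)
b_bits = np.array([(b >> j) & 1 for j in range(8)], dtype=float)
assert product_via_relu(a_bits, b_bits) == a * b            # 91
```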

Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 25 / 57

The Role of Theory

We lack a theoretical understanding of what’s really going on:

Inductive bias: What is the inductive bias? How does it depend on the architecture, the initialization, and the optimization algorithm?

Generalization: does “understanding deep learning require re-thinking generalization” (Zhang et al.)?

Optimization: When and how does gradient-based optimization work?

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 26 / 57

Example: Inductive bias of Initialization and Optimization

Theorem (Informal version of S., Shamir, Shammah 2016)

Fix some architecture. Any initialization scheme of a neural network induces an order over the set of functions expressible by the architecture, and gradient-based optimization is blind to all functions which are far in the induced order

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 27 / 57

Plan of Attack

Plan of attack — Surrogates:

Build simpler models that behave similarly to DL on specific tasks, analyze them theoretically, and reflect on the behavior of DL, thereby gaining insight on inductive bias, generalization, and potential failure points

Build simpler distributions for which we understand the behavior of DL and which give us insight as to when and how DL works

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 28 / 57

Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 29 / 57

Gated Linear Units (GaLU)

Rectified Linear Unit (ReLU):

$$f_w(x) = \max\{w^\top x,\, 0\} = \mathbb{1}_{[w^\top x > 0]} \cdot (w^\top x)$$

Gated Linear Unit (GaLU):

$$g_{w,u}(x) = \mathbb{1}_{[u^\top x > 0]} \cdot (w^\top x)$$

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 30 / 57
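A minimal sketch of the two units (my own code, only to make the definitions concrete): the GaLU unit applies the linear map $w^\top x$ but gates it by a separate, fixed vector $u$; with $u = w$ it coincides with the ReLU unit.

```python
import numpy as np

def relu_unit(x, w):
    """f_w(x) = 1[w^T x > 0] * (w^T x) = max(w^T x, 0)."""
    z = w @ x
    return z if z > 0 else 0.0

def galu_unit(x, w, u):
    """g_{w,u}(x) = 1[u^T x > 0] * (w^T x); the gate u stays fixed, only w is trained."""
    return (w @ x) if (u @ x) > 0 else 0.0

rng = np.random.default_rng(0)
x, w, u = rng.normal(size=5), rng.normal(size=5), rng.normal(size=5)
assert galu_unit(x, w, w) == relu_unit(x, w)   # GaLU with u = w is exactly a ReLU
```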


ReLU and GaLU on multiplication table

[Figure: the learned values of 3·i and 4·i as a function of i, for a ReLU network and for a GaLU network]

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 31 / 57

ReLU, GaLU, and Linear networks on simple datasets

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 32 / 57

One hidden layer GaLU networks, single output

Optimization:

The gates are not trainable by SGD (the gradient with respect to them is always zero)

The weights are provably trainable (convex problem)

$$N(x) = \sum_j \alpha_j \, g_{w_j, u_j}(x) = \sum_j g_{\alpha_j w_j,\, u_j}(x) = \sum_j g_{\tilde{w}_j, u_j}(x) = \sum_j \mathbb{1}_{[u_j^\top x > 0]} \cdot \left(\tilde{w}_j^\top x\right)$$

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 33 / 57
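To see why the optimization problem is convex, note that with the gates fixed the network is linear in a vector of “gated copies” of $x$. A small numerical check of this (my own sketch, with arbitrary sizes; the $\alpha_j$ are absorbed into $\tilde{w}_j$ as on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 4
U = rng.normal(size=(k, d))                       # fixed gates u_j (gradient w.r.t. them is zero)
W = rng.normal(size=(k, d))                       # trainable weights w~_j

def galu_net(x):
    gates = (U @ x > 0).astype(float)             # 1[u_j^T x > 0]
    return float(gates @ (W @ x))                 # sum_j 1[u_j^T x > 0] * (w~_j^T x)

def gated_features(x):
    gates = (U @ x > 0).astype(float)
    return (gates[:, None] * x[None, :]).ravel()  # phi(x) in R^{kd}

x = rng.normal(size=d)
assert np.isclose(galu_net(x), W.ravel() @ gated_features(x))   # N(x) is linear in phi(x)
```

Fitting $W$ under the squared loss is therefore plain least squares over the features $\phi(x)$, which is the convex problem referred to above.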

One hidden layer GaLU networks, multiple outputs

Optimization:

A convex problem with low-rank constraints

Approximations are well known, but the constraints do not seem to matter

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 34 / 57

One hidden layer: Expressivity

Every ReLU network can be expressed by a GaLU network (simply set $u_j = w_j$)

But we do not optimize over the $u_j$

So the question is: what can be expressed by a GaLU network, assuming that the $u_j$ are fixed at their initialization?

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 35 / 57

GaLU networks can Memorize

A simple measure of model capacity: can we memorize a random sample?

Theorem

Assume $m$ random examples and consider a GaLU network with $k$ neurons, and let $d$ be the input dimension. Then

$$\mathbb{E}\left[\min_w L_S(w)\right] = 1 - \frac{\operatorname{rank}(\bar{X})}{m}$$

where $\bar{X} \in \mathbb{R}^{m \times kd}$ is composed of $k$ repetitions of the data matrix, gated by the random GaLU filters.

The rank is approximately full, meaning that we can memorize a dataset if $kd \gtrsim m$

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 36 / 57
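A small experiment checking the formula (my own sketch; random ±1 labels and the mean squared error are my way of making “random examples” and $L_S$ concrete): build the gated matrix $\bar{X}$, fit the weights by least squares, and compare the training loss to $1 - \operatorname{rank}(\bar{X})/m$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 200, 10, 30                                    # kd = 300 > m: memorization should succeed
X = rng.normal(size=(m, d))                              # data matrix
y = rng.choice([-1.0, 1.0], size=m)                      # random labels
U = rng.normal(size=(k, d))                              # random, fixed GaLU gates

gates = (X @ U.T > 0).astype(float)                      # (m, k): 1[u_j^T x_i > 0]
X_bar = (gates[:, :, None] * X[:, None, :]).reshape(m, k * d)   # k gated copies of X

w, *_ = np.linalg.lstsq(X_bar, y, rcond=None)            # the convex (least-squares) fit
train_loss = np.mean((X_bar @ w - y) ** 2)
predicted = 1.0 - np.linalg.matrix_rank(X_bar) / m       # the theorem's value (in expectation)
print(train_loss, predicted)                             # both ~0 here, since rank(X_bar) = m
```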

GaLU vs. ReLU for memorization

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 37 / 57

Clustered Piece-wise Linear Models

GaLU networks can be shown to be optimal for highly clustered piecewise-linear distributions

But even if the data comes exactly from such a distribution, the learned weights are a random linear combination of the “real” weights

A hint as to why DNNs are uninterpretable

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 38 / 57

Rethinking Generalization

Zhang et al.: Understanding deep learning requires re-thinking generalization

Learning theory bounds usually take the form

$$L_D(h) \le L_S(h) + \left(\frac{c(\mathcal{H})}{m}\right)^{p}$$

where $c(\mathcal{H})$ is some capacity measure of the hypothesis class

Zhang et al. showed that SGD on the same network architecture overfits random labels and generalizes on “true” labels

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 39 / 57

Rethinking Generalization

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 40 / 57

Rethinking Generalization

Not surprising for linear regression: we need only d examples to fit a linear model if there is no noise at all, and d/ε examples to get ε error in the noisy case

Rosset and Tibshirani 2018: need roughly σ²d/ε examples to get ε error for a noisy linear model

Since GaLU networks are almost linear regression models, it may be possible to derive similar bounds for GaLU networks

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 41 / 57

Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 42 / 57

Surrogate Distributions

We know that in the worst case DL cannot work

On some real datasets, DL works pretty well

Since modeling real data is hard, let’s search for surrogate distributions

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 43 / 57

Surrogate Distributions

What is a good surrogate distribution:

Objective 1: “identify the set of distributions for which DL works”

too hard

Objective 2: “identify a set of distributions which contains some real distributions and for which DL provably works”

still too hard

Objective 3: “identify a set of distributions which doesn’t necessarily contain any interesting real distribution but gives us insight on when and how DL works”

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 44 / 57


Outline

1 Prelude: My Personal Journey into Deep LearningPhase I: Wow!Phase II: Hmmm ...Phase III: What ?The role of theory

2 Surrogate Model: Separating Gating from LinearityGated Linear Units (GaLU)One hidden layer GaLU networks

3 Surrogate DistributionsSGD finds latent featuresDepth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 45 / 57

Deep Generative Model for (synthetic) images

Given a label, generate a small-scale image, where each “pixel” represents a semantic class (e.g. sky, grass, ...)

Given a small semantic image, generate a larger semantic image by sampling a semantic “patch” from each semantic “pixel”

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 46 / 57

Deep Generative Model for (synthetic) images

[Figure: the generative model as a tree — the label y is mapped (via D_y, C_{m_0}) to a coarse semantic image x^(0)_1, ..., x^(0)_4, which is then expanded (via G_1, C_{m_{s_1}}) into a finer semantic image x^(1)_1, ..., x^(1)_4, and so on]

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 47 / 57
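A toy version of this generator (entirely my own construction to illustrate the two steps above; the number of classes, image sizes, and “patch” model are invented): a label produces a small semantic image, and each semantic “pixel” is then replaced by a patch sampled from that pixel's class.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, COARSE, PATCH = 4, 4, 3                 # e.g. classes: sky, grass, road, car

def sample_coarse(label):
    """Step 1: label -> small semantic image (each entry is a semantic class id)."""
    probs = np.ones(NUM_CLASSES)
    probs[label] += 3.0                              # the labelled class is over-represented
    return rng.choice(NUM_CLASSES, size=(COARSE, COARSE), p=probs / probs.sum())

def sample_patch(cls):
    """Step 2: a semantic 'pixel' -> a PATCH x PATCH patch drawn from that class."""
    return cls + 0.1 * rng.normal(size=(PATCH, PATCH))   # toy class-conditional patch model

def generate_image(label):
    coarse = sample_coarse(label)
    rows = [np.hstack([sample_patch(c) for c in row]) for row in coarse]
    return coarse, np.vstack(rows)                   # latent semantic image and observed image

semantic, image = generate_image(label=2)
print(semantic.shape, image.shape)                   # (4, 4) (12, 12)
```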

SGD finds latent features

Theorem (Informal, Malach and S.)

Training a two-layer Conv net on an intermediate image using SGD will implicitly learn an embedding of the observed patches into a space such that patches from the same semantic class are close to each other, while patches from different classes are far.

Enables layer-by-layer reconstruction of the latent features

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 48 / 57


Outline

1. Prelude: My Personal Journey into Deep Learning
   Phase I: Wow!
   Phase II: Hmmm ...
   Phase III: What?
   The role of theory

2. Surrogate Model: Separating Gating from Linearity
   Gated Linear Units (GaLU)
   One hidden layer GaLU networks

3. Surrogate Distributions
   SGD finds latent features
   Depth Efficiency in Fractals

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 49 / 57

Depth Efficiency

Basic question: on which distributions are deeper networks much better than shallow ones?

Many recent results show depth efficiency: there exist functions which can be expressed by a small deep network but require exponential width in order to be expressed by a shallow network

Even though such functions exist, it doesn’t mean SGD can find them

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 50 / 57


Fractal Distributions

Iterated Fractal Distribution:

$$K_0 = [-1, 1]^d, \qquad K_n = F_1(K_{n-1}) \cup \ldots \cup F_n(K_{n-1})$$

The “depth” of the fractal is n

A “fractal distribution” is a distribution in which positive examples are sampled from the set $K_n$ and negative examples are sampled from its complement

[Figure: K_0 (the square), K_1 obtained by applying the maps F_1, ..., F_4 to K_0, and K_2 obtained by applying them again]

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 51 / 57
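A sketch of sampling positive examples from $K_n$ (my own code; the four affine maps, which shrink toward the corners of the square, are an arbitrary illustrative choice): a point of $K_n$ is obtained by pushing a point of $K_0$ through $n$ randomly chosen maps.

```python
import numpy as np

rng = np.random.default_rng(0)
CORNERS = np.array([[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]])

def sample_positive(n, num_points=1000):
    """Sample from K_n, where K_0 = [-1, 1]^2 and K_n is the union of F_i(K_{n-1})."""
    x = rng.uniform(-1.0, 1.0, size=(num_points, 2))        # points of K_0
    for _ in range(n):                                       # apply n randomly chosen maps
        idx = rng.integers(0, len(CORNERS), size=num_points)
        x = 0.5 * x + CORNERS[idx]                           # F_i(x) = x/2 + corner_i (affine)
    return x

pos = sample_positive(n=3)
print(pos.min(), pos.max())                                  # stays inside [-1, 1]^2
```

Negative examples would be drawn from the complement of $K_n$ inside $[-1, 1]^d$, e.g. by rejection sampling against a membership test for $K_n$.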

Depth Separation

If the $F_i$ are affine, a network of depth O(n) can express a depth-n fractal, but a shallow network requires exponential width

Approximation curve: how well does a network of depth t express a depth-n fractal?

We show that SGD works only if the approximation curve is “nice”

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 52 / 57

Approximation Curve: coarse vs. fine

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 53 / 57

Success of SGD depends on the Approximation Curve

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 54 / 57

One Dimensional Cantor Fractals

$$N_{c_1,c_2,c_3,b_3}(x) = \Big|\,\big|\,|x - c_1| - c_2\,\big| - c_3\,\Big| - b_3$$

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 55 / 57
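Since $|z| = [z]_+ + [-z]_+$, each absolute value above is a pair of ReLU units, so $N_{c_1,c_2,c_3,b_3}$ is a depth-3 ReLU network with two units per layer. A tiny sketch of mine (the constants below are arbitrary placeholders, not values from the talk):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def abs_via_relu(z):
    """|z| = [z]_+ + [-z]_+ : one layer of two ReLU units."""
    return relu(z) + relu(-z)

def cantor_net(x, c1, c2, c3, b3):
    """N_{c1,c2,c3,b3}(x) = |||x - c1| - c2| - c3| - b3, built from three ReLU layers."""
    h = abs_via_relu(x - c1)
    h = abs_via_relu(h - c2)
    h = abs_via_relu(h - c3)
    return h - b3

x = np.linspace(-1.0, 1.0, 5)
print(cantor_net(x, c1=0.0, c2=0.5, c3=0.25, b3=0.1))   # the sign of N(x) gives the label
```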

Summary

We lack a good understanding of basic properties of DL:

How to control the inductive bias
How does it depend on the initialization and the optimization algorithm

A pursuit after good surrogates

Surrogate algorithms
Surrogate distributions

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 56 / 57

Summary

In practice, Deep Learning is just one important piece of a bigger puzzle.

Shai Shalev-Shwartz (ME, HUJI) Surrogates 2019 57 / 57
