Principled Deep Neural Network Training through Linear Programming

Daniel Bienstock¹, Gonzalo Muñoz², Sebastian Pokutta³

January 9, 2019

¹ IEOR, Columbia University
² IVADO, Polytechnique Montréal
³ ISyE, Georgia Tech

1
-
“...I’m starting to look at machine learning problems”
Oktay Günlük’s research interests, Aussois 2019
2
-
Goal of this talk
-
Goal of this talk

• Deep Learning is receiving significant attention due to its impressive performance.
• Unfortunately, results on the complexity of training deep neural networks have only recently been obtained.
• Our goal: to show that large classes of Neural Networks can be trained to near optimality using linear programs whose size is linear in the data.

3
-
Empirical Risk Minimization problem

Given:
• D data points (x̂_i, ŷ_i), i = 1, …, D, with x̂_i ∈ R^n, ŷ_i ∈ R^m
• A loss function ℓ : R^m × R^m → R (not necessarily convex)

Compute f : R^n → R^m to solve

  min_f  (1/D) ∑_{i=1}^D ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
  s.t.   f ∈ F (some class)

4
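As a concrete illustration (not from the talk), here is a minimal numerical sketch of this objective for the linear-regression instance of F that appears on the next slide; the data and dimensions are made up for the example.

```python
import numpy as np

# Hypothetical illustration: the ERM objective for f(x) = A x + b with
# squared ell_2 loss. The data (X, Y) below is randomly generated.
rng = np.random.default_rng(0)
D, n, m = 50, 3, 2                        # D data points, x_i in R^n, y_i in R^m
A_true = rng.uniform(-1, 1, size=(n, m))
X = rng.uniform(-1, 1, size=(D, n))       # rows are the x_i
Y = X @ A_true                            # y_i generated by a noiseless linear map

def empirical_risk(A, b, X, Y):
    """(1/D) * sum_i ||x_i A + b - y_i||_2^2."""
    residuals = X @ A + b - Y
    return float(np.mean(np.sum(residuals ** 2, axis=1)))

# the all-zero predictor has strictly positive risk on this data
print(empirical_risk(np.zeros((n, m)), np.zeros(m), X, Y))
```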
-
Empirical Risk Minimization problem

  min_f  (1/D) ∑_{i=1}^D ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
  s.t.   f ∈ F (some class)

Examples:
• Linear Regression. f(x) = Ax + b with ℓ_2 loss.
• Binary Classification. Varying architectures for f and cross-entropy loss:
  ℓ(p, y) = −y log(p) − (1 − y) log(1 − p)
• Neural Networks with k layers.
  f(x) = T_{k+1} ◦ σ ◦ T_k ◦ σ ◦ … ◦ σ ◦ T_1(x), each T_j affine.

5
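The cross-entropy loss on this slide is small when the predicted probability p agrees with the label y and large when it does not; a direct transcription:

```python
import math

# The slide's cross-entropy loss for binary classification:
# ell(p, y) = -y*log(p) - (1 - y)*log(1 - p), with p in (0, 1) and y in {0, 1}.
def cross_entropy(p, y):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

print(cross_entropy(0.9, 1))   # confident, correct prediction: small loss
print(cross_entropy(0.1, 1))   # confident, wrong prediction: large loss
```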
-
Function parameterization

We assume the family F (the statistician's hypothesis class) is parameterized: there exists f such that

  F = {f(x, θ) : θ ∈ Θ ⊆ [−1, 1]^N}.

Thus, the problem becomes

  min_{θ∈Θ}  (1/D) ∑_{i=1}^D ℓ(f(x̂_i, θ), ŷ_i)

6
-
What we know for Neural Nets
-
Neural Networks

• D data points (x̂_i, ŷ_i), 1 ≤ i ≤ D, x̂_i ∈ R^n, ŷ_i ∈ R^m
• f = T_{k+1} ◦ σ ◦ T_k ◦ σ ◦ … ◦ σ ◦ T_1
• Each T_i affine: T_i(y) = A_i y + b_i
• A_1 is n × w, A_{k+1} is w × m, A_i is w × w otherwise.

[figure: feed-forward network with input dimension n, hidden layers of width w, output dimension m]

7
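A minimal sketch (not the paper's code) of a forward pass through this architecture with σ the ReLU max{0, t}; sizes and weights are arbitrary, and matrices are stored in (output, input) shape so that layer i computes A_i y + b_i:

```python
import numpy as np

# f = T_{k+1} o sigma o T_k o ... o sigma o T_1, with sigma = ReLU.
def forward(x, weights, biases):
    """weights = [A_1, ..., A_{k+1}], biases = [b_1, ..., b_{k+1}].
    Here A_1 has shape (w, n), A_2..A_k have shape (w, w), A_{k+1} has (m, w)."""
    y = x
    for A, b in zip(weights[:-1], biases[:-1]):
        y = np.maximum(0.0, A @ y + b)    # affine map followed by ReLU
    A, b = weights[-1], biases[-1]
    return A @ y + b                      # no activation on the output layer

rng = np.random.default_rng(1)
n, w, m, k = 4, 5, 2, 3                   # k hidden layers of width w
dims = [n] + [w] * k + [m]
Ws = [rng.uniform(-1, 1, (dims[i + 1], dims[i])) for i in range(k + 1)]
bs = [rng.uniform(-1, 1, dims[i + 1]) for i in range(k + 1)]
print(forward(rng.uniform(-1, 1, n), Ws, bs))
```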
-
Hardness Results

Theorem (Blum and Rivest 1992)
Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ ∈ {absolute value, 2-norm squared} and σ a threshold function. Then training is NP-hard even in this simple network:

[figure: the simple one-hidden-layer network in question]

Theorem (Boob, Dey and Lan 2018)
Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ a norm and σ(t) = max{0, t} a ReLU activation. Then training is NP-hard in the same network.

8
-
Exact Training Complexity

Theorem (Arora, Basu, Mianjy and Mukherjee 2018)
If k = 1 (one "hidden layer"), m = 1 and ℓ is convex, there is an exact training algorithm of complexity

  O( 2^w D^{nw} poly(D, n, w) )

This is polynomial in the size of the data set, for fixed n, w.

Also in that paper:
"we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size"

"Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack"

9
-
What we’ll prove
There exists a polytope:
whose size depends linearly on D that encodes approximately all possibletraining problems coming from (x̂i, ŷi)Di=1 ⊆ [−1, 1](n+m)D.
Spoiler: Theory-only results
10
-
Our Hammer
-
Treewidth

Treewidth is a parameter that measures how tree-like a graph is.

Definition
Given a chordal graph G, we say its treewidth is ω if its clique number is ω + 1. For a general graph, the treewidth is the minimum of this quantity over all chordal completions.

• Trees have treewidth 1
• Cycles have treewidth 2
• K_n has treewidth n − 1

11
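Treewidth is equivalently the minimum width of a tree decomposition (bags of vertices arranged in a tree, covering every edge, with each vertex's bags contiguous). A small sketch verifying the bullet "cycles have treewidth 2" on the 6-cycle, using a fan-shaped decomposition:

```python
# Verify a tree decomposition of the 6-cycle with width 2 (bags of size 3).
cycle_edges = [(i, (i + 1) % 6) for i in range(6)]
# Fan decomposition: bag t holds {0, t, t+1}; the bags form a path (a tree).
bags = [{0, t, t + 1} for t in range(1, 5)]

# 1. every edge of the cycle lies inside some bag
assert all(any({u, v} <= B for B in bags) for u, v in cycle_edges)
# 2. for each vertex, the bags containing it are contiguous along the path
for v in range(6):
    idx = [i for i, B in enumerate(bags) if v in B]
    assert idx == list(range(idx[0], idx[-1] + 1))

print(max(len(B) for B in bags) - 1)  # width = max bag size - 1  -> 2
```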
-
Approximate optimization of well-behaved functions

Prototype problem:

  min  c^T x
  s.t. f_i(x) ≤ 0,  i = 1, …, m
       x ∈ [0, 1]^n

Toolset:
• Each f_i is "well-behaved": Lipschitz constant L_i over [0, 1]^n
• Intersection graph: an edge whenever two variables appear in the same f_i

For example:

  x_1 + x_2 + x_3 ≤ 1
  x_3 + x_4 ≥ 1
  x_4 · x_5 + x_6 ≤ 2

[figure: the intersection graph on vertices 1–6, with triangle {1,2,3}, edge {3,4}, and triangle {4,5,6}]

12
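The intersection-graph construction above can be sketched directly: take each constraint's set of variables and connect all pairs within it.

```python
from itertools import combinations

# Reproduce the slide's example: an edge between two variables whenever
# they appear in the same constraint f_i.
constraints = [
    {1, 2, 3},   # x1 + x2 + x3 <= 1
    {3, 4},      # x3 + x4 >= 1
    {4, 5, 6},   # x4*x5 + x6 <= 2
]
edges = set()
for support in constraints:
    edges.update(frozenset(e) for e in combinations(sorted(support), 2))

print(sorted(tuple(sorted(e)) for e in edges))
# -> [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
```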
-
Approximate optimization of well-behaved functions

Prototype problem:

  min  c^T x
  s.t. f_i(x) ≤ 0,  i = 1, …, m
       x ∈ [0, 1]^n

An extension of a result by Bienstock and M. 2018:

Theorem
Suppose the intersection graph has treewidth ω and let L = max_i L_i. Then, for every ϵ > 0 there is an LP relaxation of size

  O( (L/ϵ)^{ω+1} n )

that guarantees ϵ optimality and feasibility errors.

13
-
Application to ERM problem

We now apply the LP approximation result to:

  min_{θ∈Θ}  (1/D) ∑_{i=1}^D ℓ(f(x̂_i, θ), ŷ_i)

with Θ ⊆ [−1, 1]^N, x̂_i ∈ [−1, 1]^n and ŷ_i ∈ [−1, 1]^m. We use the epigraph formulation:

  min_{θ∈Θ}  (1/D) ∑_{i=1}^D L_i
  s.t.  L_i ≥ ℓ(f(x̂_i, θ), ŷ_i),  1 ≤ i ≤ D

Let L be the Lipschitz constant of g(x, y, θ) := ℓ(f(x, θ), y) over [−1, 1]^{n+m+N}.

14
-
Application to ERM problem

Theorem
For every ϵ > 0, ℓ, Θ ⊆ [−1, 1]^N and D, there is a polytope of size

  O( (2L/ϵ)^{N+n+m} D )

such that for every data set (X̂, Ŷ) = (x̂_i, ŷ_i)_{i=1}^D ⊆ [−1, 1]^{(n+m)D}, there is a face F_{X̂,Ŷ} such that optimizing (1/D) ∑_{i=1}^D L_i over F_{X̂,Ŷ} provides an ϵ-approximation to ERM with data X̂, Ŷ.

15
-
Proof Sketch

Every system of constraints of the type

  L_i ≥ ℓ(f(x_i, θ), y_i),  1 ≤ i ≤ D

has an intersection graph with the following structure:

[figure: a star of cliques — the hub {θ_1, …, θ_N} is joined to each of the D cliques {L_i, x_i, y_i}]

and has treewidth at most N + n + m.

16
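The bound can be sketched concretely (illustrative sizes, not from the talk): constraint i couples only θ_1, …, θ_N with L_i and the data variables x_i, y_i, so the bags B_i = {θ} ∪ {L_i, x_i, y_i}, joined in a path, form a tree decomposition whose width is N + n + m regardless of D.

```python
# Width of the star-of-cliques tree decomposition is independent of D.
N, n, m, D = 3, 2, 1, 5
theta = {f"theta{j}" for j in range(N)}

def bag(i):
    data_i = {f"x{i}_{j}" for j in range(n)} | {f"y{i}_{j}" for j in range(m)}
    return theta | {f"L{i}"} | data_i

bags = [bag(i) for i in range(D)]
# constraint i's variables all sit inside bag i, and the shared theta
# variables appear in every bag (hence contiguous along the path)
assert all(theta <= B for B in bags)
width = max(len(B) for B in bags) - 1
print(width)  # -> N + n + m = 6
```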
-
LP size details

Thus the LP size given by the treewidth,

  O( (L/ϵ)^{ω+1} n ),

becomes

  O( (2L/ϵ)^{N+n+m} D )

The key lies in the fact that D does not add to the treewidth.

Different architectures → different N and L.

17
-
Architecture-Specific Consequences
-
Fully connected DNN, ReLU activations, quadratic loss

For any k, n, m, w, ϵ there is a uniform LP of size

  O( (2^{k+1} m n w k² / ϵ)^{N+n+m} D )

with the same guarantees: ϵ-approximation and data-dependent faces.

Core of the proof: in a DNN with k hidden layers and quadratic loss, the Lipschitz constant of g(x, y, θ) over [−1, 1]^{n+m+N} is O(2^k m n w k²).

18
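As a hedged numerical illustration (toy sizes, biases omitted for brevity; this probes rather than proves the stated bound), one can lower-bound the Lipschitz constant of g by sampling difference quotients over the box:

```python
import numpy as np

# Empirically probe the Lipschitz constant of
# g(x, y, theta) = ||f(x, theta) - y||^2 for a tiny ReLU network.
# Sampled difference quotients give a LOWER bound on the true constant L.
rng = np.random.default_rng(3)
n, w, m, k = 2, 3, 1, 2
dims = [n] + [w] * k + [m]
shapes = [(dims[i + 1], dims[i]) for i in range(k + 1)]
n_params = sum(r * c for r, c in shapes)

def g(z):
    x, y, th = z[:n], z[n:n + m], z[n + m:]
    out, pos = x, 0
    for layer, (r, c) in enumerate(shapes):
        A = th[pos:pos + r * c].reshape(r, c)
        pos += r * c
        out = A @ out
        if layer < k:                      # ReLU on all but the output layer
            out = np.maximum(0.0, out)
    return float(np.sum((out - y) ** 2))

dim = n + m + n_params
quotients = [abs(g(z1) - g(z2)) / np.linalg.norm(z1 - z2)
             for z1, z2 in (rng.uniform(-1, 1, (2, dim)) for _ in range(500))]
print(max(quotients))                      # finite: g is Lipschitz on the box
```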
-
Comparison with Arora et al.

In the Arora, Basu, Mianjy and Mukherjee setting: k = 1, m = 1 and N ≈ nw.

  Arora et al. running time:  O( 2^w D^{nw} poly(D, n, w) )
  Uniform LP size:            O( (4nw/ϵ)^{(n+1)(w+1)} D )

Other differences: exactness, boundedness, convexity vs. Lipschitz-ness, uniformity

19
-
Last comments

• The results can be improved by considering the sparsity of the network itself.
• One can obtain previously unknown complexity results (ResNets, Convolutional NNs, etc.)
• Training using this approach generalizes: using enough¹ i.i.d. data points we get an approximation to the "true" Risk Minimization problem. Our results improve on the best approximations to this problem as well.

¹ depends on L and ϵ

20
-
Still Open and Future Work

• It is unknown if the dependency on w or k can be improved
• A better LP size can be obtained by assuming more about the input data or the nature of the problem
• We would like to combine these ideas with empirically efficient methods

21
-
Thank you!
21
-
One other improvement

If we denote by G the underlying Neural Network graph, we can improve the exponent in

  O( (nw/ϵ)^{poly(n,k,w,m)} D )

using the treewidth tw(G) of G and its maximum degree ∆(G). More specifically, one can obtain a uniform LP of size

  O( (nw/ϵ)^{O(k·tw(G)·∆(G))} (|E(G)| + D) )

22