Hanie Sedghi, Research Scientist at Allen Institute for Artificial Intelligence, at MLconf Seattle 2017
TRANSCRIPT
Beating Perils of Non-convexity: Guaranteed Training of Neural Networks
Hanie Sedghi
Allen Institute for AI
Joint work with Majid Janzamin (Twitter) and Anima Anandkumar (UC Irvine)
Introduction

Training Neural Networks
Tremendous practical impact with deep learning.
Highly non-convex optimization
Algorithm: backpropagation.
Backpropagation can get stuck in bad local optima
Convex vs. Non-convex Optimization
Most work is on convex analysis... but most problems are non-convex!
Image taken from https://www.facebook.com/nonconvex
Convex vs. Non-convex Optimization
One global optimum vs. multiple local optima
In high dimensions, possibly exponentially many local optima
How to deal with non-convexity?
Toy Example
Goal: binary classification from labeled input samples (y = 1 or y = −1)
[Figure: a two-layer network (inputs x₁, x₂ pass through σ(·) units to the output y), showing a local optimum and the global optimum reached by backpropagation]
Example: bad local optima
Train a feedforward neural network with ReLU activation on MNIST.
Local optima lead to wrong classification!
Bad initialization cannot be recovered by more iterations.
Bad initialization cannot be resolved with depth.
Bad initialization can hurt networks with ReLU or sigmoid.
Swirszcz, Czarnecki, and Pascanu, Local minima in training of deep networks, ’16
Outline
Introduction
Guaranteed Training of Neural Networks
  Algorithm
  Error Analysis
  General Framework and Extension to RNNs

Guaranteed Training of Neural Networks

Three Main Components
[Diagram: the three components of the approach, introduced in turn below: tensor decomposition, the method of moments, and score functions]
Guaranteed Learning through Tensor Methods
Replace the objective function with the best tensor decomposition:
argmin_θ ‖T(x) − T(θ)‖
T(x): empirical tensor; T(θ): low-rank tensor based on θ
This preserves the global minimum.
Finding the globally optimal tensor decomposition: simple algorithms succeed under mild and natural conditions. (Anandkumar et al. '14; Anandkumar, Ge and Janzamin '14)
Background: Tensor Decomposition
Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l)
CANDECOMP/PARAFAC (CP) decomposition:
T = Σ_{j∈[k]} wⱼ · aⱼ ⊗ bⱼ ⊗ cⱼ ∈ R^{d×d×d},  aⱼ, bⱼ, cⱼ ∈ S^{d−1}
[Figure: Tensor T = w₁ · a₁ ⊗ b₁ ⊗ c₁ + w₂ · a₂ ⊗ b₂ ⊗ c₂ + ...]
Algorithms: Alternating Least Squares (ALS), tensor power iteration, ...
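The algorithms named above are standard; as a concrete illustration, here is a minimal NumPy sketch of tensor power iteration with deflation on a symmetric CP tensor. It is a sketch under simplifying assumptions (an exactly low-rank symmetric tensor with orthonormal components; the sizes d, k and the weights are arbitrary), not the exact procedure analyzed in the talk.

```python
import numpy as np

def rank1(w, a, b, c):
    # Rank-1 tensor w * (a ⊗ b ⊗ c), i.e. T[i, j, l] = w * a[i] * b[j] * c[l]
    return w * np.einsum('i,j,l->ijl', a, b, c)

def power_iteration(T, n_iter=200, seed=0):
    # Dominant robust eigenvector of a symmetric 3rd-order tensor
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijl,j,l->i', T, v, v)   # tensor-vector-vector product T(I, v, v)
        v /= np.linalg.norm(v)
    w = np.einsum('ijl,i,j,l->', T, v, v, v)   # weight T(v, v, v)
    return w, v

# Build T = sum_j w_j * a_j ⊗ a_j ⊗ a_j with orthonormal components a_j
d, k = 10, 3
A = np.linalg.qr(np.random.default_rng(1).standard_normal((d, d)))[0][:, :k]
w_true = np.array([3.0, 2.0, 1.0])
T = sum(rank1(w_true[j], A[:, j], A[:, j], A[:, j]) for j in range(k))

# Greedy CP decomposition: recover one component, deflate, repeat
for j in range(k):
    w, v = power_iteration(T, seed=j)
    print(round(w, 3), np.abs(A.T @ v).round(3))  # one inner product near 1 per step
    T = T - rank1(w, v, v, v)
```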
Three Main Components: Method of Moments
Method-of-Moments for Neural Networks
Supervised setting: observing {(xᵢ, yᵢ)}
Non-linear transformations via the activation function σ(·)
Random x and y; moment possibilities: E[y ⊗ y], E[y ⊗ x], ...
[Figure: two-layer network with inputs x₁, x₂, x₃, first-layer weights A₁, σ(·) units, and output y]
If σ(·) is linear: A₁ is linearly observed in the output y. ✓
If σ(·) is non-linear: E[y ⊗ x] = E[σ(A₁⊤x) ⊗ x], which is no linear transformation of A₁. ✗
One solution: linearization by using a derivative operator:
σ(A₁⊤x) --derivative--> σ′(·)A₁⊤
E[y ⊗ φ(x)] = E[∇ₓy],  φ(·) = ?
Moments of a Neural Network
E[y|x] := f(x) = a₂⊤σ(A₁⊤x)
[Figure: two-layer network with inputs x₁, x₂, x₃, first-layer weights A₁, σ(·) units, second-layer weights a₂, and output y]
Linearization using the derivative operator φₘ(x), the m-th order derivative operator.
The cross-moments E[y · φ₁(x)], E[y · φ₂(x)], E[y · φ₃(x)] are pictured on the slides as sums of rank-1 terms: with (A₁)ⱼ the j-th column of A₁, they are a weighted sum of the (A₁)ⱼ, a sum of rank-1 matrices (A₁)ⱼ ⊗ (A₁)ⱼ, and a sum of rank-1 tensors (A₁)ⱼ ⊗ (A₁)ⱼ ⊗ (A₁)ⱼ, respectively.
Why are tensors required?
Matrix decomposition recovers the subspace, not the actual weights.
Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.
Derivative Operator: Score Function Transformations
Continuous x with pdf p(·):
S₁(x) := −∇ₓ log p(x)
m-th order score function:
Sₘ(x) := (−1)^m ∇^(m) p(x) / p(x)
Input x ∈ R^d, outputs S₁(x) ∈ R^d, S₂(x) ∈ R^{d×d}, S₃(x) ∈ R^{d×d×d}
Theorem (Score function property, JSA'14)
Providing derivative information: let E[y|x] := f(x); then
E[y ⊗ Sₘ(x)] = E[∇ₓ^(m) f(x)].
"Score Function Features for Discriminative Learning: Matrix and Tensor Framework" by M. Janzamin, H. S. and A. Anandkumar, Dec. 2014.
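For a standard Gaussian input, S₁(x) = x, so the m = 1 case of the theorem (Stein's identity) can be checked numerically. The sketch below is my own illustration; the tanh activation, the layer sizes, and the sample count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 3, 200_000

A1 = rng.standard_normal((d, k))                    # first-layer weights
a2 = rng.standard_normal(k)                         # second-layer weights
f = lambda X: np.tanh(X @ A1) @ a2                  # f(x) = a2' tanh(A1' x); rows of X are samples

X = rng.standard_normal((n, d))                     # x ~ N(0, I)  =>  S1(x) = -grad log p(x) = x
y = f(X)

lhs = (y[:, None] * X).mean(axis=0)                 # E[y ⊗ S1(x)]
grads = ((1 - np.tanh(X @ A1) ** 2) * a2) @ A1.T    # row i holds grad_x f(x_i)
rhs = grads.mean(axis=0)                            # E[grad_x f(x)]

print(np.abs(lhs - rhs).max())                      # small Monte-Carlo error, on the order of 1e-2
```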
Three Main Components: Method of Moments, and Probabilistic Models & Score Functions
NN-LIFT: Neural Network-LearnIng using Feature Tensors
Input: x ∈ R^d, with third-order score function S₃(x) ∈ R^{d×d×d}
Estimate the cross-moment (1/n) Σᵢ yᵢ ⊗ S₃(xᵢ) from the labeled data {(xᵢ, yᵢ)}.
CP tensor decomposition of this moment: the rank-1 components are the estimates of the columns of A₁.
Fourier technique ⇒ b₁ (bias of the first layer)
Linear regression ⇒ a₂, b₂ (parameters of the last layer)
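Putting the pieces together, here is a sketch of the first two NN-LIFT steps (form the cross-moment, then extract one rank-1 direction by tensor power iteration) in a toy setting. It assumes Gaussian input, so S₃ has the closed Hermite form used below, plus orthonormal columns of A₁ and noiseless labels; the bias and last-layer steps are only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 6, 2, 200_000

A1 = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :k]   # orthonormal columns (assumption)
a2 = np.array([1.5, 1.0])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = rng.standard_normal((n, d))                  # x ~ N(0, I)
y = sigmoid(X @ A1) @ a2                         # noiseless labels from a 2-layer net

# For N(0, I): S3(x)[i,j,l] = x_i x_j x_l - x_i δ_jl - x_j δ_il - x_l δ_ij
I = np.eye(d)
T = (np.einsum('ni,nj,nl,n->ijl', X, X, X, y)
     - np.einsum('ni,jl,n->ijl', X, I, y)
     - np.einsum('nj,il,n->ijl', X, I, y)
     - np.einsum('nl,ij,n->ijl', X, I, y)) / n   # empirical (1/n) Σ y_i ⊗ S3(x_i)

# CP step: tensor power iteration pulls out a rank-1 direction ≈ a column of A1
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
for _ in range(100):
    v = np.einsum('ijl,j,l->i', T, v, v)
    v /= np.linalg.norm(v)
print(np.abs(A1.T @ v).round(2))                 # one entry near 1 (up to sample noise)

# Remaining NN-LIFT steps (not shown): Fourier technique for b1;
# linear regression of y on sigmoid(X @ A1_hat + b1_hat) for a2, b2.
```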
Error Analysis
Estimation Error Bound
Theorem (JSA'15)
Two-layer NN, realizable setting
Full column rank assumption on the weight matrix A₁
With number of samples n = poly(d, k), we have w.h.p.
|f̂(x) − f(x)|² ≤ O(1/n).
"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. S. and A. Anandkumar, June 2015.
Risk Bound
The generalization error of a neural network decomposes into:
Approximation error in fitting the target function to a neural network
Estimation error in estimating the weights of a fixed neural network
Known: continuous functions with compact domain can be arbitrarily well approximated by neural networks with one hidden layer.
Approximation Error
The approximation error is related to the Fourier spectrum of f(x) (Barron '93).
E[y|x] = f(x)
F(ω) := ∫_{R^d} f(x) e^{−j⟨ω,x⟩} dx
C_f := ∫_{R^d} ‖ω‖₂ · |F(ω)| dω
Approximation error ≤ C_f / √k
[Figure: f(x), its Fourier transform F(ω), and the weighted spectrum ‖ω‖ · |F(ω)|]
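As a worked example of C_f (my own, using Barron's normalized Fourier representation rather than the unnormalized integral above): a single sinusoid has Barron constant equal to the norm of its frequency.

```latex
% f(x) = \cos\langle w, x \rangle
%      = \tfrac{1}{2} e^{\,j\langle w, x\rangle} + \tfrac{1}{2} e^{-j\langle w, x\rangle},
% so its Fourier representation puts mass 1/2 at \omega = w and 1/2 at \omega = -w. Hence
C_f \;=\; \tfrac{1}{2}\,\|w\|_2 \;+\; \tfrac{1}{2}\,\|{-w}\|_2 \;=\; \|w\|_2,
% and the approximation error with k neurons is at most \|w\|_2 / \sqrt{k}:
% low-frequency targets need few neurons, high-frequency targets need many.
```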
Our Main Result
Theorem (JSA'15)
Approximating an arbitrary function f(x) with bounded C_f:
n samples, d input dimension, k number of neurons; assume C_f is small.
E_x[|f̂(x) − f(x)|²] ≤ O(C_f²/k) + O(1/n).
Polynomial sample complexity n in terms of the dimensions d, k.
Computational complexity same as SGD with enough parallel processors.
"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. S. and A. Anandkumar, June 2015.
Experiment: NN-LiFT vs. Backprop
MNIST dataset
Use a Denoising Auto-Encoder (DAE) to estimate the score function.
The DAE learns the first-order score function.
We learn higher-order score functions recursively. (JSA '14)
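The DAE-to-score connection invoked here is the standard one (Alain and Bengio '14): a denoiser r(·) trained with corruption noise of scale σ satisfies (r(x) − x)/σ² ≈ ∇ₓ log p(x) = −S₁(x). Below is a minimal sketch using scikit-learn's MLPRegressor as the denoiser; the toy Gaussian data, the architecture, and the noise level are arbitrary assumptions, and the estimate is only as good as the denoiser fit.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n, d, noise = 20_000, 2, 0.3

X = rng.standard_normal((n, d))                      # toy data: x ~ N(0, I), so S1(x) = x exactly
X_noisy = X + noise * rng.standard_normal((n, d))

# Denoising auto-encoder: regress the clean sample on the corrupted one
dae = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
dae.fit(X_noisy, X)

# Alain & Bengio '14: (r(x) - x) / noise^2 ≈ grad_x log p(x) = -S1(x)
X_test = rng.standard_normal((1_000, d))
S1_hat = -(dae.predict(X_test) - X_test) / noise**2
print(np.mean(np.abs(S1_hat - X_test)))              # small when the DAE fit is good
```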
Experiment Results: NN-LiFT vs. Backprop
MNIST dataset, with the DAE used to estimate the score function.
NN-LiFT outperforms backpropagation with SGD even for low hidden dimensions: a 10% difference with 128 neurons.
Using Adam is not enough for backpropagation to win in high dimensions.
If we down-sample the labeled data, NN-LiFT outperforms Adam by 6-12%.
General Framework and Extension to RNNs
FEAST: Feature ExtrAction using Score function Tensors
Mixture of Generalized Linear Models (GLMs):
E[y|x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩)
Mixture of linear regressions:
E[y|x] = Σᵢ πᵢ⟨uᵢ, x⟩ + ⟨bᵢ, h⟩
"Provable Tensor Methods for Learning Mixtures of Generalized Linear Models" by H. S., M. Janzamin and A. Anandkumar, AISTATS 2016.
Guaranteed Training of Recurrent Neural Networks
[Figure: three architectures, each with input, hidden layer, and output: (a) NN at any t; (b) IO-RNN with inputs xₜ, xₜ₊₁; (c) BRNN with inputs xₜ₋₁, xₜ, xₜ₊₁, outputs yₜ₋₁, yₜ, yₜ₊₁, and backward states zₜ₋₁, ...]
IO-RNN: E[yₜ | hₜ] = A₂⊤hₜ,  hₜ = f(A₁xₜ + Uhₜ₋₁)
BRNN: E[yₜ | hₜ, zₜ] = A₂⊤[hₜ; zₜ],  hₜ = f(A₁xₜ + Uhₜ₋₁),  zₜ = g(B₁xₜ + Vzₜ₊₁)
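To make the IO-RNN and BRNN equations above concrete, here is a minimal NumPy forward pass; the dimensions, tanh activations, and zero initial states are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, dh, T = 4, 3, 10                                  # input dim, hidden dim, sequence length
A1, U = rng.standard_normal((dh, d)), rng.standard_normal((dh, dh))
B1, V = rng.standard_normal((dh, d)), rng.standard_normal((dh, dh))
A2 = rng.standard_normal((2 * dh, 1))                # BRNN output layer reads [h_t; z_t]
f = g = np.tanh
xs = rng.standard_normal((T, d))

# IO-RNN forward chain: h_t = f(A1 x_t + U h_{t-1}), starting from h_0 = 0
h, hs = np.zeros(dh), []
for t in range(T):
    h = f(A1 @ xs[t] + U @ h)
    hs.append(h)

# BRNN adds a backward chain: z_t = g(B1 x_t + V z_{t+1}), starting from z_{T+1} = 0
z, zs = np.zeros(dh), [None] * T
for t in reversed(range(T)):
    z = g(B1 @ xs[t] + V @ z)
    zs[t] = z

# Outputs: E[y_t | h_t, z_t] = A2' [h_t; z_t]  (an IO-RNN would read A2' h_t alone)
ys = np.array([A2.T @ np.concatenate([hs[t], zs[t]]) for t in range(T)])
print(ys.ravel().round(3))
```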
Challenges:
Input sequence, not i.i.d.
Learning the weight matrices between hidden layers
Need to ensure bounded state evolution
Approach:
Markovian evolution of the input sequence
Extension of score functions to Markov chains
Polynomial activation functions
"Training Input-Output Recurrent Neural Networks through Spectral Methods" by H. S. and A. Anandkumar, Preprint 2016.
References
M. Janzamin, H. Sedghi and A. Anandkumar, "Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods," 2015.
M. Janzamin*, H. Sedghi* and A. Anandkumar, "Score Function Features for Discriminative Learning: Matrix and Tensor Framework," 2014.
H. Sedghi and A. Anandkumar, "Provable Methods for Training Neural Networks with Sparse Connectivity," NIPS Deep Learning Workshop 2014, ICLR Workshop 2015.
H. Sedghi, M. Janzamin and A. Anandkumar, "Provable Tensor Methods for Learning Mixtures of Generalized Linear Models," AISTATS 2016.
H. Sedghi and A. Anandkumar, "Training Input-Output Recurrent Neural Networks through Spectral Methods," 2016.
Conclusion and Future Work
Summary
For the first time, a guaranteed risk bound for neural networks
Efficient sample and computational complexity
Higher-order score functions as new features, useful in general for recovering new discriminative features
Extension to input sequences for training RNNs and BRNNs
Future Work
Extension to training Convolutional Neural Networks
Empirical performance
Regularization analysis

Thank You!