
Page 1

The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations

Yuxin Chen (Princeton)   Emmanuel Candès (Stanford)

Y. Chen, E. J. Candès, Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822–883, May 2017

Page 2

Agenda

1. The power of nonconvex optimization in solving random quadratic systems of equations (Aug. 28)

2. Random initialization and implicit regularization in nonconvex statistical estimation (Aug. 29)

3. The projected power method: an efficient nonconvex algorithm for joint discrete assignment from pairwise data (Sep. 3)

4. Spectral methods meet asymmetry: two recent stories (Sep. 4)

5. Inference and uncertainty quantification for noisy matrix completion (Sep. 5)

Page 3

[Figure: nonconvex optimization sits at the interface of (large-scale) optimization and (high-dimensional) statistics]

Page 4

Nonconvex problems are everywhere

Maximum likelihood estimation is usually nonconvex

maximize_x  ℓ(x; data)   → may be nonconcave
subject to  x ∈ S        → may be nonconvex

• low-rank matrix completion
• robust principal component analysis
• graph clustering
• dictionary learning
• blind deconvolution
• learning neural nets
• ...


Page 6

Nonconvex optimization may be super scary

There may be bumps everywhere and exponentially many local optima

e.g. 1-layer neural net (Auer, Herbster, Warmuth ’96; Vu ’98)

Page 7

Example: solving quadratic programs is hard

Finding the maximum cut in a graph amounts to

maximize_x  x^⊤ W x
subject to  x_i² = 1,  i = 1, …, n

Fig credit: coding horror


Page 9

One strategy: convex relaxation

Can relax into convex problems by

• finding convex surrogates (e.g. compressed sensing, matrix completion)

• lifting into higher dimensions (e.g. Max-Cut)

Page 10

Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09

minimize_M  rank(M)   subject to data constraints

   ⇓ convex surrogate

minimize_M  ‖M‖_* (nuclear norm)   subject to data constraints

A robust variant is used every day by Netflix

Problem: operates in the full matrix space even though the target X is low-rank


Page 13

Example of lifting: Max-Cut
— Goemans, Williamson ’95

maximize_x  x^⊤ W x
subject to  x_i² = 1,  i = 1, …, n

Letting X = xx^⊤:

maximize_X  ⟨X, W⟩
subject to  X_ii = 1,  i = 1, …, n
            X ⪰ 0
            rank(X) = 1   (dropped to obtain the convex SDP relaxation)

Problem: explosion in dimensions (ℝⁿ → ℝⁿˣⁿ)
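
For concreteness, here is a minimal sketch of the lifted SDP above using cvxpy (an assumption: cvxpy with its bundled SDP solver is available), applied to a hypothetical random symmetric weight matrix W and followed by the standard random-hyperplane rounding back to a feasible ±1 vector. This is illustrative only, not code from the slides.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 20
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                                # hypothetical symmetric weight matrix

# Lifted SDP: maximize <X, W> s.t. diag(X) = 1, X PSD (rank-1 constraint dropped)
X = cp.Variable((n, n), PSD=True)
prob = cp.Problem(cp.Maximize(cp.trace(W @ X)), [cp.diag(X) == 1])
prob.solve()

# Random-hyperplane rounding back to a feasible x with x_i in {-1, +1}
evals, evecs = np.linalg.eigh(X.value)
G = evecs * np.sqrt(np.clip(evals, 0, None))     # factor X ~ G @ G.T
x = np.sign(G @ rng.standard_normal(n))
x[x == 0] = 1.0
print(x @ W @ x)                                 # objective value of the rounded x
```

The SDP variable lives in the n×n space, which is exactly the dimension explosion the slide warns about.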


Page 17

How about optimizing nonconvex problems directly without lifting?

Page 18

A case study: solving random quadratic systems of equations

Page 19

Solving quadratic systems of equations

[Figure: a design matrix A applied to a signal x, followed by entrywise squared magnitudes, producing y = |Ax|²]

Solve for x ∈ ℂⁿ given m quadratic equations

y_k ≈ |⟨a_k, x⟩|²,   k = 1, …, m
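
As a point of reference for the rest of the deck, a minimal numpy sketch that generates such a system under a real-valued i.i.d. Gaussian design (the dimensions and the noiseless model are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 800                       # unknowns and equations (here m = 8n)
x = rng.standard_normal(n)            # ground-truth signal (real case for simplicity)
A = rng.standard_normal((m, n))       # i.i.d. Gaussian design; row k is a_k^T
y = (A @ x) ** 2                      # quadratic measurements y_k = |<a_k, x>|^2
```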

Page 20

Motivation: a missing phase problem in imaging science

Detectors record intensities of diffracted rays
• x(t₁, t₂) → Fourier transform x̂(f₁, f₂)

intensity of electrical field:  |x̂(f₁, f₂)|² = |∫ x(t₁, t₂) e^{−i2π(f₁t₁ + f₂t₂)} dt₁ dt₂|²

Phase retrieval: recover the true signal x(t₁, t₂) from intensity measurements

Page 21

Motivation: latent variable models

Example: mixture of regressions

• Samples {(y_k, x_k)}: each drawn from one of two unknown regressors β and −β

  y_k ≈ ⟨x_k, β⟩    with prob. 0.5
  y_k ≈ ⟨x_k, −β⟩   else

  (labels: latent variables)

  — equivalent to observing |y_k|² ≈ |⟨x_k, β⟩|²

• Goal: estimate β


Page 23

Motivation: learning neural nets with quadratic activation
— Soltanolkotabi, Javanmard, Lee ’17, Li, Ma, Zhang ’17

[Figure: a one-hidden-layer network with input layer a, hidden layer with weights X, and a single output summing the hidden units]

input features: a;   weights: X = [x₁, …, x_r]

output: y = Σ_{i=1}^r σ(a^⊤ x_i)  with σ(z) = z²,  i.e.  y = Σ_{i=1}^r (a^⊤ x_i)²

Page 24

An equivalent view: low-rank factorization

Lifting: introduce X = xx* to linearize the constraints

y_k = |a_k* x|² = a_k* (xx*) a_k   ⟹   y_k = a_k* X a_k

find   X ⪰ 0
s.t.   y_k = a_k* X a_k,  k = 1, …, m
       rank(X) = 1

Works well if {a_k} are random, but entails a huge increase in dimensions


Page 28

Prior art (before our work)
n: # unknowns;  m: sample size (# equations);  y = |Ax|²,  A ∈ ℝ^{m×n}

[Figure: prior methods plotted by sample complexity (# equations: n, n log n, n log³ n) against computational cost (n, mn, mn²): convex relaxation (computationally infeasible at scale), Wirtinger flow, and alternating minimization with fresh samples at each iteration]

This work: random quadratic systems are solvable in linear time!
✓ minimal sample size
✓ optimal statistical accuracy


Page 33

A glimpse of our results
n: # unknowns;  m: sample size (# equations);  y = |Ax|²,  A ∈ ℝ^{m×n}

[Figure: the same chart as on the previous slide, with our algorithm added at minimal sample complexity and linear computational cost, alongside convex relaxation, Wirtinger flow, and alt-min with fresh samples]

This work: random quadratic systems are solvable in linear time!
✓ minimal sample size
✓ optimal statistical accuracy


Page 35

A first impulse: maximum likelihood estimate

maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)

• Gaussian data: y_k ∼ |a_k* x|² + N(0, σ²)

  ℓ_k(z) = −(y_k − |a_k* z|²)²

• Poisson data: y_k ∼ Poisson(|a_k* x|²)

  ℓ_k(z) = −|a_k* z|² + y_k log |a_k* z|²

Problem: −ℓ is nonconvex, with many local stationary points
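
A small sketch of the two per-sample log-likelihoods above in the real-valued case; A and y are the hypothetical design matrix and measurements from the earlier data-generation sketch.

```python
import numpy as np

def loglik_gaussian(z, A, y):
    """(1/m) * sum_k l_k(z) with l_k(z) = -(y_k - |a_k^T z|^2)^2."""
    r = y - (A @ z) ** 2
    return -np.mean(r ** 2)

def loglik_poisson(z, A, y, eps=1e-12):
    """(1/m) * sum_k l_k(z) with l_k(z) = -|a_k^T z|^2 + y_k * log|a_k^T z|^2."""
    q = (A @ z) ** 2
    return np.mean(-q + y * np.log(q + eps))
```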


Page 39

Wirtinger flow: Candès, Li, Soltanolkotabi ’14

• Spectral initialization: z⁰ ← leading eigenvector of

  (1/m) Σ_{k=1}^m y_k a_k a_k*

• Iterative refinement: for t = 0, 1, …

  z^{t+1} = z^t + μ_t ∇ℓ(z^t)

Already a rich theory (see also Soltanolkotabi ’14, Ma, Wang, Chi, Chen ’17)
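
A minimal real-valued sketch of the two stages above: spectral initialization followed by gradient ascent on the Gaussian log-likelihood. The step size, iteration count, and rescaling of z⁰ are illustrative assumptions rather than the tuned choices in the cited papers.

```python
import numpy as np

def wirtinger_flow(A, y, iters=200, mu=0.1):
    m, n = A.shape
    # Spectral initialization: leading eigenvector of (1/m) sum_k y_k a_k a_k^T,
    # rescaled so that ||z0||^2 matches mean(y) (which concentrates around ||x||^2).
    Y = (A.T * y) @ A / m
    evals, evecs = np.linalg.eigh(Y)
    z = evecs[:, -1] * np.sqrt(np.mean(y))
    # Iterative refinement on the Gaussian loss of the previous slide; the vector
    # below is proportional to (1/m) sum_k grad l_k(z).
    for _ in range(iters):
        Az = A @ z
        grad = A.T @ ((y - Az ** 2) * Az) / m
        z = z + (mu / np.mean(y)) * grad          # normalized step size
    return z
```

Up to the unavoidable global sign, the output should approach ±x once m is a sufficiently large multiple of n.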

Page 40

Interpretation of spectral initialization

Spectral initialization: z⁰ ← leading eigenvector of

  Y := (1/m) Σ_{k=1}^m y_k a_k a_k*

• Rationale: E[Y] = I + 2xx*  (for ‖x‖₂ = 1) under Gaussian design, whose leading eigenvector is ±x

• Would succeed if Y → E[Y]
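
A quick numerical check of the rationale above in the real Gaussian case with ‖x‖₂ = 1: the empirical data matrix Y approaches E[Y] = I + 2xxᵀ only as m grows (the sizes below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                       # ||x||_2 = 1

for m in (10 * n, 1000 * n):
    A = rng.standard_normal((m, n))
    y = (A @ x) ** 2
    Y = (A.T * y) @ A / m                    # (1/m) sum_k y_k a_k a_k^T
    EY = np.eye(n) + 2 * np.outer(x, x)
    print(m, np.linalg.norm(Y - EY, 2))      # spectral-norm gap shrinks as m grows
```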


Page 43

Empirical performance of initialization (m = 12n)

[Image pair] Ground truth x ∈ ℝ^409600   vs.   Spectral initialization


Page 45

Improving initialization

Y = (1/m) Σ_k y_k a_k a_k*  is heavy-tailed and does not concentrate around E[Y] unless m ≫ n

[Figure: the quadratic forms a_k* Y a_k / ‖a_k‖² plotted against k for m = 6n; a few samples stand far above the bulk and above x* Y x]

Problem: large outliers y_k = |a_k* x|² bear too much influence

Solution: discard the large samples and run PCA on

  (1/m) Σ_k y_k a_k a_k* 1{|y_k| ≲ Avg{|y_l|}}

— improvable via more refined pre-processing (Wang, Giannakis, Eldar ’16, Lu, Li ’17, Mondelli, Montanari ’17):

  (1/m) Σ_k ρ(y_k) a_k a_k*,   e.g. a truncation ρ(y_k) = min{y_k, a}
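
A minimal sketch of the truncation fix described above: samples whose y_k exceeds a constant multiple of the average are dropped before forming the data matrix. The threshold multiple and the final rescaling are illustrative assumptions.

```python
import numpy as np

def truncated_spectral_init(A, y, thresh=3.0):
    """Leading eigenvector of (1/m) sum_k y_k a_k a_k^T 1{y_k <= thresh * mean(y)}."""
    m = A.shape[0]
    keep = np.abs(y) <= thresh * np.mean(np.abs(y))    # discard large samples
    Y = (A[keep].T * y[keep]) @ A[keep] / m
    evals, evecs = np.linalg.eigh(Y)
    return evecs[:, -1] * np.sqrt(np.mean(y))          # rescale to the typical signal size
```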


Page 50

Empirical performance of initialization (m = 12n)

[Image pair] Ground truth x ∈ ℝ^409600   vs.   Regularized spectral initialization

Page 51

Iterative refinement stage: search directions

Wirtinger flow:  z^{t+1} = z^t + (μ_t/m) Σ_{k=1}^m (y_k − |a_k^⊤ z^t|²)(a_k^⊤ z^t) a_k,
where the k-th summand is ∇ℓ_k(z^t) (up to constants)

Even in a local region around x (e.g. {z : ‖z − x‖₂ ≤ 0.1 ‖x‖₂}):

• −ℓ(·) is NOT strongly convex unless m ≫ n
• −ℓ(·) has a huge smoothness parameter

[Figure: the per-sample descent directions {−∇ℓ_k(z)} scatter widely around the direction from z to x]

Problem: the descent direction has large variability


Page 54

Our solution: variance reduction via proper trimming

More adaptive rule:

  z^{t+1} = z^t + (μ_t/m) Σ_{i=1}^m  (y_i − |a_i^⊤ z^t|²) / (a_i^⊤ z^t) · a_i · 1_{E1_i(z^t) ∩ E2_i(z^t)}

where
  E1_i(z) = { α_lb ≤ |a_i^⊤ z| / ‖z‖₂ ≤ α_ub }
  E2_i(z) = { |y_i − |a_i^⊤ z|²| ≤ (α_h/m) ‖y − A(zz^⊤)‖₁ · |a_i^⊤ z| / ‖z‖₂ }

Informally, z^{t+1} = z^t + (μ/m) Σ_{k∈T_t} ∇ℓ_k(z^t)

• T_t trims away excessively large gradient components

Slight bias + much reduced variance
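
A sketch of one trimmed gradient step in the real-valued case, following the adaptive rule displayed above; the truncation constants α_lb, α_ub, α_h and the step size are illustrative placeholders rather than the values recommended in the paper.

```python
import numpy as np

def twf_step(z, A, y, mu=0.2, a_lb=0.3, a_ub=5.0, a_h=5.0):
    m = A.shape[0]
    Az = A @ z
    resid = y - Az ** 2
    ratio = np.abs(Az) / np.linalg.norm(z)
    # E1: |a_i^T z| / ||z|| neither too small nor too large
    E1 = (ratio >= a_lb) & (ratio <= a_ub)
    # E2: residual not disproportionately large compared with the average residual
    E2 = np.abs(resid) <= (a_h / m) * np.sum(np.abs(resid)) * ratio
    keep = E1 & E2
    # Summands proportional to the Poisson-likelihood gradients, restricted to T_t
    grad = A[keep].T @ (resid[keep] / Az[keep]) / m
    return z + mu * grad
```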


Page 58

Larger step size μ_t is feasible

[Figure: gradient trajectories within the basin of attraction around x, starting from the initial guess z⁰, with and without trimming]

without trimming:  μ_t = O(1/n)
with trimming:     μ_t = O(1)

With better-controlled descent directions, one can proceed far more aggressively

Page 59

Summary: truncated Wirtinger flow (TWF)

1. Regularized spectral initialization: z⁰ ← leading eigenvector of

   (1/m) Σ_{k∈T₀} y_k a_k a_k*

2. Regularized gradient descent:

   z^{t+1} = z^t + μ_t ∇ℓ_tr(z^t),   where ∇ℓ_tr(z^t) := (1/m) Σ_{k∈T_t} ∇ℓ_k(z^t)

Key idea: adaptively discard high-leverage data
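
Putting the two stages together. `truncated_spectral_init` and `twf_step` refer to the hypothetical sketches given earlier; the distance below accounts for the unrecoverable global sign, matching the metric used in the guarantees that follow.

```python
import numpy as np

def dist(z, x):
    """min(||z - x||_2, ||z + x||_2): error up to the global sign ambiguity."""
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x))

def twf(A, y, iters=50):
    z = truncated_spectral_init(A, y)   # stage 1: regularized spectral initialization
    for _ in range(iters):
        z = twf_step(z, A, y)           # stage 2: regularized (trimmed) gradient iterations
    return z
```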

Page 60

Performance guarantees of TWF (noiseless data)

dist(z, x) := min{‖z − x‖₂, ‖z + x‖₂}

Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves

  dist(z^t, x) ≲ (1 − ρ)^t ‖x‖₂,   t = 0, 1, …

with high probability, provided that the sample size m ≳ n. Here 0 < ρ < 1 is a constant.

[Figure: iterates starting within the basin of attraction around x]

Start within the basin of attraction → linear convergence


Page 62

Computational complexity

A: the m × n matrix with rows a_k*, 1 ≤ k ≤ m

• Initialization: leading eigenvector → a few applications of A and A*

  Σ_{k∈T₀} y_k a_k a_k* = A* diag{y_k · 1_{k∈T₀}} A

• Iterations: one application of A and A* per iteration

  z^{t+1} = z^t + μ_t ∇ℓ_tr(z^t),   ∇ℓ_tr(z^t) = −(1/m) A* ν,   ν = 2 (|Az^t|² − y) / (Az^t) · 1_{T_t}

Approximate runtime: several tens of applications of A and A*
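
A sketch of the point above for the real-valued case: each iteration needs only one application of A and one of Aᵀ, plus O(m) elementwise work. The boolean vector `keep` stands in for the trimming indicator 1_{T_t} produced by the rule on Page 54.

```python
import numpy as np

def neg_trimmed_grad(z, A, y, keep):
    """Computes A^T nu with nu = 2(|Az|^2 - y)/(Az) * 1_T, so that
    grad_tr(z) = -(1/m) * A^T nu in the real case."""
    Az = A @ z                                             # one application of A
    nu = np.zeros(A.shape[0])
    nu[keep] = 2.0 * (Az[keep] ** 2 - y[keep]) / Az[keep]  # O(m) elementwise work
    return A.T @ nu                                        # one application of A^T
```

One refinement step then reads z ← z − (μ_t/m) · neg_trimmed_grad(z, A, y, keep).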


Page 65

Numerical surprise

• Conjugate gradient (CG): solve y = Ax      • Our algorithm: solve y = |Ax|²

For random quadratic systems (m = 8n):
comput. cost of our algorithm ≈ 4 × comput. cost of least squares

Page 66

Empirical performance

After regularized spectral initialization

After 50 TWF iterations


Page 68

Key convergence condition for gradient stage

If there are many samples, then for all z with dist(z, x) ≤ ε‖x‖₂:

  ⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖₂² + ‖∇ℓ(z)‖₂²

[Figure: the gradient ∇ℓ(z) is positively correlated with the direction x − z]

Page 69

Key convergence condition for gradient stage

If there are NOT many samples, i.e. m ≍ n, the same condition is no longer guaranteed for the plain gradient ∇ℓ(z): individual gradient components have too much variability.

Page 70

Key convergence condition for gradient stage

If there are NOT many samples, i.e. m ≍ n, the condition still holds for the trimmed gradient: for all z with dist(z, x) ≤ ε‖x‖₂,

  ⟨∇ℓ_tr(z), x − z⟩ ≳ ‖z − x‖₂² + ‖∇ℓ_tr(z)‖₂²

[Figure: the trimmed gradient ∇ℓ_tr(z) remains well aligned with x − z]

Page 71

Stability under noisy data

• Noisy data: y_k = |a_k* x|² + η_k

• Signal-to-noise ratio:

  SNR := Σ_k |a_k* x|⁴ / Σ_k η_k²  ≈  3m ‖x‖₂⁴ / ‖η‖₂²

• i.i.d. Gaussian design

Theorem (Soltanolkotabi). WF converges to the MLE.

Theorem (Chen, Candès). The relative error of TWF converges to O(1/√SNR).


Page 73

Relative MSE vs. SNR (Poisson data)

[Figure: relative MSE (dB) vs. SNR (dB) for n = 1000 and m = 6n, 8n, 10n; the curves are straight lines of slope ≈ −1]

Empirical evidence: relative MSE scales inversely with SNR

Page 74

This accuracy is nearly un-improvable (empirically)

Comparison with a genie-aided MLE (with sign information revealed):

  y_k ∼ Poisson(|a_k* x|²)  and  ε_k = sign(a_k* x)  (revealed by a genie)

[Figure: relative MSE (dB) vs. SNR (dB) for n = 100; truncated WF and the genie-aided MLE are nearly indistinguishable]

Little empirical loss due to the missing signs


Page 76

This accuracy is nearly un-improvable (theoretically)

• Poisson data: y_k ∼ (ind.) Poisson(|a_k* x|²)

• Signal-to-noise ratio:

  SNR ≈ Σ_k |a_k* x|⁴ / Σ_k Var(y_k) ≈ 3‖x‖₂²

Theorem (Chen, Candès). Under i.i.d. Gaussian design, for any estimator x̂,

  inf_{x̂} sup_{x: ‖x‖₂ ≥ log^{1.5} m}  E[ dist(x̂, x) | {a_k} ] / ‖x‖₂  ≳  1/√SNR,

provided that the sample size m ≍ n.


Page 78

Phaseless 3D computational imaging
— Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16

[Figure: a field source imaged by a radiating metasurface; only intensity measurements of a compressed frequency-domain waveform are recorded (Optics Express, vol. 24, no. 15, 2016)]

Measure intensities (with radiating metasurfaces) rather than complex signals for sub-centimeter wavelengths

[Figure: localization of a field source on a 10×10×10-voxel domain, reconstructed with and without phase information; x-, y-, and z-cuts of the reconstructed fields are nearly identical]

(red) phaseless reconstruction   (blue) reconstruction with phase¹

¹ This demonstration is carried out in the microwave range as a proof of concept


Page 80

No need of sample splitting

• Several prior works use sample splitting: fresh samples are required at each iteration; not practical, but much easier to analyze

  [Figure: iterates z⁰ → z¹ → … → z⁷, each step consuming a fresh batch of samples]

• Our work: reuse all samples in all iterations

  [Figure: iterates z⁰ → z¹ → … → z⁷, all steps driven by the same samples]


Page 82

A small sample of more recent works

• other optimal algorithms
  ◦ reshaped WF (Zhang et al.), truncated AF (Wang et al.), median-TWF (Zhang et al.)
  ◦ alt-min w/o resampling (Waldspurger)
  ◦ composite optimization (Duchi et al., Charisopoulos et al.)
  ◦ approximate message passing (Ma et al.)
  ◦ block coordinate descent (Barmherzig et al.)
  ◦ PhaseMax (Goldstein et al., Bahmani et al., Salehi et al., Dhifallah et al., Hand et al.)

• stochastic algorithms (Kolte et al., Zhang et al., Lu et al., Tan et al., Jeong et al.)

• improved WF theory: iteration complexity → O(log n · log(1/ε)) (Ma et al.)

• improved initialization (Lu et al., Wang et al., Mondelli et al.)

• random initialization (Chen et al.)

• structured quadratic systems (Cai et al., Soltanolkotabi, Wang et al., Yang et al., Qu et al.)

• geometric analysis (Sun et al., Davis et al.)

• low-rank generalization (White et al., Li et al., Vaswani et al.)

Page 83

Central message

• Simple nonconvex paradigms are surprisingly effective for computing the MLE
• Importance of statistical thinking (initialization)

Statistical accuracy vs. computational cost: the nonconvex procedure matches the statistical accuracy of convex relaxation at far lower computational cost

• Y. Chen, E. Candès, "Solving random quadratic systems of equations is nearly as easy as solving linear systems," Comm. Pure and Applied Math., 2017