
Page 1:

Distributed Subgradient Methods for Saddle-Point Problems

David Mateos-Núñez    Jorge Cortés

University of California, San Diego

{dmateosn,cortes}@ucsd.edu

Conference on Decision and Control, Osaka, Japan, December 17, 2015


Page 2:

Context: decentralized, peer-to-peer systems

Sensor networks Medical diagnosis

Formation control Recommender systems


Page 3:

General agenda for today

Review of (consensus-based) distributed convex optimization

Part 1:

Distributed optimization with separable constraints via agreement on the Lagrange multipliers

General saddle-point problems with explicit agreement

Part 2:

Convex-concave problems not arising from Lagrangians, e.g., with a strictly concave part

Distributed low-rank matrix completion through a saddle-point characterization of the nuclear norm


Page 4:

Review: consensus-based distributed convex optimization

x* ∈ arg min_{x∈R^d} ∑_{i=1}^N f^i(x)    (basic unconstrained problem)

Agent i has access to f^i

Agent i can share its estimate of x∗ with “neighboring” agents

[Network of 5 agents with local objectives f^1, ..., f^5]

A = [ 0    0    a13  a14  0
      a21  0    0    0    a25
      a31  a32  0    0    0
      a41  0    0    0    0
      0    a52  0    0    0 ]    (adjacency matrix)

Parallel computations: Tsitsiklis 84, Bertsekas and Tsitsiklis 95

Consensus: Jadbabaie et al. 03, Olfati-Saber, Murray 04, Boyd et al. 05

Distributed multi-agent optimization: A. Nedić and A. Ozdaglar 07
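To make the review concrete, here is a minimal sketch of the consensus-plus-subgradient iteration x_i ← ∑_j a_ij x_j − η_t (subgradient of f^i at x_i), in the spirit of the distributed methods just cited; the quadratic costs and ring topology below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Sketch: each agent mixes its estimate with its neighbors' estimates,
# then takes a local subgradient step, here for f^i(x) = ||x - c_i||^2
# (an illustrative choice whose optimizer is the mean of the c_i).

N, d = 5, 2
rng = np.random.default_rng(0)
c = rng.normal(size=(N, d))          # local targets; x* is their mean

# Row-stochastic mixing weights on an undirected ring (assumed topology)
A = np.zeros((N, N))
for i in range(N):
    A[i, i] = A[i, (i + 1) % N] = A[i, (i - 1) % N] = 1 / 3

x = np.zeros((N, d))                 # row i: agent i's estimate of x*
for t in range(1, 501):
    eta = 1 / np.sqrt(t)             # decreasing stepsize
    x = A @ x - eta * 2 * (x - c)    # mix with neighbors, then descend

print(np.linalg.norm(x - c.mean(axis=0), axis=1))  # distances to optimizer
```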

Page 5:

Review: the Laplacian matrix

L = diag(A1) − A
  = [ a13+a14  0        −a13     −a14  0
      −a21     a21+a25  0        0     −a25
      −a31     −a32     a31+a32  0     0
      −a41     0        0        a41   0
      0        −a52     0        0     a52 ]

Nullspace is agreement ⇔ graph has spanning tree

Consensus via feedback on disagreement: −[Lx]_i = ∑_{j=1}^N a_ij (x_j − x_i)
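The Laplacian construction and the disagreement feedback are easy to check numerically; a short sketch on the slide's 5-agent digraph, assuming unit edge weights (the weights are an assumption):

```python
import numpy as np

# Sketch: build L = diag(A 1) - A for the digraph on the slide and
# verify that the agreement vector lies in the nullspace.
A = np.zeros((5, 5))
A[0, 2] = A[0, 3] = 1.0   # a13, a14 (unit weights assumed)
A[1, 0] = A[1, 4] = 1.0   # a21, a25
A[2, 0] = A[2, 1] = 1.0   # a31, a32
A[3, 0] = 1.0             # a41
A[4, 1] = 1.0             # a52

L = np.diag(A @ np.ones(5)) - A
print(L @ np.ones(5))     # agreement vector maps to zero

# Feedback on disagreement: -[Lx]_i = sum_j a_ij (x_j - x_i)
x = np.arange(5.0)
print(-(L @ x))           # per-agent consensus correction
```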


Page 6:

Part 1: Distributed constrained convex optimization

min_{w^i∈W^i ∀i, D∈D}  ∑_{i=1}^N f^i(w^i, D)

s.t.  g^1(w^1, D) + ··· + g^N(w^N, D) ≤ 0
      (constraints might couple decisions of agents that cannot communicate directly)

e.g.,  ∑_i ‖w^i‖₂² − 10 ≤ 0

Agent i only knows how w^i enters the constraint, through g^i

D is the usual decision vector the agents need to agree upon

Constraints are useful models for

Traffic and routing (Flow conservation)

Resource allocation (Budgets)

Optimal control (System evolution)

Network formation (Relative positions/angles)


Page 7:

Agenda for distributed constrained optimization

Previous work and limitations

Distributing the constraints through the Lagrangian decomposition

- Idea: Agreement on the multiplier

General saddle-point problems with agreement constraints

Our distributed saddle-point dynamics with Laplacian averaging

- Convergence theorem: saddle-point evaluation error ∼ 1/√(# iter.)


Page 8:

Previous work by type of constraint & info. structure

    min_x ∑_{i=1}^N f^i(x)   s.t.  g(x) ≤ 0        (all agents know g)

  2011 D. Yuan, S. Xu, and H. Zhao
  2012 M. Zhu and S. Martínez
  Increasing literature

    min_{w^i∈W^i} ∑_{i=1}^N f^i(w^i)   s.t.  ∑_{i=1}^N g^i(w^i) ≤ 0

  Agent i knows only g^i; when Aw ≤ 0, it knows only column i of A (versus column i & row i)

  Less studied information structure:
  '10 D. Mosk-Aoyama, T. Roughgarden, and D. Shah (only linear constraints)
  '13 M. Bürger, G. Notarstefano, and F. Allgöwer (dual cutting-plane consensus methods)
  '13 T.-H. Chang, A. Nedić, and A. Scaglione (primal-dual perturbation methods)

Page 9:

Distributing the constraint via agreement on multipliers

min_{w^i∈W^i, D∈D}  ∑_{i=1}^N f^i(w^i, D)   s.t.  g^1(w^1, D) + ··· + g^N(w^N, D) ≤ 0

same as

min_{w^i∈W^i, D∈D}  max_{z∈R^m_≥0}  ∑_{i=1}^N f^i(w^i, D) + z^⊤ ∑_{i=1}^N g^i(w^i, D)

= min_{w^i∈W^i, D∈D}  max_{z^i∈R^m_≥0, z^i=z^j ∀i,j}  ∑_{i=1}^N ( f^i(w^i, D) + z^i⊤ g^i(w^i, D) )

= min_{w^i∈W^i, D^i∈D, D^i=D^j ∀i,j}  max_{z^i∈R^m_≥0, z^i=z^j ∀i,j}  ∑_{i=1}^N ( f^i(w^i, D^i) + z^i⊤ g^i(w^i, D^i) )
                                                                      (local terms, coupled through agreement)

(Existence of saddle points ⇒ max-min property = strong duality)

Page 10:

Saddle-point problems with explicit agreement

A more general framework

min_{w∈W, (D^1,...,D^N)∈D^N, D^i=D^j ∀i,j}   max_{μ∈M, (z^1,...,z^N)∈Z^N, z^i=z^j ∀i,j}   φ( w, (D^1,...,D^N), μ, (z^1,...,z^N) )

(convex in w and (D^1,...,D^N); concave in μ and (z^1,...,z^N))

Distributed setting unstudied in the literature. Inspiration from A. Nedić and A. Ozdaglar 09 and K. Arrow et al. 1958.

Particularizes to...

Convex-concave functions arising from Lagrangians
- The concave part is linear

Min-max formulation of nuclear norm regularization (later in talk)
- The concave part is quadratic


Page 11:

Our general algorithm

Projected saddle-point subgradient algorithm with Laplacian averaging (provided a saddle point exists under agreement):

  w_{t+1} = P_W( w_t − η_t g_{w_t} )
  D_{t+1} = P_{D^N}( D_t − σ L_t D_t − η_t g_{D_t} )        (−σ L_t D_t: Laplacian averaging)
  μ_{t+1} = P_M( μ_t + η_t g_{μ_t} )
  z_{t+1} = P_{Z^N}( z_t − σ L_t z_t + η_t g_{z_t} )        (−σ L_t z_t: Laplacian averaging)

P_W, P_{D^N}, P_M, P_{Z^N} are orthogonal projections onto compact convex sets; g_{w_t}, g_{D_t} are subgradients and g_{μ_t}, g_{z_t} supergradients of φ(w, D, μ, z)

Any initial conditions; not “anytime constraint satisfaction”
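A toy sketch of these dynamics on a hypothetical two-term Lagrangian; the cost, constraint, graph, stepsizes, and projection boxes below are illustrative choices, not values from the paper.

```python
import numpy as np

# Sketch of the projected saddle-point subgradient dynamics with
# Laplacian averaging on phi(w, z) = sum_i [(w_i - c_i)^2 + z_i (w_i - b/N)],
# i.e., the Lagrangian of min sum_i (w_i - c_i)^2 s.t. sum_i w_i <= b,
# with agreement enforced on the local multiplier copies z_i.

N = 5
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # local cost centers (assumed)
b = 10.0                                   # coupling constraint level (assumed)

# Laplacian of an undirected ring among the agents
A = np.zeros((N, N))
for i in range(N):
    A[i, (i + 1) % N] = A[i, (i - 1) % N] = 1.0
Lap = np.diag(A.sum(axis=1)) - A

w = np.zeros(N)            # primal variables
z = np.zeros(N)            # local copies of the Lagrange multiplier
sigma = 0.3                # consensus stepsize
for t in range(1, 2001):
    eta = 1 / np.sqrt(t)                   # decreasing subgradient stepsize
    g_w = 2 * (w - c) + z                  # subgradient in w
    g_z = w - b / N                        # supergradient in z
    w = np.clip(w - eta * g_w, -10, 10)                     # project onto box
    z = np.clip(z - sigma * (Lap @ z) + eta * g_z, 0, 20)   # ascent + averaging

print("sum(w) ->", w.sum(), " multipliers ->", z)  # expect sum(w) near b, z near 2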


Page 12:

Theorem (Distributed saddle-point approximation)

Assume that

- φ(w, D, μ, z) is convex in (w, D) ∈ W × D^N and concave in (μ, z) ∈ M × Z^N
- the dynamics is bounded (maybe achieved through projections)
- the sequence of weight-balanced communication digraphs is
  - δ-nondegenerate (a_ij > δ whenever a_ij > 0)
  - B-jointly connected (unions over windows of length B are strongly connected)

For a suitable choice of consensus stepsize σ and (decreasing) subgradient stepsizes {η_t}, then, for any saddle point (w*, D*, μ*, z*) of φ, where D* = D*⊗1 and z* = z*⊗1,

  −α/√(t−1) ≤ φ(w^av_t, D^av_t, μ^av_t, z^av_t) − φ(w*, D*, μ*, z*) ≤ α/√(t−1)

with running averages that can be computed recursively:

  w^av_{t+1} := 1/(t+1) ∑_{s=1}^{t+1} w_s = t/(t+1) w^av_t + 1/(t+1) w_{t+1}
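This recursion translates directly into code; a minimal sketch:

```python
# Sketch: the running average in the theorem can be maintained
# recursively, without storing the whole trajectory.
def update_running_average(w_av, w_next, t):
    """Average of the first t+1 iterates from the average of the first t."""
    return (t * w_av + w_next) / (t + 1)

w_av = 0.0
for t, w in enumerate([3.0, 1.0, 2.0]):   # iterates w_1, w_2, w_3
    w_av = update_running_average(w_av, w, t)
print(w_av)                                # 2.0, the mean of the iterates
```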


Page 13:

Part 2: Beyond Lagrangians

Lagrangians are particular cases of convex-concave functions: the concave part (in the Lagrange multipliers) is always linear

Other min-max problems can benefit from distributed formulations, e.g., min-max formulations of the nuclear norm

Agenda for distributed optimization with nuclear norm regularization

Definition of nuclear norm

Application to low-rank matrix completion

Our dynamics for distributed optimization with the nuclear norm
- Convergence theorem (a corollary of the previous result)


Page 14:

Review: definition of nuclear norm

Given a matrix W = [ w_1 ··· w_N ] ∈ R^{d×N},

  ‖W‖_* := sum of singular values of W = trace √(WW^⊤) = trace √( ∑_{i=1}^N w_i w_i^⊤ )

Optimization with nuclear norm regularization

  min_{w_i∈W_i}  ∑_{i=1}^N f_i(w_i) + γ‖W‖_*

favors vectors {w_i}_{i=1}^N belonging to a low-dimensional subspace
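A quick numerical check of the two equivalent expressions for the nuclear norm above (the random matrix is an assumption for illustration):

```python
import numpy as np

# Sketch: nuclear norm as sum of singular values vs. trace sqrt(W W^T).
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))

nn_svd = np.linalg.svd(W, compute_uv=False).sum()    # sum of singular values

# trace of the matrix square root of W W^T, via its eigendecomposition
evals, evecs = np.linalg.eigh(W @ W.T)
nn_sqrt = np.sqrt(np.clip(evals, 0.0, None)).sum()   # trace sqrt(W W^T)

print(np.isclose(nn_svd, nn_sqrt))                   # True
```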

Page 15:

Distributed low-rank matrix completion

                 Tara   Philip   Mauricio   Miroslav
  Toy Story        ?       ?         ?
  Jurassic Park    ?                 ?
    ···           ···     ···       ···        ···

W = [ W_{:,1}  W_{:,2}  W_{:,3}  W_{:,4} ]

Estimate W from the revealed entries {Zij}

  min_{W∈R^{d×N}}  ∑_{(i,j)∈revealed} (W_ij − Z_ij)² + γ‖W‖_*      (nuclear norm regularization)

γ depends on application, dimensions... (Regularization, not penalty)

Netflix: users N ∼ 10^7 ≫ movies d ∼ 10^5

Why make it distributed? Because users may not want to share their ratings


Page 16:

Formulation of nuclear norm as saddle-point problem

Drawing from another paper by the authors (details omitted here)

min_{w_i∈W_i}  ∑_{i=1}^N f_i(w_i) + γ‖[W | √ε I_d]‖_*

= min_{w_i∈W_i, D_i∈{D⪰cI_d}, D_i=D_j ∀i,j}   sup_{x_i∈R^d, Y_i∈R^{d×d}}   ∑_{i=1}^N F_i(w_i, D_i, x_i, Y_i)
                                              (convex in (w_i, D_i), concave in (x_i, Y_i))

with convex-concave local functions F_i : R^d × {D⪰cI_d} × R^d × R^{d×d} → R,

  F_i(w, D, x, Y) := f_i(w) + γ trace( D( −xx^⊤ − (ε/N) YY^⊤ ) )         (quadratic concave part, because D ⪰ 0)
                     − 2γ w^⊤x − (2γε/N) trace(Y) + (γ/N) trace(D)       (linear in each variable)

See "Distributed optimization for multi-task learning via nuclear-norm approximation", NecSys'15, D. Mateos-Núñez, J. Cortés

Page 17:

Distributed saddle-point dynamics for nuclear optimization

w_i(k+1) = P_W( w_i(k) − η_k ( g_i(k) − 2γ x_i(k) ) )

D_i(k+1) = P_{D⪰cI_d}( D_i(k) − η_k γ( −x_i(k) x_i(k)^⊤ − (ε/N) Y_i(k) Y_i(k)^⊤ + (1/N) I_d )
                       + σ ∑_{j=1}^N a_{ij,t} (D_j(k) − D_i(k)) )    (the "only" communication, size d×d)

x_i(k+1) = x_i(k) + η_k γ( −2 D_i(k) x_i(k) − 2 w_i(k) )

Y_i(k+1) = Y_i(k) + η_k γ( −(2ε/N) D_i(k) Y_i(k) − (2ε/N) I_d )

Convergence is a corollary of the previous theorem

User i does not need to share w_i with its neighbors!

D_i → √( ∑_{i=1}^N w_i w_i^⊤ + ε I_d ) conveys only mixed information

Complexity per iteration: orthogonal projection onto {D ⪰ cI_d}
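A compact sketch of these updates on synthetic data; the constants gamma, eps, c, sigma, the ring graph, and the rank-1 ground truth are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

# Sketch of the distributed saddle-point dynamics for nuclear-norm
# regularized matrix completion, following the updates above.
rng = np.random.default_rng(2)
d, N = 4, 8                            # movies x users
Z = rng.normal(size=(d, 1)) @ rng.normal(size=(1, N))  # rank-1 ground truth
mask = rng.random((d, N)) < 0.6        # revealed entries, per user column
gamma, eps, c, sigma = 0.1, 0.1, 1e-3, 0.2

# Ring graph among users; only the D_i matrices are communicated
neighbors = [((i - 1) % N, (i + 1) % N) for i in range(N)]

def project_psd_shifted(D, c):
    """Orthogonal projection onto {D : D >= c*I}: symmetrize, clamp eigenvalues."""
    D = (D + D.T) / 2
    evals, evecs = np.linalg.eigh(D)
    return evecs @ np.diag(np.maximum(evals, c)) @ evecs.T

w = np.zeros((N, d))                   # user i's column of W (kept private)
D = np.stack([np.eye(d)] * N)          # local auxiliary matrices
x = np.zeros((N, d))
Y = np.stack([np.eye(d)] * N)

for k in range(1, 3001):
    eta = 1 / np.sqrt(k)
    D_new = np.empty_like(D)
    for i in range(N):
        g_i = 2 * mask[:, i] * (w[i] - Z[:, i])       # local fitting subgradient
        w[i] = np.clip(w[i] - eta * (g_i - 2 * gamma * x[i]), -10, 10)
        lap = sum(D[j] - D[i] for j in neighbors[i])  # Laplacian averaging
        grad_D = gamma * (-np.outer(x[i], x[i]) - (eps / N) * Y[i] @ Y[i].T
                          + np.eye(d) / N)
        D_new[i] = project_psd_shifted(D[i] - eta * grad_D + sigma * lap, c)
        x[i] = x[i] + eta * gamma * (-2 * D[i] @ x[i] - 2 * w[i])
        Y[i] = Y[i] + eta * gamma * (-(2 * eps / N) * D[i] @ Y[i]
                                     - (2 * eps / N) * np.eye(d))
    D = D_new

W = w.T                                # assembled only for evaluation
print("fit on revealed entries:", np.linalg.norm(mask * (W - Z)))
```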

Page 18:

Simulation of matrix completion

20 users × 8 movies. Each user rates 5 movies. Ratings are private

[Figure: three plots versus iterations k, comparing the distributed saddle-point dynamics with centralized subgradient descent:
  - matrix fitting error ‖W(k) − Z‖_F / ‖Z‖_F
  - network cost function ∑_{i=1}^N ∑_{j∈Υ_i} (W_ij(k) − Z_ij)² + γ‖[W(k) | √ε I_d]‖_*
  - disagreement of the local matrices ( ∑_{i=1}^N ‖D_i(k) − (1/N) ∑_{i=1}^N D_i(k)‖_F² )^{1/2} ]

Page 19:

Conclusions

More details in

"Distributed saddle-point subgradient algorithms with Laplacian averaging", arXiv preprint, submitted to Transactions on Automatic Control

Our algorithms particularize to deal with

- Saddle points of Lagrangians for distributed constrained optimization
  - Less studied type of constraints/information structure in the literature
  - Constraints couple decisions of agents that can't communicate directly

- Min-max distributed formulations of the nuclear norm
  "Distributed optimization for multi-task learning via nuclear-norm approximation", D. Mateos-Núñez, J. Cortés
  - First multi-agent treatment of nuclear norm regularization


Page 20:

Future directions

Bounds on Lagrange multipliers in a distributed way
- Necessary to guarantee boundedness of the dynamics' trajectories
- One such procedure in the arXiv version

Application to semidefinite constraints with chordal sparsity
- Agents update the entries corresponding to maximal cliques, subject to agreement on the intersections

Other applications that you can find, e.g.:

IEEE Spectrum: Japan's project of an orbital solar farm


Page 21:

Thank you for listening!


Page 22:

(Back slide) Outline of the proof

Inequality techniques from A. Nedić and A. Ozdaglar, 2009

- Saddle-point evaluation error

    t φ(w^av_{t+1}, D^av_{t+1}, μ^av_{t+1}, z^av_{t+1}) − t φ(w*, D*, μ*, z*)     (1)

  at running-time averages, w^av_{t+1} := (1/t) ∑_{s=1}^t w_s, etc.

- Bound for (1) in terms of
  - initial conditions
  - bound on subgradients and states of the dynamics
  - disagreement
  - sum of learning rates

Input-to-state stability with respect to agreement:

    ‖L_K D_t‖_2 ≤ C_I ‖D_1‖_2 ( 1 − δ/(4N²) )^⌈(t−1)/B⌉ + C_U max_{1≤s≤t−1} ‖d_s‖_2
    (subgradients d_s enter as disturbances)

Doubling-trick scheme: for m = 0, 1, 2, ..., ⌈log₂ t⌉, take η_s = 1/√(2^m) in each period of 2^m rounds, s = 2^m, ..., 2^{m+1} − 1
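The doubling-trick schedule is simple to implement; a minimal sketch:

```python
import numpy as np

# Sketch of the doubling-trick stepsize schedule from the proof outline:
# eta_s = 1/sqrt(2^m) during the m-th period of 2^m rounds.
def doubling_trick_stepsize(s):
    """Stepsize for round s >= 1 (rounds 1, 2-3, 4-7, 8-15, ... share a period)."""
    m = int(np.floor(np.log2(s)))      # index of the period containing round s
    return 1.0 / np.sqrt(2.0 ** m)

print([round(doubling_trick_stepsize(s), 3) for s in range(1, 10)])
# [1.0, 0.707, 0.707, 0.5, 0.5, 0.5, 0.5, 0.354, 0.354]
```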