


Newton-Laplace Updates for Block Coordinate Descent
Si Yi Meng, Mark Schmidt

University of British Columbia

Motivation

I Block coordinate descent (BCD) methods are used to optimize many machine learning objectives.
- Easy to implement.
- Low memory requirements.
- Cheap iteration costs.
- Adaptability to distributed settings.

I Applications: Group Lasso, SVMs, etc.

I Consider the problem: arg min_{x∈ℝⁿ} f(x), where f is convex and twice continuously differentiable.

BCD update rule

x^{k+1}_{b_k} = x^k_{b_k} + d^k

I First-order updates use a gradient descent direction:

d^k = −α_k ∇_{b_k} f(x^k), where α_k is the step size.

I Speed up convergence using blockwise-Newton updates:

d^k = −α_k [∇²_{b_k b_k} f(x^k)]^{−1} ∇_{b_k} f(x^k)

- Possible to obtain superlinear convergence for problems with certain structures.
- Use of larger blocks can lead to faster convergence, but iteration cost is O(|b_k|³).
- Step sizes are often chosen with line search.
A minimal sketch of one such iteration follows below.
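To make the update concrete, here is a minimal sketch (not the poster's code) of a blockwise-Newton BCD loop in Python/NumPy; the toy quadratic objective, random block selection, and unit step size are illustrative assumptions.

```python
# Minimal sketch of blockwise-Newton BCD on a toy quadratic (illustrative only).
import numpy as np

def blockwise_newton_step(grad, hess, x, block, alpha=1.0):
    """Update x on the coordinates in `block` using the Newton direction
    d = -alpha * [H_bb]^{-1} g_b, where g_b and H_bb are the block gradient
    and the block (sub-)Hessian."""
    g_b = grad(x)[block]
    H_bb = hess(x)[np.ix_(block, block)]
    d = -alpha * np.linalg.solve(H_bb, g_b)   # O(|b|^3) with a generic dense solver
    x_new = x.copy()
    x_new[block] += d
    return x_new

# Toy example: f(x) = 0.5 x^T A x - c^T x with a random SPD matrix A.
rng = np.random.default_rng(0)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
c = rng.standard_normal(n)
grad = lambda x: A @ x - c
hess = lambda x: A

x = np.zeros(n)
for k in range(20):
    block = rng.choice(n, size=3, replace=False)   # random block of 3 coordinates
    x = blockwise_newton_step(grad, hess, x, block)
print("gradient norm:", np.linalg.norm(grad(x)))
```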

Contributions

Previously:

I When the chosen block’s sparsity pattern has a tree structure, “message-passing” algorithms can be used to solve the system in linear time.

I One can also exploit the width of the Hessian’s computation graph to speed up the blockwise-Newton update.

I There could still be a limit to how large the blocks can grow before these structural constraints are violated.

Our work:

I Newton-Laplace updates: if the sub-Hessian corresponding to the block has a Laplacian/SDD structure, we can use fast Laplacian solvers to compute the blockwise-Newton update in near-linear time.

I Empirically demonstrate its fast convergence on graph-based semi-supervised learning objectives.

Laplacian solver

Laplacian matrix:

I Consider an undirected, possibly weighted graph G = (V, E):
- n = |V| is the number of vertices.
- Adjacency matrix W and degree matrix D, where D_ii = ∑_j W_ij.
- The Laplacian matrix is defined as L = D − W.

I L has many nice properties:
- Symmetric and diagonally dominant (SDD): L_ii ≥ ∑_{j≠i} |L_ij|.
- Positive semi-definite: Lᵀ = L ⪰ 0.
- Multiplicity of the eigenvalue 0 indicates the number of connected components.

Example (figure): a 4-vertex graph with edges {1-2, 1-4, 2-3, 2-4, 3-4} and Laplacian

L = [  2  −1   0  −1
      −1   3  −1  −1
       0  −1   2  −1
      −1  −1  −1   3 ]
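As a quick sanity check of these properties, the following sketch builds L = D − W for the 4-vertex example above with NumPy; the edge list is read off the matrix shown, and the check itself is purely illustrative.

```python
# Build L = D - W for the 4-vertex example and check the properties listed above.
import numpy as np

edges = [(0, 1), (0, 3), (1, 2), (1, 3), (2, 3)]   # vertices 1..4 mapped to indices 0..3
n = 4
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0                        # unweighted graph
D = np.diag(W.sum(axis=1))
L = D - W

assert np.allclose(L, L.T)                                # symmetric
offdiag = np.abs(L - np.diag(np.diag(L))).sum(axis=1)
assert np.all(np.diag(L) >= offdiag - 1e-12)              # diagonally dominant
eigvals = np.linalg.eigvalsh(L)
assert eigvals.min() >= -1e-12                            # positive semi-definite
print("eigenvalues:", np.round(eigvals, 3))               # one zero eigenvalue: graph is connected
print(L.astype(int))
```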

Laplacian systems Lx = b:

I These systems arise in many applications: graph-based semi-supervised learning, solutions of PDEs via the finite element method, max flows, resistor networks, etc.

I A natural approach to solving symmetric linear systems is Cholesky factorization.
- But fill-in can create a bottleneck: the triangular factor in the LDLᵀ decomposition can be dense even when L is sparse.
- For Laplacian systems, variable elimination corresponds to vertex elimination, and fill-in corresponds to adding a clique on the eliminated vertex’s neighbours.
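The fill-in effect is easy to see on a star graph: eliminating the hub first connects all leaves pairwise, while eliminating the leaves first creates no fill. The sketch below (an illustrative toy, not part of the poster) counts the fill edges created by a given elimination order.

```python
# Count fill edges created by symbolic vertex elimination in a given order.
def fill_in(adj, order):
    adj = {v: set(nbrs) for v, nbrs in adj.items()}    # work on a copy
    fill = 0
    for v in order:
        nbrs = [u for u in adj.pop(v) if u in adj]     # neighbours not yet eliminated
        for u in nbrs:
            adj[u].discard(v)
        # eliminating v connects its remaining neighbours pairwise (adds a clique)
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill

# Star graph: hub 0 connected to leaves 1..6.
star = {0: set(range(1, 7)), **{i: {0} for i in range(1, 7)}}
print(fill_in(star, order=[0, 1, 2, 3, 4, 5, 6]))   # hub first   -> 15 fill edges
print(fill_in(star, order=[1, 2, 3, 4, 5, 6, 0]))   # leaves first -> 0 fill edges
```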

Fast solvers:

I Generate a sparse, approximate Cholesky decomposition for Laplacian matrices by:
- Randomizing the order of elimination, which allows for a bound on the sample variance.
- Using a simpler procedure to estimate the effective resistances for computing the edge sampling probabilities.

I Cost is near-linear in the number of non-zeros in L.

I Efficient implementations in Julia for sparse Laplacian/SDD systems: Laplacians.jl
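The near-linear-time solvers themselves live in Laplacians.jl; as a rough stand-in for readers without Julia, the sketch below solves a Laplacian system with plain (unpreconditioned) conjugate gradient from SciPy on a made-up sparse graph. It illustrates the interface (build L, project b off the null space, solve), not the fast algorithm benchmarked here.

```python
# Stand-in sketch (not Laplacians.jl): solve L x = b with conjugate gradient.
# L is singular (L @ ones = 0), so b is projected onto mean-zero vectors and
# CG then converges to a solution of the consistent system.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

rng = np.random.default_rng(0)
n = 500
# Random sparse nonnegative weights, symmetrized, plus a path to keep the graph connected.
W = sp.random(n, n, density=0.02, random_state=0, format="csr")
W = W + W.T + sp.diags(np.ones(n - 1), 1) + sp.diags(np.ones(n - 1), -1)
W = sp.csr_matrix(W)
W.setdiag(0)
W.eliminate_zeros()

deg = np.asarray(W.sum(axis=1)).ravel()
L = sp.diags(deg) - W                     # graph Laplacian L = D - W

b = rng.standard_normal(n)
b -= b.mean()                             # remove the component along the all-ones null space

x, info = cg(L, b, atol=1e-10, maxiter=10 * n)
print("converged:", info == 0, " residual:", np.linalg.norm(L @ x - b))
```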

I Performance benchmark (with 10% sparsity in L):

[Figure: build time (seconds), build allocations (MB), solve time (seconds), and solve allocations (MB) versus the number of vertices n (up to 10,000), comparing approximate Cholesky with graph sparsification, PCG with augmented spanning tree preconditioners, and KMP solvers.]

Application

Graph-based semi-supervised learning problem:

I Given: a partial labeling of n examples x = (x_l, x_u) and pairwise similarities w_ij, where l is the set of indices corresponding to labeled examples and u = [n] \ l indexes the unlabeled ones.

I Goal: infer the missing labels.

Label propagation objective

f(x_u) = (1/2) ∑_{i∈l, j∈u} w_ij h(x_i − x_j) + (1/2) ∑_{i,j∈u} w_ij h(x_i − x_j).

I When h(z) = z²,

∇f(x_u) = (D_uu − W_uu) x_u − W_ul x_l,    ∇²f(x_u) = L_uu,

where L_uu is the Laplacian on the graph formed by the unlabeled examples, and the analytical solution can be obtained from x*_u = L_uu^{-1}(W_ul x_l).

I Thus the (sub-)Hessian of f is a Laplacian matrix, and every blockwise-Newton update therefore involves solving a Laplacian system.
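A minimal sketch of the quadratic case: build L_uu and W_ul from a synthetic similarity graph and compute x*_u = L_uu^{-1}(W_ul x_l), with a generic sparse solver standing in for the fast Laplacian solver. The graph, labels, and sizes are made up for illustration.

```python
# Sketch of the quadratic label-propagation solution x_u* = L_uu^{-1} (W_ul x_l)
# on a tiny synthetic graph (spsolve stands in for the fast Laplacian solver).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
n, n_labeled = 200, 20
# Random sparse similarity graph plus a path so that every vertex is connected.
W = sp.random(n, n, density=0.02, random_state=0)
W = W + W.T + sp.diags(np.ones(n - 1), 1) + sp.diags(np.ones(n - 1), -1)
W = sp.csr_matrix(W)
W.setdiag(0)

labeled = np.arange(n_labeled)                 # index set l
unlabeled = np.arange(n_labeled, n)            # index set u = [n] \ l
x_l = rng.choice([-1.0, 1.0], size=n_labeled)  # observed labels

deg = np.asarray(W.sum(axis=1)).ravel()
L = (sp.diags(deg) - W).tocsr()                # full graph Laplacian
L_uu = L[unlabeled][:, unlabeled]              # block of unlabeled vertices
W_ul = W[unlabeled][:, labeled]

x_u = spsolve(L_uu.tocsc(), W_ul @ x_l)        # full-block Newton step from x_u = 0 lands at the minimizer
print("predicted labels (first 10):", np.sign(x_u[:10]))
```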

I We can replace h(·) with the Huber loss:

h_ε(z) = (1/2) z²             if |z| ≤ ε
         ε(|z| − (1/2) ε)     otherwise

and the resulting update will also be a Laplacian system.
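For reference, a small sketch of h_ε and its derivative (the value of ε and the test points are arbitrary). Because h_ε''(z) ∈ {0, 1}, the Hessian of the Huber objective is again a Laplacian, now with edge weights w_ij · h_ε''(x_i − x_j), which is why the update stays a Laplacian system.

```python
# Huber loss h_eps(z) and its derivative: quadratic near zero, linear in the tails.
import numpy as np

def huber(z, eps=1.0):
    z = np.asarray(z, dtype=float)
    quad = 0.5 * z**2
    lin = eps * (np.abs(z) - 0.5 * eps)
    return np.where(np.abs(z) <= eps, quad, lin)

def huber_grad(z, eps=1.0):
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= eps, z, eps * np.sign(z))

z = np.linspace(-3, 3, 7)
print(huber(z, eps=1.0))
print(huber_grad(z, eps=1.0))
```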

I Best of both worlds!
- Cheap iterations using Laplacian solvers.
- Fast convergence from second-order updates.

Experiments

I ℓ2-regularized label propagation on the “two-moons” dataset.
- 2000 examples with 100 labeled and n = 1900 unlabeled.

I Pairwise label differences are measured with the Huber loss.

I Newton-Laplace updates using the approximate Cholesky Laplacian solver.

I Using exact solvers with fixed-size blocks chosen to satisfy a particular iteration cost:
- For example, choosing |b_k| = n^{1/3} gives an iteration cost of O(n) with generic, exact solvers.
- Although we obtain a slightly faster convergence rate with larger blocks, the associated iteration cost increases from O(n) to O(n²).

I Using tree partitioning and computing the update direction with “message-passing” algorithms:
- We can now use larger blocks with O(n) iteration cost, but the block size is still limited.

I Using Laplacian solvers:
- We can use full blocks and reach the minimum within 10 iterations.

Convergence rate comparison across different fixed block sizes and block selection strategies:

I All of these bear only an O(|b|) iteration cost.

Take home message: When memory is not an issue, one should take advantage of these solvers whenever the Hessian has, or can be closely approximated by, a Laplacian/SDD structure.