

Efficient Linear Programming for Dense CRFs

Thalaiyasingam Ajanthan¹ Alban Desmaison² Rudy Bunel² Mathieu Salzmann³ Philip H. S. Torr² M. Pawan Kumar²,⁴

¹Australian National University & Data61, CSIRO ²Department of Engineering Science, University of Oxford ³CVLab, EPFL ⁴Alan Turing Institute

Introduction

AIM: Efficient and accurate energy minimization in fully connected CRFs.

Dense CRF energy: Defined on a set of n random variables X = {X1, . . . , Xn}, where each random variable Xa takes a label xa ∈ L.

E(x) = ∑_{a=1}^{n} φa(xa) + ∑_{a=1}^{n} ∑_{b=1, b≠a}^{n} ψab(xa, xb),

where the first sum collects the unary potentials and the double sum the pairwise potentials.

▸ Captures long-range interactions and provides fine-grained segmentations [5]. [Figure: sparse CRF vs. dense CRF connectivity.]

Difficulty: There are O(n²) variables (n ≈ 10⁶) ⇒ even evaluating the energy is intractable.
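The quadratic blow-up is easy to see in code. Below is a minimal NumPy sketch of the energy above, with hypothetical names `unary[a, i]` for φa(i) and `K[a, b]` for Kab; the double loop is exactly the O(n²) cost that rules out direct evaluation at n ≈ 10⁶:

```python
import numpy as np

# Toy dense CRF energy: every variable interacts with every other one.
# `unary[a, i]` stands for phi_a(i); `K[a, b]` for the kernel value K_ab.
def dense_crf_energy(x, unary, K):
    n = len(x)
    energy = sum(unary[a, x[a]] for a in range(n))   # unary potentials
    for a in range(n):
        for b in range(n):
            if b != a and x[a] != x[b]:              # Potts term: K_ab * 1[x_a != x_b]
                energy += K[a, b]
    return energy
```

The nested loop makes the O(n²) scaling explicit; the rest of the poster is about avoiding it.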

Gaussian pairwise potentials:

ψab(xa, xb) = 1[xa ≠ xb] exp(−‖fa − fb‖² / 2),

where the exponential factor is denoted Kab and fa ∈ IR^d.

The pairwise computation then reduces to a matrix-vector product v′ = K v, with v′, v ∈ IR^{n×1} and K = (Kab) ∈ IR^{n×n}.

▸ Approximate this product using the filtering method [1] ⇒ O(n) computations.
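As a toy illustration of what the filtering method approximates, here is the exact (O(n²)) Gaussian matrix-vector product; `gaussian_matvec` is a hypothetical name, and the permutohedral-lattice filter of [1] computes the same product approximately in O(n) without ever forming K:

```python
import numpy as np

# Exact Gaussian matvec: v'_a = sum_b exp(-||f_a - f_b||^2 / 2) v_b.
# Building K explicitly costs O(n^2) time and memory, which is what
# the filtering method of [1] avoids.
def gaussian_matvec(F, v):
    # F: (n, d) feature vectors, v: (n,) values.
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # pairwise ||f_a - f_b||^2
    K = np.exp(-sq / 2.0)
    return K @ v
```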

Existing efficient algorithms:
▸ Mean Field (MF) [5], Quadratic Programming (QP) [3] and Difference of Convex (DC) Programming [3].
▸ All of these rely on the efficient filtering method and have linear time complexity per iteration.

Drawback: No multiplicative bound ⇒ the solution obtained by these algorithms can be far from the optimum.

The Linear Programming (LP) relaxation provides the best multiplicative bound, but the existing algorithm is too slow.

Linear Programming (LP) Relaxation

▸ Introduce indicator variables: ya:i = 1 ⇔ xa = i.

min_y Ẽ(y) = ∑_a ∑_i φa:i ya:i + ∑_{a, b≠a} ∑_i Kab |ya:i − yb:i| / 2,

s.t. y ∈ M = { y | ∑_i ya:i = 1, ∀a ∈ {1 . . . n};  ya:i ≥ 0, ∀a ∈ {1 . . . n}, i ∈ L }.

▸ For integer labellings, Ẽ(y) = E(x).

E.g., L = {0, 1, 2, 3, 4}: xa = 2 ⇒ ya = [0, 0, 1, 0, 0]; xb = 3 ⇒ yb = [0, 0, 0, 1, 0]; then 1[xa ≠ xb] = ∑_i |ya:i − yb:i| / 2.

▸ The LP provides an integrality gap of 2 [4], and no better relaxation can be designed due to the UGC hardness result [7].

Difficulty: Standard LP solvers would require O(n²) variables ⇒ intractable.

Handling the absolute values requires auxiliary variables: |ya:i − yb:i| = max{ya:i − yb:i, yb:i − ya:i} ⇒ zab:i ≥ ya:i − yb:i and zab:i ≥ yb:i − ya:i.

▸ Therefore the non-smooth LP objective has to be tackled directly.

Existing algorithm [3]: Projected subgradient descent ⇒ too slow.
▸ Linearithmic time per iteration (due to a divide-and-conquer strategy).
▸ Expensive line search to obtain the step size.
▸ Requires a large number of iterations.

E.g., n = 4:

    [v′1]   [ 0  K12 K13 K14 ] [1]
    [v′2] = [ 0   0  K23 K24 ] [1]
    [v′3]   [ 0   0   0  K34 ] [1]
    [v′4]   [ 0   0   0   0  ] [1]

LP subgradient computation.

Contribution: LP in linear time per iteration ⇒ an order of magnitude speedup (n = 10⁶ ⇒ ~20× speedup).

▸ The first LP minimization algorithm for dense CRFs that maintains linear scaling in both time and space complexity.

Our Approach: Proximal Minimization of LP

Algorithm overview: Proximal minimization using block-coordinate descent.
▸ One block: significantly smaller subproblems.
▸ The other block: efficient conditional gradient descent.
▸ Linear time complexity per iteration.
▸ Optimal step size.
▸ Guarantees optimality and converges faster.

Proximal problem: Let λ > 0 and let y^k be the current estimate:

min_y Ẽ(y) + (1/(2λ)) ‖y − y^k‖², s.t. y ∈ M.

▸ The quadratic regularization makes the dual problem smooth ⇒ efficient optimization of the dual.
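The outer loop can be sketched in miniature. The toy below runs proximal minimization on a 1-D stand-in objective f(y) = |y| over [−1, 1], where each proximal subproblem has a closed-form (soft-thresholding) solution; the names `prox_abs` and `proximal_minimize` are illustrative, not from the paper:

```python
import numpy as np

# Proximal minimization: repeatedly minimize the objective plus a quadratic
# tether (1/(2*lam)) * ||y - y_k||^2. Here f(y) = |y| on [-1, 1] stands in
# for the LP objective, and the subproblem is solved exactly, mirroring the
# exact dual solves used in the poster.
def prox_abs(yk, lam):
    # argmin_y |y| + (1/(2*lam)) * (y - yk)^2, clipped to the feasible box.
    step = np.sign(yk) * max(abs(yk) - lam, 0.0)   # soft-thresholding
    return float(np.clip(step, -1.0, 1.0))

def proximal_minimize(y0, lam=0.5, iters=20):
    y = y0
    for _ in range(iters):
        y = prox_abs(y, lam)   # each step stays close to the previous estimate
    return y
```

The quadratic tether is what makes each subproblem well behaved; in the poster it is what makes the dual smooth.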

Dual of the Proximal Problem

Dual variables:
▸ α = {α¹ab:i, α²ab:i | a, b ≠ a, i ∈ L}, where α¹ab:i is associated with zab:i ≥ ya:i − yb:i and α²ab:i with zab:i ≥ yb:i − ya:i.
▸ β = {βa | a ∈ {1 . . . n}}, where βa is associated with ∑_i ya:i = 1.
▸ γ = {γa:i | a ∈ {1 . . . n}, i ∈ L}, where γa:i is associated with ya:i ≥ 0.
▸ The matrices A and B are used to compactly write the dual function.
▸ α always appears in the product Aα ⇒ space complexity is linear.

min_{α,β,γ} g(α, β, γ) = (λ/2) ‖Aα + Bβ + γ − φ‖² + ⟨Aα + Bβ + γ − φ, y^k⟩ − ⟨1, β⟩,

s.t. γa:i ≥ 0, ∀a ∈ {1 . . . n}, ∀i ∈ L,

α ∈ C = { α | α¹ab:i + α²ab:i = Kab/2 and α¹ab:i, α²ab:i ≥ 0, ∀a, b ≠ a, i ∈ L }.

Optimizing over β and γ

▸ β: Unconstrained ⇒ set the derivative to zero.
▸ γ: Unbounded and separable ⇒ an m-dimensional QP for each pixel, where m is the number of labels.
▸ Efficiently optimized using the multiplicative iteration algorithm of [8] (≈ 2 milliseconds).

For each pixel a, the QP has the form

min_{γa ≥ 0} (1/2) γaᵀ Q γa + ⟨γa, Q((Aαᵗ)a − φa) + y^k_a⟩,

where Q ∈ IR^{m×m}.

Optimizing over α

Conditional gradient descent: min_{α∈C} g(α), where g is differentiable and C is convex and compact.

Conditional gradient computation:
▸ Minimize the first-order Taylor approximation.

Step size computation:
▸ δ = 2/(t + 2) or by line search. (Image from [6].)

In our case:
▸ The conditional gradient has the same form as the subgradient of the LP [2].
▸ The optimal step size can be computed analytically.
▸ These benefits are due to our choice of quadratic regularization.
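A generic conditional-gradient (Frank-Wolfe) loop with the δ = 2/(t + 2) step can be sketched as follows, here over the probability simplex as a stand-in for the set C; `frank_wolfe` and its arguments are illustrative:

```python
import numpy as np

# Frank-Wolfe: at each step, minimize the linear (first-order Taylor)
# approximation over the compact convex set, then move toward that minimizer
# with step size 2/(t + 2). Over the simplex, the linear minimizer is the
# vertex with the smallest gradient coordinate.
def frank_wolfe(grad, x0, iters=200):
    x = x0.copy()
    for t in range(iters):
        s = np.zeros_like(x)
        s[np.argmin(grad(x))] = 1.0        # conditional gradient (a vertex of C)
        x += 2.0 / (t + 2.0) * (s - x)     # convex combination stays feasible
    return x
```

Every iterate is a convex combination of vertices, so feasibility never needs to be enforced explicitly, which is what makes the method attractive for the structured set C.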

∀a ∈ {1 . . . n}: v′a = ∑_b Kab vb 1[ya ≥ yb], where ya, yb ∈ [0, 1].

E.g., n = 3:

    [v′1]   [ K11 K12 K13 ] [v1]
    [v′2] = [  0  K22 K23 ] [v2]
    [v′3]   [  0   0  K33 ] [v3]

Linear time conditional gradient computation:
Difficulty: The original permutohedral lattice based filtering method of [1] cannot handle the ordering constraint.

Existing method [3]: Repeated application of the filtering method using a divide-and-conquer strategy ⇒ O(d²n log n) computations (d = 2 or 5).

Our idea: Discretize the interval [0, 1] into H levels and instantiate H permutohedral lattices ⇒ O(Hdn) computations (H = 10).

Modified filtering method:
▸ Creates a single permutohedral lattice and reuses it H times ⇒ cache efficient.
▸ Substantial speedup (> 40×) compared to the state-of-the-art method of [3].
▸ Our modified filtering method has wide applicability beyond what is shown in this work.
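The H-level idea can be demonstrated on a toy problem where the "filter" is an explicit matrix-vector product. One filter call per level plus a running prefix sum reproduces the ordering-constrained product v′a = ∑_b Kab vb 1[ya ≥ yb] (up to discretization of y); all names here are illustrative:

```python
import numpy as np

# Discretize y in [0, 1] into H buckets; filter each bucket's values
# separately and accumulate a running prefix. A point whose y_a falls in
# bucket h then receives the contributions of all b with bucket(y_b) <= h,
# approximating 1[y_a >= y_b] with H filter calls. Here the "filter" is an
# explicit matvec; the real method reuses one permutohedral lattice H times.
def ordered_matvec_discretized(K, v, y, H=10):
    levels = np.minimum((y * H).astype(int), H - 1)   # bucket of each y_b
    out = np.zeros_like(v)
    prefix = np.zeros_like(v)
    for h in range(H):
        prefix = prefix + K @ (v * (levels == h))     # add level-h contributions
        out[levels == h] = prefix[levels == h]        # readout for bucket-h points
    return out
```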

Segmentation Results

Datasets: Pascal VOC 2010 and MSRC.

[Figure: Energy vs. time plots for an image in MSRC and Pascal, comparing MF, DCneg, SG-LPℓ and PROX-LPacc.]

[Figure: Energy, time, segmentation accuracy and IoU score of the different algorithms (MF, DCneg, SG-LPℓ, PROX-LPacc) on the MSRC (top) and Pascal (bottom) datasets.]

▸ Both LP minimization algorithms are initialized with DCneg.

[Figure: Qualitative results, columns: Image, MF, DCneg, SG-LPℓ, PROX-LPacc, Ground truth.]

Modified filtering method:

[Figure: Speedup of our modified filtering method over that of [3], as a function of the number of pixels and the number of labels, for the spatial kernel (d = 2) and the bilateral kernel (d = 5). The speedup is around 45–65 on a Pascal image.]

Discussion

▸ We have introduced an efficient and accurate algorithm for energy minimization in dense CRFs.
▸ Our algorithm can be incorporated in an end-to-end learning framework to further improve the accuracy of deep semantic segmentation architectures.

Code: https://github.com/oval-group/DenseCRF

[1] Adams et al., Computer Graphics Forum, 2010.
[2] Bach, SIAM, 2015.
[3] Desmaison et al., ECCV, 2016.
[4] Kleinberg and Tardos, Journal of the ACM, 2002.
[5] Krähenbühl and Koltun, NIPS, 2011.
[6] Lacoste-Julien et al., ICML, 2012.
[7] Manokaran et al., STOC, 2008.
[8] Xiao et al., Numerical Linear Algebra with Applications, 2014.