

Efficient Linear Programming for Dense CRFs

Thalaiyasingam Ajanthan¹ Alban Desmaison² Rudy Bunel² Mathieu Salzmann³ Philip H. S. Torr² M. Pawan Kumar²,⁴

¹Australian National University & Data61, CSIRO ²Department of Engineering Science, University of Oxford ³CVLab, EPFL ⁴Alan Turing Institute

Introduction

AIM: Efficient and accurate energy minimization in fully connected CRFs.

Dense CRF energy: Defined on a set of n random variables X = {X1, . . . , Xn}, where each random variable Xa takes a label xa ∈ L.

E(x) = ∑_{a=1}^{n} φa(xa) + ∑_{a=1}^{n} ∑_{b=1, b≠a}^{n} ψab(xa, xb),

where the first sum collects the unary potentials and the double sum the pairwise potentials.

▸ Captures long-range interactions and provides fine-grained segmentations [5]. [Figure: sparse CRF vs. dense CRF connectivity.]

Difficulty: There are O(n²) variables (n ≈ 10⁶) ⇒ even evaluating the energy is intractable.
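The quadratic blow-up is easy to see in code. Below is a minimal NumPy sketch of the energy above, with hypothetical names `unary[a, i]` for φa(i) and `K[a, b]` for Kab; the double loop is exactly the O(n²) cost that rules out direct evaluation at n ≈ 10⁶:

```python
import numpy as np

# Toy dense CRF energy: every variable interacts with every other one.
# `unary[a, i]` stands for phi_a(i); `K[a, b]` for the kernel value K_ab.
def dense_crf_energy(x, unary, K):
    n = len(x)
    energy = sum(unary[a, x[a]] for a in range(n))   # unary potentials
    for a in range(n):
        for b in range(n):
            if b != a and x[a] != x[b]:              # Potts term: K_ab * 1[x_a != x_b]
                energy += K[a, b]
    return energy
```

The nested loop makes the O(n²) scaling explicit; the rest of the poster is about avoiding it.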

Gaussian pairwise potentials:

ψab(xa, xb) = 1[xa ≠ xb] exp(−‖fa − fb‖² / 2),

where the exponential factor is denoted Kab and fa ∈ IR^d.

The pairwise computation then reduces to a matrix-vector product v′ = K v, with v′, v ∈ IR^{n×1} and K = (Kab) ∈ IR^{n×n}.

▸ Approximate this product using the filtering method [1] ⇒ O(n) computations.
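As a toy illustration of what the filtering method approximates, here is the exact (O(n²)) Gaussian matrix-vector product; `gaussian_matvec` is a hypothetical name, and the permutohedral-lattice filter of [1] computes the same product approximately in O(n) without ever forming K:

```python
import numpy as np

# Exact Gaussian matvec: v'_a = sum_b exp(-||f_a - f_b||^2 / 2) v_b.
# Building K explicitly costs O(n^2) time and memory, which is what
# the filtering method of [1] avoids.
def gaussian_matvec(F, v):
    # F: (n, d) feature vectors, v: (n,) values.
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)  # pairwise ||f_a - f_b||^2
    K = np.exp(-sq / 2.0)
    return K @ v
```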

Existing efficient algorithms:
▸ Mean Field (MF) [5], Quadratic Programming (QP) [3] and Difference of Convex (DC) Programming [3].
▸ All of these rely on the efficient filtering method and have linear time complexity per iteration.

Drawback: No multiplicative bound ⇒ the solution obtained by these algorithms can be far from the optimum.

The Linear Programming (LP) relaxation provides the best multiplicative bound, but the existing algorithm is too slow.

Linear Programming (LP) Relaxation

▸ Introduce indicator variables: ya:i = 1 ⇔ xa = i.

min_y Ẽ(y) = ∑_a ∑_i φa:i ya:i + ∑_{a, b≠a} ∑_i Kab |ya:i − yb:i| / 2,

s.t. y ∈ M = { y | ∑_i ya:i = 1, ∀a ∈ {1 . . . n};  ya:i ≥ 0, ∀a ∈ {1 . . . n}, i ∈ L }.

▸ For integer labellings, Ẽ(y) = E(x).

E.g., L = {0, 1, 2, 3, 4}: xa = 2 ⇒ ya = [0, 0, 1, 0, 0]; xb = 3 ⇒ yb = [0, 0, 0, 1, 0]; then 1[xa ≠ xb] = ∑_i |ya:i − yb:i| / 2.

▸ The LP provides an integrality gap of 2 [4], and no better relaxation can be designed due to the UGC hardness result [7].

Difficulty: Standard LP solvers would require O(n²) variables ⇒ intractable.

Handling the absolute values requires auxiliary variables: |ya:i − yb:i| = max{ya:i − yb:i, yb:i − ya:i} ⇒ zab:i ≥ ya:i − yb:i and zab:i ≥ yb:i − ya:i.

▸ Therefore the non-smooth LP objective has to be tackled directly.

Existing algorithm [3]: Projected subgradient descent ⇒ too slow.
▸ Linearithmic time per iteration (due to a divide-and-conquer strategy).
▸ Expensive line search to obtain the step size.
▸ Requires a large number of iterations.

E.g., n = 4:

    [v′1]   [ 0  K12 K13 K14 ] [1]
    [v′2] = [ 0   0  K23 K24 ] [1]
    [v′3]   [ 0   0   0  K34 ] [1]
    [v′4]   [ 0   0   0   0  ] [1]

LP subgradient computation.

Contribution: LP in linear time per iteration ⇒ an order of magnitude speedup (n = 10⁶ ⇒ ~20× speedup).

▸ The first LP minimization algorithm for dense CRFs that maintains linear scaling in both time and space complexity.

Our Approach: Proximal Minimization of LP

Algorithm overview: Proximal minimization using block-coordinate descent.
▸ One block: significantly smaller subproblems.
▸ The other block: efficient conditional gradient descent.
▸ Linear time complexity per iteration.
▸ Optimal step size.
▸ Guarantees optimality and converges faster.

Proximal problem: Let λ > 0 and let y^k be the current estimate:

min_y Ẽ(y) + (1/(2λ)) ‖y − y^k‖², s.t. y ∈ M.

▸ The quadratic regularization makes the dual problem smooth ⇒ efficient optimization of the dual.
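The outer loop can be sketched in miniature. The toy below runs proximal minimization on a 1-D stand-in objective f(y) = |y| over [−1, 1], where each proximal subproblem has a closed-form (soft-thresholding) solution; the names `prox_abs` and `proximal_minimize` are illustrative, not from the paper:

```python
import numpy as np

# Proximal minimization: repeatedly minimize the objective plus a quadratic
# tether (1/(2*lam)) * ||y - y_k||^2. Here f(y) = |y| on [-1, 1] stands in
# for the LP objective, and the subproblem is solved exactly, mirroring the
# exact dual solves used in the poster.
def prox_abs(yk, lam):
    # argmin_y |y| + (1/(2*lam)) * (y - yk)^2, clipped to the feasible box.
    step = np.sign(yk) * max(abs(yk) - lam, 0.0)   # soft-thresholding
    return float(np.clip(step, -1.0, 1.0))

def proximal_minimize(y0, lam=0.5, iters=20):
    y = y0
    for _ in range(iters):
        y = prox_abs(y, lam)   # each step stays close to the previous estimate
    return y
```

The quadratic tether is what makes each subproblem well behaved; in the poster it is what makes the dual smooth.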

Dual of the Proximal Problem

Dual variables:
▸ α = {α¹ab:i, α²ab:i | a, b ≠ a, i ∈ L}, where α¹ab:i is associated with zab:i ≥ ya:i − yb:i and α²ab:i with zab:i ≥ yb:i − ya:i.
▸ β = {βa | a ∈ {1 . . . n}}, where βa is associated with ∑_i ya:i = 1.
▸ γ = {γa:i | a ∈ {1 . . . n}, i ∈ L}, where γa:i is associated with ya:i ≥ 0.
▸ The matrices A and B are used to compactly write the dual function.
▸ α always appears in the product Aα ⇒ space complexity is linear.

min_{α,β,γ} g(α, β, γ) = (λ/2) ‖Aα + Bβ + γ − φ‖² + ⟨Aα + Bβ + γ − φ, y^k⟩ − ⟨1, β⟩,

s.t. γa:i ≥ 0, ∀a ∈ {1 . . . n}, ∀i ∈ L,

α ∈ C = { α | α¹ab:i + α²ab:i = Kab/2 and α¹ab:i, α²ab:i ≥ 0, ∀a, b ≠ a, i ∈ L }.

Optimizing over β and γ

▸ β: Unconstrained ⇒ set the derivative to zero.
▸ γ: Unbounded and separable ⇒ an m-dimensional QP for each pixel, where m is the number of labels.
▸ Efficiently optimized using the multiplicative iteration algorithm of [8] (≈ 2 milliseconds).

For each pixel a, the QP has the form

min_{γa ≥ 0} (1/2) γaᵀ Q γa + ⟨γa, Q((Aαᵗ)a − φa) + y^k_a⟩,

where Q ∈ IR^{m×m}.

Optimizing over α

Conditional gradient descent: min_{α∈C} g(α), where g is differentiable and C is convex and compact.

Conditional gradient computation:
▸ Minimize the first-order Taylor approximation.

Step size computation:
▸ δ = 2/(t + 2) or by line search. (Image from [6].)

In our case:
▸ The conditional gradient has the same form as the subgradient of the LP [2].
▸ The optimal step size can be computed analytically.
▸ These benefits are due to our choice of quadratic regularization.
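A generic conditional-gradient (Frank-Wolfe) loop with the δ = 2/(t + 2) step can be sketched as follows, here over the probability simplex as a stand-in for the set C; `frank_wolfe` and its arguments are illustrative:

```python
import numpy as np

# Frank-Wolfe: at each step, minimize the linear (first-order Taylor)
# approximation over the compact convex set, then move toward that minimizer
# with step size 2/(t + 2). Over the simplex, the linear minimizer is the
# vertex with the smallest gradient coordinate.
def frank_wolfe(grad, x0, iters=200):
    x = x0.copy()
    for t in range(iters):
        s = np.zeros_like(x)
        s[np.argmin(grad(x))] = 1.0        # conditional gradient (a vertex of C)
        x += 2.0 / (t + 2.0) * (s - x)     # convex combination stays feasible
    return x
```

Every iterate is a convex combination of vertices, so feasibility never needs to be enforced explicitly, which is what makes the method attractive for the structured set C.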

∀a ∈ {1 . . . n}: v′a = ∑_b Kab vb 1[ya ≥ yb], where ya, yb ∈ [0, 1].

E.g., n = 3:

    [v′1]   [ K11 K12 K13 ] [v1]
    [v′2] = [  0  K22 K23 ] [v2]
    [v′3]   [  0   0  K33 ] [v3]

Linear time conditional gradient computation:
Difficulty: The original permutohedral lattice based filtering method of [1] cannot handle the ordering constraint.

Existing method [3]: Repeated application of the filtering method using a divide-and-conquer strategy ⇒ O(d²n log n) computations (d = 2 or 5).

Our idea: Discretize the interval [0, 1] into H levels and instantiate H permutohedral lattices ⇒ O(Hdn) computations (H = 10).

Modified filtering method:
▸ Creates a single permutohedral lattice and reuses it H times ⇒ cache efficient.
▸ Substantial speedup (> 40×) compared to the state-of-the-art method of [3].
▸ Our modified filtering method has wide applicability beyond what is shown in this work.
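The H-level idea can be demonstrated on a toy problem where the "filter" is an explicit matrix-vector product. One filter call per level plus a running prefix sum reproduces the ordering-constrained product v′a = ∑_b Kab vb 1[ya ≥ yb] (up to discretization of y); all names here are illustrative:

```python
import numpy as np

# Discretize y in [0, 1] into H buckets; filter each bucket's values
# separately and accumulate a running prefix. A point whose y_a falls in
# bucket h then receives the contributions of all b with bucket(y_b) <= h,
# approximating 1[y_a >= y_b] with H filter calls. Here the "filter" is an
# explicit matvec; the real method reuses one permutohedral lattice H times.
def ordered_matvec_discretized(K, v, y, H=10):
    levels = np.minimum((y * H).astype(int), H - 1)   # bucket of each y_b
    out = np.zeros_like(v)
    prefix = np.zeros_like(v)
    for h in range(H):
        prefix = prefix + K @ (v * (levels == h))     # add level-h contributions
        out[levels == h] = prefix[levels == h]        # readout for bucket-h points
    return out
```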

Segmentation Results

Datasets: Pascal VOC 2010 and MSRC.

[Figure: Energy vs. time plots for an image in MSRC and Pascal, comparing MF, DCneg, SG-LPℓ and PROX-LPacc.]

[Figure: Energy, time, segmentation accuracy and IoU score of the different algorithms (MF, DCneg, SG-LPℓ, PROX-LPacc) on the MSRC (top) and Pascal (bottom) datasets.]

▸ Both LP minimization algorithms are initialized with DCneg.

[Figure: Qualitative results, columns: Image, MF, DCneg, SG-LPℓ, PROX-LPacc, Ground truth.]

Modified filtering method:

[Figure: Speedup of our modified filtering method over that of [3], as a function of the number of pixels and the number of labels, for the spatial kernel (d = 2) and the bilateral kernel (d = 5). The speedup is around 45–65 on a Pascal image.]

Discussion

▸ We have introduced an efficient and accurate algorithm for energy minimization in dense CRFs.
▸ Our algorithm can be incorporated in an end-to-end learning framework to further improve the accuracy of deep semantic segmentation architectures.

Code: https://github.com/oval-group/DenseCRF

[1] Adams et al., Computer Graphics Forum, 2010.
[2] Bach, SIAM, 2015.
[3] Desmaison et al., ECCV, 2016.
[4] Kleinberg and Tardos, Journal of the ACM, 2002.
[5] Krähenbühl and Koltun, NIPS, 2011.
[6] Lacoste-Julien et al., ICML, 2012.
[7] Manokaran et al., STOC, 2008.
[8] Xiao et al., Numerical Linear Algebra with Applications, 2014.