
Robust PCA

CS5240 Theoretical Foundations in Multimedia

Leow Wee Kheng

Department of Computer Science

School of Computing

National University of Singapore


Previously...

not robust against outliers     robust against outliers
---------------------------     -----------------------
linear least squares            trimmed least squares
PCA                             trimmed PCA

Various ways to robustify PCA:

◮ trimming: remove outliers

◮ covariance matrix with 0-1 weight [Xu95]: similar to trimming

◮ weighted SVD [Gabriel79]: weighting

◮ robust error function [Torre2001]: winsorizing

Strength: simple concepts. Weakness: no guarantee of optimality.

Robust PCA

PCA can be formulated as follows:

Given an m×n data matrix D, recover a low-rank matrix A from D such that the error E = D − A is minimized:

min_{A,E}  ‖E‖_F,  subject to  rank(A) ≤ r,  D = A + E. (1)

◮ r ≪ min(m,n) is the target rank of A.

◮ ‖·‖_F is the Frobenius norm.

Notes:

◮ This definition of PCA includes dimensionality reduction.

◮ PCA is severely affected by large-amplitude noise; not robust.


[Wright2009] formulated the Robust PCA problem as follows:

Given a data matrix D = A+E where A and E are unknown

but A is low-rank and E is sparse, recover A.

An obvious way to state the robust PCA problem in math is:

Given a data matrix D, find A and E that solve the problem

min_{A,E}  rank(A) + λ‖E‖_0,  subject to  A + E = D. (2)

◮ λ is a Lagrange multiplier.

◮ ‖E‖_0: l0-norm, the number of non-zero elements in E. E is sparse if ‖E‖_0 is small.

[Figure: the data matrix D decomposed as the sum of a low-rank matrix A and a sparse error matrix E.]

◮ Problem 2 is a matrix recovery problem.

◮ rank(A) and ‖E‖_0 are not continuous and not convex; the problem is very hard to solve, and no efficient algorithm exists.


[Candes2011, Wright2009] reformulated Problem 2 as follows:

Given an m×n data matrix D, find A and E that solve

min_{A,E}  ‖A‖_* + λ‖E‖_1,  subject to  A + E = D. (3)

◮ ‖A‖_*: nuclear norm, sum of the singular values of A; surrogate for rank(A).

◮ ‖E‖_1: l1-norm, sum of the absolute values of the elements of E; surrogate for ‖E‖_0.

The solution of Problem 3 recovers A and E exactly if

◮ A is sufficiently low-rank but not sparse, and

◮ E is sufficiently sparse but not low-rank,

with optimal λ = 1/√max(m,n).

◮ ‖A‖∗ and ‖E‖1 are convex; can apply convex optimization.

Introduction to Convex Optimization

For a differentiable function f(x), its minimizer x is given by

df(x)/dx = [ ∂f(x)/∂x_1 · · · ∂f(x)/∂x_m ]⊤ = 0. (4)

[Figure: graph of f(x) with its minimum at the point x where df/dx = 0.]


‖E‖_1 is not differentiable when any of its elements is zero!

[Figure: surface plot of f(x) = ‖x‖_1 over x_1, x_2 ∈ [−4, 4]; the surface has non-differentiable creases wherever a coordinate is zero.]

Cannot write the following because they don't exist:

d‖E‖_1/dE,   ∂‖E‖_1/∂e_ij,   d|e_ij|/de_ij.   WRONG! (5)

Fortunately, ‖E‖1 is convex.


A function f(x) is convex if and only if ∀x1,x2, ∀α ∈ [0, 1],

f(αx1 + (1− α)x2) ≤ αf(x1) + (1− α)f(x2). (6)

[Figure: a convex function f(x); the chord joining the points at x_1 and x_2 lies above the graph between them.]


A vector g(x) is a subgradient of convex function f at x if

f(y)− f(x) ≥ g(x)⊤(y − x), ∀y. (7)

[Figure: convex f(x) with a unique supporting line l1 at a differentiable point x_1, and multiple supporting lines l2, l3 at a non-differentiable point x_2.]

◮ At differentiable point x1: one unique subgradient = gradient.

◮ At non-differentiable point x2: multiple subgradients.


The subdifferential ∂f(x) is the set of all subgradients of f at x:

∂f(x) = { g(x) | g(x)⊤(y − x) ≤ f(y) − f(x), ∀y }. (8)

Caution:

◮ The subdifferential ∂f(x) is not partial differentiation ∂f(x)/∂x. Don't mix them up.

◮ Partial differentiation is defined for differentiable functions. It is a single vector.

◮ The subdifferential is defined for convex functions, which may be non-differentiable. It is a set of vectors.


Example: Absolute value.

|x| is not differentiable at x = 0 but subdifferentiable at x = 0:

|x| =  x if x > 0;   0 if x = 0;   −x if x < 0.

∂|x| =  {+1} if x > 0;   [−1,+1] if x = 0;   {−1} if x < 0. (9)

[Figure: left, the graph of the absolute value |x| with slopes −1 and +1; right, its subdifferential, equal to {−1} for x < 0, the full interval [−1, +1] at x = 0, and {+1} for x > 0.]


Example: l1-norm of an m-D vector x.

‖x‖_1 = Σ_{i=1}^m |x_i|,    ∂‖x‖_1 = Σ_{i=1}^m ∂|x_i|. (10)

Caution: the right-hand side of Eq. 10 is set addition. This gives a product of m sets, one for each element x_i:

∂‖x‖_1 = J_1 × · · · × J_m,   J_i =  {+1} if x_i > 0;   [−1,+1] if x_i = 0;   {−1} if x_i < 0. (11)

Alternatively, we can write ∂‖x‖1 as

∂‖x‖_1 = { g }  such that  g_i = +1 if x_i > 0;   g_i ∈ [−1,+1] if x_i = 0;   g_i = −1 if x_i < 0. (12)


Another alternative:

∂‖x‖_1 = { g }  such that  g_i = sgn(x_i) if |x_i| > 0;   |g_i| ≤ 1 if x_i = 0. (13)

Here are some examples of subdifferentials of the 2-D l1-norm:

∂f(0,0) = [−1,1] × [−1,1],   ∂f(1,0) = {1} × [−1,1],   ∂f(1,1) = {(1,1)}.
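To make Eq. 13 concrete, here is a minimal NumPy sketch (our own illustration; the function name is hypothetical) that returns one valid subgradient of ‖x‖_1, picking g_i = sgn(x_i) at nonzero entries and g_i = 0 ∈ [−1, +1] at zeros:

```python
import numpy as np

def l1_subgradient(x):
    """Return one valid subgradient of ||x||_1 at x (Eq. 13).

    np.sign gives +1 or -1 at nonzero entries and 0 at zeros;
    since 0 lies in [-1, +1], the result is always a member of
    the subdifferential of the l1-norm."""
    return np.sign(np.asarray(x, dtype=float))

print(l1_subgradient([1.0, 0.0]))  # [1. 0.], an element of {1} x [-1, 1]
```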


Optimality Condition

x is a minimum of f if 0 ∈ ∂f(x), i.e., 0 is a subgradient of f .

[Figure: a convex f(x) whose minimum x admits the subgradient g(x) = 0.]

For more details on convex functions and convex optimization, refer to[Bertsekas2003, Boyd2004, Rockafellar70, Shor85].

Back to Robust PCA

Shrinkage or soft threshold operator:

T_ε(x) =  x − ε if x > ε;   x + ε if x < −ε;   0 otherwise. (14)

[Figure: graph of T_ε(x); zero on [−ε, ε], and shifted toward zero by ε outside that interval.]
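A minimal NumPy sketch of Eq. 14 (our illustration; the name shrink is an assumption), using the identity T_ε(x) = sgn(x)·max(|x| − ε, 0):

```python
import numpy as np

def shrink(x, eps):
    """Soft-thresholding operator T_eps of Eq. 14, applied elementwise.

    For x > eps this returns x - eps, for x < -eps it returns x + eps,
    and 0 otherwise, which equals sign(x) * max(|x| - eps, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)
```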


Using convex optimization, the following minimizers are shown [Cai2010, Hale2007]:

For an m×n matrix M with SVD M = USV⊤,

U T_ε(S) V⊤ = argmin_X  ε‖X‖_* + (1/2)‖X − M‖²_F, (15)

T_ε(M) = argmin_X  ε‖X‖_1 + (1/2)‖X − M‖²_F. (16)
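A sketch of the Eq. 15 minimizer (our illustration; the name svt is an assumption). Since singular values are non-negative, T_ε reduces to max(s − ε, 0) on them:

```python
import numpy as np

def svt(M, eps):
    """Singular value thresholding: the minimizer U T_eps(S) V^T of Eq. 15."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Broadcasting U * s' multiplies column i of U by s'_i, i.e. U diag(s') V^T.
    return (U * np.maximum(s - eps, 0.0)) @ Vt
```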

Solving Robust PCA

There are several ways to solve robust PCA (Problem 3) [Lin2009, Wright2009]:

◮ principal component pursuit

◮ iterative thresholding

◮ accelerated proximal gradient

◮ augmented Lagrange multipliers

Augmented Lagrange Multipliers

Consider the problem

min_x  f(x)  subject to  c_j(x) = 0,  j = 1, . . . , m. (17)

This is a constrained optimization problem.

Lagrange multipliers method reformulates Problem 17 as:

min_x  f(x) + Σ_{j=1}^m λ_j c_j(x) (18)

with some constants λ_j.


Penalty method reformulates Problem 17 as:

min_x  f(x) + µ Σ_{j=1}^m c_j²(x). (19)

◮ Parameter µ increases over iterations, e.g., by a factor of 10.

◮ Need µ → ∞ to get a good solution.

Augmented Lagrange multipliers method combines Lagrange multipliers and penalty:

min_x  f(x) + Σ_{j=1}^m λ_j c_j(x) + (µ/2) Σ_{j=1}^m c_j²(x). (20)

Denote λ = [λ_1 · · · λ_m]⊤ and c(x) = [c_1(x) · · · c_m(x)]⊤. Then Problem 20 becomes

min_x  f(x) + λ⊤c(x) + (µ/2)‖c(x)‖². (21)


If the constraints form a matrix C = [cjk], then define Λ = [λjk].

Then Problem 20 becomes

min_x  f(x) + ⟨Λ, C(x)⟩ + (µ/2)‖C(x)‖²_F. (22)

◮ ⟨Λ, C⟩ is the sum of products of corresponding elements:

⟨Λ, C⟩ = Σ_j Σ_k λ_jk c_jk.

◮ ‖C‖_F is the Frobenius norm:

‖C‖²_F = Σ_j Σ_k c_jk².


Augmented Lagrange Multipliers (ALM) Method

1. Initialize Λ, µ > 0, ρ ≥ 1.
2. Repeat until convergence:
   2.1 Compute x = argmin_x L(x), where
       L(x) = f(x) + ⟨Λ, C(x)⟩ + (µ/2)‖C(x)‖²_F.
   2.2 Update Λ = Λ + µ C(x).
   2.3 Update µ = ρµ.

What kind of optimization algorithm is this?

ALM does not need µ → ∞ to get a good solution; that means it can converge faster.
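A toy sketch of steps 2.1–2.3 on an assumed example problem (ours, not from the slides): minimize f(x) = (x − 2)² subject to c(x) = x − 1 = 0, whose constrained optimum is x = 1 with multiplier λ = 2. Here step 2.1 is solved in closed form:

```python
def alm_toy(lmbda=0.0, mu=1.0, rho=2.0, iters=20):
    """ALM on: minimize (x - 2)^2 subject to x - 1 = 0 (hypothetical example)."""
    x = 0.0
    for _ in range(iters):
        # Step 2.1: minimize L(x) = (x-2)^2 + lmbda*(x-1) + (mu/2)*(x-1)^2.
        # Closed form from dL/dx = 2(x-2) + lmbda + mu*(x-1) = 0:
        x = (4.0 - lmbda + mu) / (2.0 + mu)
        # Step 2.2: update the multiplier with the constraint violation.
        lmbda += mu * (x - 1.0)
        # Step 2.3: increase the penalty parameter.
        mu *= rho
    return x, lmbda

print(alm_toy())  # approaches (1.0, 2.0) without needing mu -> infinity
```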


With ALM, robust PCA (Problem 3) is reformulated as

min_{A,E}  ‖A‖_* + λ‖E‖_1 + ⟨Y, D − A − E⟩ + (µ/2)‖D − A − E‖²_F, (23)

◮ Y are the Lagrange multipliers.

◮ The constraints are C = D − A − E.

To implement Step 2.1 of ALM, we need to find the minima over A and E. Adopt alternating optimization.

Trace of Matrix

The trace of an n×n matrix A is the sum of its diagonal elements a_ii:

tr(A) = Σ_{i=1}^n a_ii. (24)

Strictly speaking, a scalar c and a 1×1 matrix [c] are not the same thing. Nevertheless, since tr([c]) = c, we often write, for simplicity, tr(c) = c.

Properties

◮ tr(A) = Σ_{i=1}^n λ_i, where λ_i are the eigenvalues of A.

◮ tr(A+B) = tr(A) + tr(B)

◮ tr(cA) = c tr(A)

◮ tr(A) = tr(A⊤)


◮ tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC)

◮ tr(X⊤Y) = tr(XY⊤) = tr(Y⊤X) = tr(YX⊤) = Σ_i Σ_j x_ij y_ij
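A quick numerical check of the last identity (our illustration):

```python
import numpy as np

# Verify tr(X^T Y) = sum_ij x_ij * y_ij on random 3x4 matrices.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
assert np.isclose(np.trace(X.T @ Y), np.sum(X * Y))
```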


From Problem 23, the minimal E with other variables fixed is given by

min_E  λ‖E‖_1 + ⟨Y, D − A − E⟩ + (µ/2)‖D − A − E‖²_F

⇒ min_E  λ‖E‖_1 + tr(Y⊤(D − A − E)) + (µ/2)‖D − A − E‖²_F

⇒ min_E  λ‖E‖_1 + tr(−Y⊤E) + (µ/2) tr((D − A − E)⊤(D − A − E))

... (Homework)

⇒ min_E  λ‖E‖_1 + (µ/2) tr((E − (D − A + Y/µ))⊤(E − (D − A + Y/µ)))

⇒ min_E  λ‖E‖_1 + (µ/2)‖E − (D − A + Y/µ)‖²_F

⇒ min_E  (λ/µ)‖E‖_1 + (1/2)‖E − (D − A + Y/µ)‖²_F. (25)


Compare Eq. 16,

T_ε(M) = argmin_X  ε‖X‖_1 + (1/2)‖X − M‖²_F,

and Problem 25,

min_E  (λ/µ)‖E‖_1 + (1/2)‖E − (D − A + Y/µ)‖²_F.

Set X = E and M = D − A + Y/µ. Then

E = T_{λ/µ}(D − A + Y/µ). (26)


From Problem 23, the minimal A with other variables fixed is given by

min_A  ‖A‖_* + ⟨Y, D − A − E⟩ + (µ/2)‖D − A − E‖²_F

⇒ min_A  ‖A‖_* + tr(Y⊤(D − A − E)) + (µ/2)‖D − A − E‖²_F

... (Homework)

⇒ min_A  (1/µ)‖A‖_* + (1/2)‖A − (D − E + Y/µ)‖²_F. (27)

Compare Problem 27 and Eq. 15,

U T_ε(S) V⊤ = argmin_X  ε‖X‖_* + (1/2)‖X − M‖²_F.

Set X = A and M = D − E + Y/µ. Then

A = U T_{1/µ}(S) V⊤. (28)


Robust PCA for Matrix Recovery via ALM

Inputs: D.
1. Initialize A = 0, E = 0.
2. Initialize Y, µ > 0, ρ > 1.
3. Repeat until convergence:
4.    Repeat until convergence:
5.       U, S, V = SVD(D − E + Y/µ).
6.       A = U T_{1/µ}(S) V⊤.
7.       E = T_{λ/µ}(D − A + Y/µ).
8.    Update Y = Y + µ(D − A − E).
9.    Update µ = ρµ.
Outputs: A, E.

(A code sketch of this algorithm follows the initialization notes below.)


Typical initialization [Lin2009]:

◮ Y = sgn(D)/J(sgn(D)).

◮ sgn(D) gives sign of each matrix element of D.

◮ J(·) gives scaling factors:

J(X) = max(‖X‖2, λ⁻¹‖X‖∞).

◮ ‖X‖2 is spectral norm, largest singular value of X.

◮ ‖X‖∞ is largest absolute value of elements of X.

◮ µ = 1.25/‖D‖2.

◮ ρ = 1.5.

◮ λ = 1/√max(m,n) for m×n matrix D.
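Putting the algorithm and the typical initialization together, here is a minimal NumPy sketch (assumptions of ours: the inexact variant with one pass of steps 5–7 per outer iteration, and a stopping rule on ‖D − A − E‖_F; the function name is hypothetical):

```python
import numpy as np

def robust_pca(D, lam=None, rho=1.5, tol=1e-7, max_iter=500):
    """Sketch of RPCA matrix recovery via ALM with [Lin2009]-style defaults."""
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    # Y = sgn(D) / J(sgn(D)), with J(X) = max(||X||_2, ||X||_inf / lam).
    S = np.sign(D)
    Y = S / max(np.linalg.norm(S, 2), np.abs(S).max() / lam)
    mu = 1.25 / np.linalg.norm(D, 2)   # initial penalty parameter
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(max_iter):
        # Step 6: A-update by singular value thresholding (Eq. 28).
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Step 7: E-update by elementwise shrinkage (Eq. 26).
        G = D - A + Y / mu
        E = np.sign(G) * np.maximum(np.abs(G) - lam / mu, 0.0)
        # Steps 8-9: multiplier and penalty updates.
        R = D - A - E
        Y = Y + mu * R
        mu = rho * mu
        if np.linalg.norm(R, 'fro') <= tol * np.linalg.norm(D, 'fro'):
            break
    return A, E
```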


Example: Recovery of video background.

[Figure: video frames stacked as columns to form the data matrix D.]


Sample results:

[Figure: data matrix, low-rank component, and error component for two sample frames.]


Example: Removal of specular reflection and shadow.

[Figure: data D, low-rank A, and error E for specular reflection and shadow removal.]

Fixed-Rank Robust PCA

In reflection removal, reflection may be global.

[Figure: ground truth and input with a global reflection.]

Then E is not sparse: this violates the RPCA condition! But rank(A) = 1.

Fix the rank of A to deal with non-sparse E [Leow2013].

Fixed-Rank Robust PCA via ALM

Inputs: D.

1. Initialize A = 0, E = 0.
2. Initialize Y, µ > 0, ρ > 1.
3. Repeat until convergence:
4.    Repeat until convergence:
5.       U, S, V = SVD(D − E + Y/µ).
6.       If rank(T_{1/µ}(S)) < r, then A = U T_{1/µ}(S) V⊤; else A = U S_r V⊤.
7.       E = T_{λ/µ}(D − A + Y/µ).
8.    Update Y = Y + µ(D − A − E).
9.    Update µ = ρµ.
Outputs: A, E.

S_r is S with the last m − r singular values set to 0.
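A sketch of the modified step 6, under the same assumptions as the matrix-recovery sketch above (the function name is ours):

```python
import numpy as np

def fixed_rank_A_update(D, E, Y, mu, r):
    """Fixed-rank A-update: threshold as usual if that already gives
    rank < r; otherwise keep the r largest singular values of S (S_r)."""
    U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
    s_thr = np.maximum(s - 1.0 / mu, 0.0)            # T_{1/mu}(S)
    if np.count_nonzero(s_thr) < r:
        return (U * s_thr) @ Vt                      # A = U T_{1/mu}(S) V^T
    s_r = np.where(np.arange(len(s)) < r, s, 0.0)    # S_r: top r values kept
    return (U * s_r) @ Vt                            # A = U S_r V^T
```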


Example: Removal of local reflections.

[Figure: ground truth, input, FRPCA result, and RPCA result.]


Example: Removal of global reflections.

[Figure: ground truth, input, FRPCA result, and RPCA result.]


Example: Removal of global reflections.

[Figure: ground truth, input, FRPCA result, and RPCA result.]


Example: Background recovery for traffic video: fast moving vehicles.

[Figure: input, FRPCA result, and RPCA result.]


Example: Background recovery for traffic video: slow moving vehicles.

[Figure: input, FRPCA result, and RPCA result.]


Example: Background recovery for traffic video: temporary stop.

[Figure: input, FRPCA result, and RPCA result.]

Matrix Completion

Customers are asked to rate the movies from 1 (poor) to 5 (excellent).

A  5 5 3 2
B  3 4 3
C  3 4 5 4
D  4 5 5 3
...

Customers rate only some movies ⇒ some data are missing. How to estimate the missing data? Matrix completion.


Let D denote the data matrix with missing elements set to 0, and M = {(i, j)} denote the indices of the missing elements in D.

Then, the matrix completion problem can be formulated as

Given D and M , find matrix A that solves the problem

min_A  ‖A‖_*  subject to  A + E = D,  E_ij = 0 ∀ (i, j) ∉ M. (29)

◮ For (i, j) ∉ M, constrain E_ij = 0 so that A_ij = D_ij; no change.

◮ For (i, j) ∈ M, D_ij = 0, i.e., A_ij = −E_ij; recovered value.

Reformulating Problem 29 using augmented Lagrange multipliers gives

min_A  ‖A‖_* + ⟨Y, D − A − E⟩ + (µ/2)‖D − A − E‖²_F (30)

such that E_ij = 0 ∀ (i, j) ∉ M.

Robust PCA for Matrix Completion

Inputs: D.

1. Initialize A = 0, E = 0.
2. Initialize Y, µ > 0, ρ > 1.
3. Repeat until convergence:
4.    Repeat until convergence:
5.       U, S, V = SVD(D − E + Y/µ).
6.       A = U T_{1/µ}(S) V⊤.
7.       E = Γ_M(D − A + Y/µ), where Γ_M(X)_ij = X_ij for (i, j) ∈ M, and 0 for (i, j) ∉ M.
8.    Update Y = Y + µ(D − A − E).
9.    Update µ = ρµ.
Outputs: A, E.
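A sketch of step 7 (our illustration; `missing` is an assumed boolean mask that is True on M):

```python
import numpy as np

def completion_E_update(D, A, Y, mu, missing):
    """Gamma_M keeps entries of D - A + Y/mu only at missing positions,
    so E_ij = 0 is enforced wherever D_ij is observed."""
    G = D - A + Y / mu
    return np.where(missing, G, 0.0)
```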


Example: Recovery of occluded parts in face images.

[Figure: input, ground truth, matrix completion result, and matrix recovery result.]

Summary

Robustification of PCA

◮ PCA + trimming → trimmed PCA

◮ PCA + covariance with 0−1 weights → trimmed PCA

◮ PCA + weighted SVD → weighted PCA

◮ PCA + robust error function → winsorized PCA

Robust PCA

◮ PCA → low-rank + sparse matrix decomposition → robust PCA (matrix recovery)

◮ robust PCA + fixed rank of A → fixed-rank robust PCA

◮ robust PCA + fixed elements of E → robust PCA (matrix completion)

Probing Questions

◮ If the data matrix of a problem is composed of a low-rank matrix, a sparse matrix and something else, can you still use robust PCA methods? If yes, how? If no, why?

◮ In applications of robust PCA to high-resolution colour image processing, the data matrix contains three times as many rows as the number of pixels in the images, which can lead to a very large data matrix that takes a long time to process. Suggest a way to overcome this problem.

◮ In applications of robust PCA to video processing, the data matrix contains as many columns as the number of video frames, which can lead to a very large data matrix that exceeds the available memory. Suggest a way to overcome this problem.

Homework

1. Show that tr(X⊤Y) = Σ_i Σ_j x_ij y_ij.

2. Show that the following two optimization problems are equivalent:

min_E  λ‖E‖_1 + tr(−Y⊤E) + (µ/2) tr((D − A − E)⊤(D − A − E))

min_E  λ‖E‖_1 + (µ/2) tr((E − (D − A + Y/µ))⊤(E − (D − A + Y/µ)))

3. Show that the minimal A of Problem 23 with other variables fixed is given by

min_A  (1/µ)‖A‖_* + (1/2)‖A − (D − E + Y/µ)‖²_F.

References

1. D. P. Bertsekas. Convex Analysis and Optimization. Athena Scientific, 2003.

2. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

3. J. F. Cai, E. J. Candes, Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

4. E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.

5. F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In Proc. Int. Conf. Computer Vision, 2001.

6. R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.


7. K. Gabriel and S. Zamir. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21:489–498, 1979.

8. E. T. Hale, W. Yin, Y. Zhang. A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. Tech. Report TR07-07, Dept. of Comp. and Applied Math., Rice University, 2007.

9. W. K. Leow, Y. Cheng, L. Zhang, T. Sim, and L. Foo. Background recovery by fixed-rank robust principal component analysis. In Proc. Int. Conf. on Computer Analysis of Images and Patterns, 2013.

10. Z. Lin, M. Chen, L. Wu, and Y. Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. Tech. Report UILU-ENG-09-2215, UIUC, 2009.

11. N. Z. Shor. Minimization Methods for Non-Differentiable Functions. Springer-Verlag, 1985.


12. J. Wright, Y. Peng, Y. Ma, A. Ganesh, and S. Rao. Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization. In Proc. Conf. on Neural Information Processing Systems, 2009.

13. L. Xu and A. Yuille. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Trans. Neural Networks, 6(1):131–143, 1995.
