A bootstrap framework for low-rank matrix estimation

Julie Josse – Stefan Wager
Laboratoire de statistique, Agrocampus Ouest, Rennes, France
Statistics Department, Stanford University

Malgorzata Bogdan group meeting, Wroclaw University of Technology, 31 October 2014



Outline:

Low-rank matrix estimation
A bootstrap framework
Results
Other noise

Low-rank matrix estimation

⇒ Model: X ∈ R^{n×p} ∼ L(µ) with E[X] = µ of low rank k

⇒ Gaussian noise model: X = µ + ε, with ε_ij iid ∼ N(0, σ²)

Applications: images, collaborative filtering, ...

⇒ Classical solution: the truncated SVD

µ̂_k = ∑_{l=1}^{k} u_l d_l v_l^⊤
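Not part of the original slides: a minimal numpy sketch of the truncated-SVD estimator µ̂_k, on an illustrative low-rank-plus-noise matrix (sizes and noise level are arbitrary).

```python
import numpy as np

def truncated_svd(X, k):
    """Rank-k truncated SVD estimator: mu_hat_k = sum_{l<=k} u_l d_l v_l^T."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# illustrative rank-3 signal plus Gaussian noise
mu = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
X = mu + 0.1 * rng.standard_normal((50, 20))
mu_hat = truncated_svd(X, k=3)
assert np.linalg.matrix_rank(mu_hat) == 3
```

With noise this small, the rank-3 reconstruction is much closer to µ than the raw observation X.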

Shrinking - thresholding SVD

⇒ Hard thresholding: d_l · 1(l ≤ k) = d_l · 1(d_l ≥ τ)

µ̂_k = argmin_µ {‖X − µ‖₂² : rank(µ) ≤ k}

Selecting k? Cross-validation (Josse & Husson, 2012). Chatterjee (2013) and Donoho & Gavish (2014) give an optimal hard threshold with better asymptotic MSE than truncating at the true k.

⇒ Soft thresholding: d_l max(1 − τ/d_l, 0)

µ̂_λ = argmin_µ {‖X − µ‖₂² + λ‖µ‖_*}

Selecting λ? Stein's Unbiased Risk Estimate (Candès et al., 2012)
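The two thresholding rules above can be sketched as scalar maps applied to the singular values (a small illustration, not from the slides; the threshold value is arbitrary):

```python
import numpy as np

def svd_apply(X, psi):
    """Rebuild X after applying a scalar rule psi to its singular values."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(psi(d)) @ Vt

def hard(d, tau):
    return d * (d >= tau)          # keep d_l only when d_l >= tau

def soft(d, tau):
    return np.maximum(d - tau, 0)  # d_l * max(1 - tau/d_l, 0) = max(d_l - tau, 0)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))
Xh = svd_apply(X, lambda d: hard(d, 3.0))
Xs = svd_apply(X, lambda d: soft(d, 3.0))
```

Soft thresholding shrinks every retained singular value by τ, while hard thresholding leaves the retained ones untouched.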


Shrinking - thresholding SVD

⇒ Nonlinear shrinkage: µ̂_shrink = ∑_{l=1}^{min{n,p}} u_l ψ(d_l) v_l^⊤

• Shabalin & Nobel (2013); Gavish & Donoho (2014). Asymptotics n = n_p and p → ∞, n_p/p → β, 0 < β ≤ 1:

ψ(d_l) = (1/d_l) √((d_l² − β − 1)² − 4β) · 1(d_l ≥ 1 + √β)

• Verbanck, Josse & Husson (2013). Asymptotics n, p fixed, σ → 0:

ψ(d_l) = d_l ((d_l² − σ²)/d_l²) · 1(l ≤ k)    (signal variance / total variance)


Equivalently, ψ(d_l) = (d_l − σ²/d_l) · 1(l ≤ k).
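A sketch of the two shrinkage functions, not from the slides. The asymptotic shrinker is written here assuming singular values normalized to unit noise scale, and the function names are illustrative:

```python
import numpy as np

def psi_asym(d, beta):
    """Shabalin-Nobel / Gavish-Donoho style shrinker (unit noise scale assumed)."""
    keep = d >= 1.0 + np.sqrt(beta)
    val = np.sqrt(np.maximum((d**2 - beta - 1.0)**2 - 4.0 * beta, 0.0)) / np.maximum(d, 1e-12)
    return np.where(keep, val, 0.0)

def psi_low_noise(d, sigma, k):
    """Verbanck-Josse-Husson shrinker: keep the first k values, scale by (d^2 - sigma^2)/d^2."""
    out = np.zeros_like(d, dtype=float)
    out[:k] = d[:k] * (d[:k] ** 2 - sigma**2) / d[:k] ** 2
    return out

d = np.array([5.0, 3.0, 1.5])
assert psi_asym(d, 1.0)[2] == 0.0            # below the 1 + sqrt(beta) cutoff
assert np.all(psi_asym(d, 1.0)[:2] < d[:2])  # retained values are shrunk
```

Both rules shrink the retained singular values below their observed size, by more when the value is close to the noise level.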

Shrinking - thresholding SVD

• Josse & Sardy (2014). Adaptive estimator, finite sample:

ψ(d_l) = d_l max(1 − τ^γ/d_l^γ, 0)

µ̂ = argmin_µ {‖X − µ‖₂² + λ‖µ‖_{*,w}}

Selecting (τ, γ)? Data dependent: SURE - GSURE (σ unknown)

• Bayesian SVD models:
  • Hoff (2009): uniform priors on U and V, d_l ∼ N(0, s²γ)
  • Todeschini et al. (2013): each d_l has its own s²γ_l, hierarchical priors

A bootstrap framework

SVD via the autoencoder

Model: X = µ + ε, with ε_ij iid ∼ N(0, σ²)

⇒ Classical least-squares formulation:
• LS: µ̂_k = argmin_µ {‖X − µ‖₂² : rank(µ) ≤ k}
• Truncated SVD: µ̂_k = ∑_{l=1}^{k} u_l d_l v_l^⊤
• Shrinkage to better recover µ (remark: LS = maximum likelihood, not the best in MSE)

⇒ Another formulation: the autoencoder (Bourlard & Kamp, 1988)

µ̂_k = X B_k, where B_k = argmin_B {‖X − XB‖₂² : rank(B) ≤ k}

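The rank-k autoencoder problem is solved by projecting onto the top-k right singular vectors, B_k = V_k V_k^⊤, which gives back the truncated SVD. A quick numerical check of this equivalence (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 10))
k = 3
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# autoencoder solution: B_k = V_k V_k^T, so X B_k projects X onto the top-k right singular vectors
B_k = Vt[:k].T @ Vt[:k]
mu_ae = X @ B_k
mu_tsvd = U[:, :k] @ np.diag(d[:k]) @ Vt[:k]
assert np.allclose(mu_ae, mu_tsvd)
```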

Parametric Bootstrap

⇒ Autoencoder: compress X

µ̂_k = X B_k, where B_k = argmin_B {‖X − XB‖₂² : rank(B) ≤ k}

⇒ Aim: recover µ

µ̂*_k = X B*_k, B*_k = argmin_B {E_{X∼L(µ)}[‖µ − XB‖₂²] : rank(B) ≤ k}

⇒ Parametric bootstrap: replace L(µ) by L(X)

µ̂^boot_k = X B̂_k, B̂_k = argmin_B {E_{X̃∼L(X)}[‖X − X̃B‖₂²] : rank(B) ≤ k}


Stable Autoencoder

Model: X = µ + ε, with ε_ij iid ∼ N(0, σ²)

µ̂^boot_k = X B̂_k, B̂_k = argmin_B {E_{X̃∼L(X)}[‖X − X̃B‖₂²] : rank(B) ≤ k}
         = argmin_B {E_ε[‖X − (X + ε)B‖₂²] : rank(B) ≤ k}

⇒ Solution: a singular-value shrinkage estimator

µ̂^boot_k = ∑_{l=1}^{k} u_l (d_l / (1 + λ/d_l²)) v_l^⊤, with λ = nσ²
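A numerical sanity check, not from the slides (sizes and σ are arbitrary): the singular-value shrinkage form above coincides with the ridge-autoencoder closed form X(X^⊤X + λI)^{−1}X^⊤X at λ = nσ².

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 30, 12, 0.5
X = rng.standard_normal((n, p))
lam = n * sigma**2

# singular-value shrinkage form: sum_l u_l * d_l / (1 + lam/d_l^2) * v_l^T (all l here)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
mu_shrink = U @ np.diag(d / (1.0 + lam / d**2)) @ Vt

# ridge-autoencoder form: X (X^T X + lam I)^{-1} X^T X
mu_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X)
assert np.allclose(mu_shrink, mu_ridge)
```

Truncating the sum at k adds the rank constraint on top of the shrinkage.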

Feature Noising = Ridge = Shrinkage (Proof)

E_ε[‖X − (X + ε)B‖₂²] = ‖X − XB‖₂² + E_ε[‖εB‖₂²]
                       = ‖X − XB‖₂² + ∑_{i,j,k} B²_{jk} Var[ε_ij]
                       = ‖X − XB‖₂² + nσ² ‖B‖₂²

Let X = UDV^⊤ be the SVD of X and λ = nσ². Then

B̂_λ = V B̃_λ V^⊤, where B̃_λ = argmin_B {‖D − DB‖₂² + λ‖B‖₂²}

B̃_ii = argmin_{B_ii} {(1 − B_ii)² D_ii² + λ B_ii²} = D_ii² / (λ + D_ii²)

µ̂_λ = ∑_i U_{·i} (D_ii / (1 + λ/D_ii²)) V_{·i}^⊤
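The first identity in the proof can be checked by Monte Carlo (an illustration, not from the slides; sizes, σ, and the fixed B are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 25, 8, 0.6
X = rng.standard_normal((n, p))
B = 0.2 * rng.standard_normal((p, p))

# Monte Carlo estimate of the feature-noising objective E ||X - (X + eps) B||^2
M = 4000
vals = [np.linalg.norm(X - (X + sigma * rng.standard_normal((n, p))) @ B) ** 2 for _ in range(M)]
mc = np.mean(vals)

# closed form: ||X - XB||^2 + n sigma^2 ||B||^2
closed = np.linalg.norm(X - X @ B) ** 2 + n * sigma**2 * np.linalg.norm(B) ** 2
assert abs(mc - closed) / closed < 0.05
```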

Feature Noising = Ridge = Shrinkage

⇒ Bootstrap = feature noising:

µ̂^boot_k = X B̂_k, B̂_k = argmin_B {E_ε[‖X − (X + ε)B‖₂²] : rank(B) ≤ k}

⇒ Equivalent to a ridge autoencoder problem:

µ̂^(k)_λ = X B̂_λ, B̂_λ = argmin_B {‖X − XB‖₂² + λ‖B‖₂² : rank(B) ≤ k}

The ridge estimator is robust to small perturbations of the data.

µ̂^boot_k = ∑_{l=1}^{k} u_l (d_l / (1 + λ/d_l²)) v_l^⊤, with λ = nσ²

µ̂_λ = X B̂_λ, B̂_λ = (X^⊤X + S)^{−1} X^⊤X, S = diag(λ)


Feature noising and regularization in regression

Bishop (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation.

β̂ = argmin_β {E_{ε_ij iid ∼ N(0, σ²)}[‖Y − (X + ε)β‖₂²]}

Generating many noisy copies of the data and averaging out the auxiliary noise is equivalent to ridge regularization with λ = nσ²:

β̂^(R)_λ = argmin_β {‖Y − Xβ‖₂² + λ‖β‖₂²}

⇒ Control overfitting by artificially corrupting the training data
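Bishop's equivalence can be seen numerically by averaging the normal equations over many noisy copies of X and comparing with the ridge solution (an illustration, not from the slides; sizes, σ, and the coefficient vector are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sigma = 100, 5, 0.7
X = rng.standard_normal((n, p))
Y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.standard_normal(n)
lam = n * sigma**2

# ridge solution with lambda = n sigma^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# average the noised normal equations over many corrupted copies of X
M = 3000
G = np.zeros((p, p))
g = np.zeros(p)
for _ in range(M):
    Xn = X + sigma * rng.standard_normal((n, p))
    G += Xn.T @ Xn / M
    g += Xn.T @ Y / M
beta_noise = np.linalg.solve(G, g)
assert np.linalg.norm(beta_noise - beta_ridge) / np.linalg.norm(beta_ridge) < 0.05
```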

Drop-out training

Drop-out (Hinton et al., 2012) randomly omits subsets of features at each iteration of a training algorithm; it improves neural networks.

Wager et al. (2013). Dropout training as adaptive regularization.
• GLMs: equivalence between noising schemes and regularization
• Full potential with drop-out noise: X_ij = 0 with probability δ, X_ij/(1 − δ) otherwise → a nice penalty (e.g. logistic regression with rare features)

Wager et al. (2014). Altitude training: bounds for single-layer networks.
• "Like a marathon runner who practices at altitude: once a classifier learns to perform well on training examples corrupted by dropout, it will do very well on the uncorrupted test set."
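The drop-out noising scheme above is unbiased: rescaling the surviving entries by 1/(1 − δ) keeps the mean equal to X. A small check (not from the slides; sizes and δ are arbitrary):

```python
import numpy as np

def dropout_noise(X, delta, rng):
    """Zero each entry with probability delta; rescale survivors by 1/(1 - delta)."""
    mask = rng.random(X.shape) >= delta
    return X * mask / (1.0 - delta)

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 10))
avg = np.mean([dropout_noise(X, 0.3, rng) for _ in range(2000)], axis=0)

# the rescaling makes the noising unbiased: E[dropout(X)] = X
assert np.linalg.norm(avg - X) / np.linalg.norm(X) < 0.05
```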

A centered parametric bootstrap

⇒ Aim: recover µ

µ̂*_k = X B*_k, B*_k = argmin_B {E_{X∼L(µ)}[‖µ − XB‖₂²] : rank(B) ≤ k}

⇒ X as a proxy for µ:

µ̂^boot_k = X B̂_k, B̂_k = argmin_B {E_{X̃∼L(X)}[‖X − X̃B‖₂²] : rank(B) ≤ k}

⇒ But X is bigger than µ! Use µ̂ = X B̂ itself as the proxy for µ:

B̂ = argmin_B {E_{µ̃∼L(µ̂)}[‖µ̂ − µ̃B‖₂²]}

A centered parametric bootstrap

µ̂_free = X B̂, B̂ = argmin_B {E_{µ̃∼L(µ̂)}[‖µ̂ − µ̃B‖₂²]}

Iterative algorithm. Initialization: µ̂ = X. Repeat:
1. B̂ = argmin_B {E_{µ̃∼L(µ̂)}[‖µ̂ − µ̃B‖₂²]} = (µ̂^⊤µ̂ + S)^{−1} µ̂^⊤µ̂
2. µ̂ = X B̂

⇒ µ̂_free is automatically of low rank!

Results

Simulation design

Simulated data: X = µ + ε.

Parameters varied:
• size: n = 200, p = 500
• rank k: 10, 100
• SNR: 4, 2, 1, 0.5

Estimators:
• TSVD: truncated SVD at k or at τ* (Donoho & Gavish, 2013)
• OS: optimal shrinkage (Shabalin & Nobel, 2013)
• SVST: singular-value soft thresholding
• RIDGE: µ̂ = X(X^⊤X + diag(nσ²))^{−1}X^⊤X, bootstrap with X
• FREE: the iterative estimator, bootstrap with µ̂

Simulation results

MSE:

  k    SNR   RIDGE   FREE    TSVD(k)  TSVD(τ)  OS      SVST
  10   4     0.004   0.004   0.004    0.004    0.004   0.008
  100  4     0.037   0.036   0.038    0.038    0.037   0.045
  10   2     0.017   0.017   0.017    0.016    0.017   0.033
  100  2     0.142   0.143   0.152    0.158    0.146   0.156
  10   1     0.067   0.067   0.072    0.072    0.067   0.116
  100  1     0.511   0.775   0.733    0.856    0.600   0.448
  10   0.5   0.277   0.251   0.321    0.321    0.250   0.353
  100  0.5   1.600   1.000   3.164    1.000    0.961   0.852

Estimated rank:

  k    SNR   FREE    TSVD(τ)  OS     SVST
  10   4     10      10       10     65
  100  4     100     100      100    193
  10   2     10      10       10     63
  100  2     100     100      100    181
  10   1     10      10       10     59
  100  1     29.6    38       64     154
  10   0.5   10      10       10     51
  100  0.5   0       0        15     86

Simulation results

⇒ Different noise regimes:
• low noise: TSVD (the rank selected by soft thresholding is too big)
• moderate noise: OS, RIDGE, FREE
• high noise (low SNR, large k): SVST

⇒ An adaptive estimator is needed.

⇒ FREE performs well in MSE and, as a by-product, accurately estimates the rank!

Remark: not the usual behavior. For n > p the rows are more perturbed than the columns; for n < p the columns are more perturbed; n = p is symmetric.

Partial conclusion

⇒ The parametric bootstrap is a flexible framework for turning a noise model into a regularized matrix estimator.

⇒ Gaussian noise: singular-value shrinkage.

⇒ Gaussian noise is not always appropriate; the procedure is most useful outside the Gaussian framework.

Other noise

Count data

⇒ Model: X ∈ R^{n×p} ∼ L(µ) with E[X] = µ of low rank k

⇒ Poisson noise model: X_ij ∼ Poisson(µ_ij)

⇒ Binomial noising: X_ij ∼ (1/(1 − δ)) Binomial(µ_ij, 1 − δ)

⇒ µ̂_k = ∑_{l=1}^{k} u_l d_l v_l^⊤

⇒ Bootstrap estimator = noising: X̃_ij ∼ (1/(1 − δ)) Binomial(X_ij, 1 − δ)

µ̂^boot = X B̂, B̂ = argmin_B {E_{X̃∼L(X)}[‖X − X̃B‖₂²]}

⇒ The estimator is robust to subsampling of the observations used to build X.
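The binomial noising scheme can be sketched as a count-thinning operation; it is unbiased, with per-entry noising variance X_ij · δ/(1 − δ). A small check (not from the slides; sizes and δ are arbitrary):

```python
import numpy as np

def binomial_noise(X, delta, rng):
    """Thin each count: keep each unit w.p. 1 - delta, rescale to preserve the mean."""
    return rng.binomial(X, 1.0 - delta) / (1.0 - delta)

rng = np.random.default_rng(5)
X = rng.poisson(10.0, size=(30, 20))
reps = np.stack([binomial_noise(X, 0.3, rng) for _ in range(2000)])

# unbiased: E[X_tilde] = X; noising variance: Var[X_tilde_ij] = X_ij * delta/(1-delta)
assert np.linalg.norm(reps.mean(axis=0) - X) / np.linalg.norm(X) < 0.02
```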

Bootstrap estimators

⇒ Feature noising = regularization:

B̂ = argmin_B {‖X − XB‖₂² + ‖S^{1/2}B‖₂²}, with S_jj = ∑_{i=1}^{n} Var_{X̃∼L(X)}[X̃_ij]

µ̂ = X(X^⊤X + (δ/(1 − δ)) S_R)^{−1} X^⊤X, with S_R diagonal, (S_R)_jj = ∑_i X_ij

⇒ A new estimator µ̂ that does not reduce to singular-value shrinkage.

⇒ FREE estimator (iterative algorithm):
1. B̂ = argmin_B {E_{µ̃∼L(µ̂)}[‖µ̂ − µ̃B‖₂²]} = (µ̂^⊤µ̂ + S)^{−1} µ̂^⊤µ̂
2. µ̂ = X B̂

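A sketch of the count-noise bootstrap estimator in closed form (not from the slides; sizes and δ are arbitrary). The penalty matrix S is built from the column totals of X, as implied by S_jj = ∑_i Var[X̃_ij] with Var[X̃_ij] = X_ij δ/(1 − δ); the check below confirms that the closed form does minimize the penalized objective:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.poisson(5.0, size=(40, 25)).astype(float)
delta = 0.5

# S_jj = sum_i Var[X_tilde_ij] = delta/(1-delta) * sum_i X_ij (column totals of X)
s = delta / (1.0 - delta) * X.sum(axis=0)
B = np.linalg.solve(X.T @ X + np.diag(s), X.T @ X)
mu_hat = X @ B

def obj(B):
    # ||X - XB||_F^2 + ||S^{1/2} B||_F^2
    return np.linalg.norm(X - X @ B) ** 2 + (s[:, None] * B**2).sum()

# the closed form solves the normal equations of the penalized problem
assert obj(B) < obj(np.eye(25)) and obj(B) < obj(np.zeros((25, 25)))
```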

First results

Simulated data: X_ij ∼ Poisson(µ_ij)

Design:
• n = 20; p = 50
• d1 = 489.49, d2 = 72.21, d3 = 6.75

Comparison: RV coefficient (Escoufier, 1976), a correlation between matrices.

           MSE    d1      d2     d3     RV U   RV V   k
  TSVD     1.57   489.28  73.47  20.46
  SHRINK   1.44   489.14  72.51  16.83
  FREE     1.12   488.43  71.84  5.85   0.80   0.77   2.71

⇒ The singular vectors are not the same.

Regularized Correspondence Analysis

⇒ Correspondence Analysis:

M = R^{−1/2} (X − (1/N) r c^⊤) C^{−1/2}, where R = diag(r), C = diag(c)

µ̂^CA_k = R^{1/2} M̂_k C^{1/2} + (1/N) r c^⊤

⇒ Regularized CA:

X̃_ij ∼ (1/(1 − δ)) Binomial(X_ij, 1 − δ)

B̂_λ = argmin_B {‖M − MB‖₂² + (δ/(1 − δ)) ‖S_M^{1/2} B‖₂²}

where S_M is diagonal with (S_M)_jj = c_j^{−1} ∑_{i=1}^{n} Var[X̃_ij]/r_i

Regularized Correspondence Analysis

⇒ Population data: the perfume data set
• 12 luxury perfumes described by 39 words; N = 1075
• d1 = 0.44, d2 = 0.15

⇒ Sample data: N = 400, same proportions.

           d1     d2     RV U   RV V   k   RV row   RV col
  CA       0.59   0.31   0.83   0.75       0.91     0.71
  SHRINK   0.35   0.10   0.83   0.75       0.93     0.74
  FREE     0.42   0.12   0.86   0.77   2   0.94     0.75

Regularized Correspondence Analysis

[Figure: factor maps of the perfume data. Left: CA, Dim 1 (60.18%), Dim 2 (39.82%). Right: regularized CA, Dim 1 (62.43%), Dim 2 (37.57%). Words (floral, fruity, strong, sugary, ...) and perfumes (Angel, Chanel 5, Shalimar, ...) are plotted on both maps.]

Regularized Correspondence Analysis

[Figure: factor maps of the perfumes with cluster memberships. Left: Dim 1 (65.41%), Dim 2 (34.59%), four clusters. Right: Dim 1 (77.13%), Dim 2 (22.87%), three clusters.]

Discussion

• FREE: a good denoiser with automatic rank selection.
  ⇒ No free lunch: σ or δ still has to be chosen.
• Convergence of the iterative algorithm?
• Regularized CA.
• Extension to tensors: Hoff (2013), Bayesian treatment of Tucker decomposition methods with hierarchical priors, an empirical Bayes approach.

⇒ The SVD: so many points of view!