
Random Matrices in Machine Learning

Romain COUILLET

CentraleSupélec, University of Paris-Saclay, France
GSTATS IDEX DataScience Chair, GIPSA-lab, University Grenoble–Alpes, France

June 21, 2018

1 / 113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

2 / 113


Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 5/113

Context

Baseline scenario: $y_1, \ldots, y_n \in \mathbb{C}^p$ (or $\mathbb{R}^p$) i.i.d. with $E[y_1] = 0$, $E[y_1 y_1^*] = C_p$:

▶ If $y_1 \sim \mathcal{N}(0, C_p)$, the ML estimator for $C_p$ is the sample covariance matrix (SCM)
$$\hat{C}_p = \frac{1}{n} Y_p Y_p^* = \frac{1}{n} \sum_{i=1}^n y_i y_i^*, \qquad Y_p = [y_1, \ldots, y_n] \in \mathbb{C}^{p \times n}.$$
▶ If $n \to \infty$, then, by the strong law of large numbers,
$$\hat{C}_p \xrightarrow{a.s.} C_p,$$
or equivalently, in spectral norm,
$$\big\| \hat{C}_p - C_p \big\| \xrightarrow{a.s.} 0.$$

Random Matrix Regime

▶ No longer valid if $p, n \to \infty$ with $p/n \to c \in (0, \infty)$:
$$\big\| \hat{C}_p - C_p \big\| \not\to 0.$$
▶ For practical $p, n$ with $p \simeq n$, this leads to dramatically wrong conclusions.

5 / 113
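A minimal Python sketch of this non-convergence, assuming $C_p = I_p$ and $p/n = 1/2$ (both choices arbitrary): the spectral-norm error of the SCM stays bounded away from zero as $p, n$ grow together.

```python
import numpy as np

rng = np.random.default_rng(0)
for p in (50, 200, 800):
    n = 2 * p                                    # p/n fixed at c = 1/2
    Y = rng.standard_normal((p, n))              # y_i ~ N(0, I_p), so C_p = I_p
    C_hat = Y @ Y.T / n                          # sample covariance matrix
    err = np.linalg.norm(C_hat - np.eye(p), 2)   # spectral norm ||C_hat - C_p||
    print(p, err)                                # stays around (1 + sqrt(c))^2 - 1, it does not vanish
```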

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 6/113

The Marcenko–Pastur law

[Plot: density versus the eigenvalues of $\hat{C}_p$; empirical eigenvalue distribution against the Marcenko–Pastur law.]

Figure: Histogram of the eigenvalues of $\hat{C}_p$ for p = 500, n = 2000, $C_p = I_p$.

6 / 113

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 7/113

The Marcenko–Pastur law

Definition (Empirical Spectral Density)
The empirical spectral density (e.s.d.) $\mu_p$ of a Hermitian matrix $A_p \in \mathbb{C}^{p \times p}$ is
$$\mu_p = \frac{1}{p} \sum_{i=1}^p \delta_{\lambda_i(A_p)}.$$

Theorem (Marcenko–Pastur Law [Marcenko, Pastur'67])
Let $X_p \in \mathbb{C}^{p \times n}$ have i.i.d. zero mean, unit variance entries. As $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, the e.s.d. $\mu_p$ of $\frac{1}{n} X_p X_p^*$ satisfies
$$\mu_p \xrightarrow{a.s.} \mu_c$$
weakly, where
▶ $\mu_c(\{0\}) = \max\{0, 1 - c^{-1}\}$
▶ on $(0, \infty)$, $\mu_c$ has a continuous density $f_c$ supported on $[(1 - \sqrt{c})^2, (1 + \sqrt{c})^2]$,
$$f_c(x) = \frac{1}{2\pi c x} \sqrt{(x - (1 - \sqrt{c})^2)((1 + \sqrt{c})^2 - x)}.$$

7 / 113
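A minimal numerical check of the theorem, assuming Gaussian entries and $p/n = 1/4$ (arbitrary choices): the eigenvalue histogram of $\frac{1}{n} X_p X_p^*$ should match the density $f_c$ above.

```python
import numpy as np
import matplotlib.pyplot as plt

p, n = 500, 2000                         # c = p/n = 1/4
rng = np.random.default_rng(0)
X = rng.standard_normal((p, n))          # i.i.d. zero mean, unit variance entries
eigs = np.linalg.eigvalsh(X @ X.T / n)

c = p / n
lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
x = np.linspace(lo, hi, 500)
fc = np.sqrt(np.maximum((x - lo) * (hi - x), 0)) / (2 * np.pi * c * x)

plt.hist(eigs, bins=50, density=True, label="empirical eigenvalue distribution")
plt.plot(x, fc, label="Marcenko-Pastur law")
plt.legend()
plt.show()
```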

Basics of Random Matrix Theory/Motivation: Large Sample Covariance Matrices 8/113

The Marcenko–Pastur law

[Plot: density $f_c(x)$ versus $x$, for c = 0.1, c = 0.2, c = 0.5.]

Figure: Marcenko–Pastur law for different limit ratios $c = \lim_{p \to \infty} p/n$.

8 / 113


Basics of Random Matrix Theory/Spiked Models 10/113

Spiked Models

Small rank perturbation: $C_p = I_p + P$, $P$ of low rank.

[Plot: eigenvalues $\{\lambda_i\}_{i=1}^n$ and limiting law $\mu$; isolated spikes near $1 + \omega_1 + c\,\frac{1 + \omega_1}{\omega_1}$ and $1 + \omega_2 + c\,\frac{1 + \omega_2}{\omega_2}$.]

Figure: Eigenvalues of $\frac{1}{n} Y_p Y_p^*$, $C_p = \mathrm{diag}(\underbrace{1, \ldots, 1}_{p-4}, 2, 2, 3, 3)$, p = 500, n = 1500.

10 / 113

Basics of Random Matrix Theory/Spiked Models 11/113

Spiked Models

Theorem (Eigenvalues [Baik, Silverstein'06])
Let $Y_p = C_p^{\frac12} X_p$, with
▶ $X_p$ with i.i.d. zero mean, unit variance entries, $E[|X_p|_{ij}^4] < \infty$,
▶ $C_p = I_p + P$, $P = U \Omega U^*$, where, for $K$ fixed,
$$\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_K) \in \mathbb{R}^{K \times K}, \quad \omega_1 \geq \ldots \geq \omega_K > 0.$$
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, denoting $\lambda_m = \lambda_m(\frac{1}{n} Y_p Y_p^*)$ ($\lambda_m > \lambda_{m+1}$),
$$\lambda_m \xrightarrow{a.s.} \begin{cases} 1 + \omega_m + c\,\frac{1 + \omega_m}{\omega_m} > (1 + \sqrt{c})^2, & \omega_m > \sqrt{c} \\ (1 + \sqrt{c})^2, & \omega_m \in (0, \sqrt{c}]. \end{cases}$$

11 / 113
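A minimal sketch of the spike location, assuming a single spike $\omega_1 = 1$, $p/n = 1/3$ (arbitrary choices, with $\omega_1 > \sqrt{c}$ so the spike separates from the bulk):

```python
import numpy as np

p, n = 500, 1500                            # c = p/n = 1/3
omega = 1.0                                 # one spike, omega > sqrt(c)
rng = np.random.default_rng(0)

u = np.zeros(p); u[0] = 1.0                 # spike direction
C = np.eye(p) + omega * np.outer(u, u)      # C_p = I_p + omega u u^*
Y = np.linalg.cholesky(C) @ rng.standard_normal((p, n))
lam = np.linalg.eigvalsh(Y @ Y.T / n)[::-1]

c = p / n
print("largest eigenvalue      :", lam[0])
print("predicted spike         :", 1 + omega + c * (1 + omega) / omega)
print("right bulk edge (1+√c)^2:", (1 + np.sqrt(c)) ** 2)
```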

Basics of Random Matrix Theory/Spiked Models 12/113

Spiked Models

Theorem (Eigenvectors [Paul'07])
Let $Y_p = C_p^{\frac12} X_p$, with
▶ $X_p$ with i.i.d. zero mean, unit variance entries, $E[|X_p|_{ij}^4] < \infty$,
▶ $C_p = I_p + P$, $P = U \Omega U^* = \sum_{i=1}^K \omega_i u_i u_i^*$, $\omega_1 > \ldots > \omega_K > 0$.
Then, as $p, n \to \infty$ with $p/n \to c \in (0, \infty)$, for $a, b \in \mathbb{C}^p$ deterministic and $\hat{u}_i$ the eigenvector associated with $\lambda_i(\frac{1}{n} Y_p Y_p^*)$,
$$a^* \hat{u}_i \hat{u}_i^* b - \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}}\, a^* u_i u_i^* b \cdot 1_{\omega_i > \sqrt{c}} \xrightarrow{a.s.} 0.$$
In particular,
$$|\hat{u}_i^* u_i|^2 \xrightarrow{a.s.} \frac{1 - c\,\omega_i^{-2}}{1 + c\,\omega_i^{-1}} \cdot 1_{\omega_i > \sqrt{c}}.$$

12 / 113
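A minimal sketch of the eigenvector alignment, assuming a single spike $\omega_1 = 1.5$ and $p/n = 1/3$ (arbitrary choices):

```python
import numpy as np

p, n = 400, 1200                             # c = p/n = 1/3
omega = 1.5
rng = np.random.default_rng(1)

u = np.zeros(p); u[0] = 1.0                  # population spike eigenvector
C = np.eye(p) + omega * np.outer(u, u)
Y = np.linalg.cholesky(C) @ rng.standard_normal((p, n))
w, V = np.linalg.eigh(Y @ Y.T / n)
u_hat = V[:, -1]                             # eigenvector of the largest eigenvalue

c = p / n
limit = (1 - c / omega**2) / (1 + c / omega) if omega > np.sqrt(c) else 0.0
print("|u_hat^T u|^2 :", float(np.dot(u_hat, u) ** 2))
print("limit         :", limit)
```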

Basics of Random Matrix Theory/Spiked Models 13/113

Spiked Models

[Plot: $|\hat{u}_1^* u_1|^2$ versus the population spike $\omega_1$, for p = 100, 200, 400, against the limit $\frac{1 - c/\omega_1^2}{1 + c/\omega_1}$.]

Figure: Simulated versus limiting $|\hat{u}_1^* u_1|^2$ for $Y_p = C_p^{\frac12} X_p$, $C_p = I_p + \omega_1 u_1 u_1^*$, $p/n = 1/3$, varying $\omega_1$.

13 / 113

Basics of Random Matrix Theory/Spiked Models 14/113

Other Spiked Models

Similar results hold for multiple matrix models:

▶ $Y_p = \frac{1}{n} (I + P)^{\frac12} X_p X_p^* (I + P)^{\frac12}$
▶ $Y_p = \frac{1}{n} X_p X_p^* + P$
▶ $Y_p = \frac{1}{n} X_p^* (I + P) X_p$
▶ $Y_p = \frac{1}{n} (X_p + P)^* (X_p + P)$
▶ etc.

14 / 113


Applications/Reminder on Spectral Clustering Methods 17/113

Reminder on Spectral Clustering Methods

Context: Two-step classification of n objects based on a similarity matrix $A \in \mathbb{R}^{n \times n}$:

[Diagram: spectrum of $A$ with isolated spikes away from the bulk at 0 ⇒ leading eigenvectors (Eigenv. 1, Eigenv. 2), in practice observed shuffled.]

17 / 113

Applications/Reminder on Spectral Clustering Methods 18/113

Reminder on Spectral Clustering Methods

[Diagram: leading eigenvectors (Eigenv. 1, Eigenv. 2) ⇒ ℓ-dimensional representation of the data (shuffling no longer matters; scatter plot of Eigenvector 1 versus Eigenvector 2) ⇒ EM or k-means clustering.]

18 / 113
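A minimal sketch of this two-step procedure on a toy two-class dataset, using an RBF similarity matrix and its normalized form $D^{-1/2} K D^{-1/2}$ (all modeling choices here are arbitrary illustrations):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# toy data: two Gaussian classes in R^p
p, n = 50, 400
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (n // 2, p)), rng.normal(+1, 1, (n // 2, p))])

# similarity matrix (RBF kernel) and normalized version D^{-1/2} K D^{-1/2}
K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * p))
d = K.sum(axis=1)
A = K / np.sqrt(np.outer(d, d))

# step 1: l-dimensional representation from the leading eigenvectors
w, V = np.linalg.eigh(A)
embedding = V[:, -2:]                       # two leading eigenvectors

# step 2: k-means clustering on the embedding
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embedding)
print("cluster sizes:", np.bincount(labels))
```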


Applications/Kernel Spectral Clustering 20/113

Kernel Spectral Clustering

Problem Statement
▶ Dataset $x_1, \ldots, x_n \in \mathbb{R}^p$
▶ Objective: "cluster" the data into k similarity classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$.
▶ Kernel spectral clustering is based on the kernel matrix
$$K = \{\kappa(x_i, x_j)\}_{i,j=1}^n$$
▶ Usually, $\kappa(x, y) = f(x^T y)$ or $\kappa(x, y) = f(\|x - y\|^2)$
▶ Refinements:
  ▶ instead of $K$, use $D - K$, $I_n - D^{-1} K$, $I_n - D^{-\frac12} K D^{-\frac12}$, etc.
  ▶ multi-step algorithms: Ng–Jordan–Weiss, Shi–Malik, etc.

Intuition (from small dimensions)
▶ $K$ is essentially low rank, with class structure in the eigenvectors.
▶ Ng–Jordan–Weiss key remark: $D^{-\frac12} K D^{-\frac12} (D^{\frac12} j_a) \simeq D^{\frac12} j_a$ ($j_a$ the canonical vector of class $\mathcal{C}_a$).

20 / 113

Applications/Kernel Spectral Clustering 21/113

Kernel Spectral Clustering

[Plot: entries of the leading four eigenvectors of $D^{-\frac12} K D^{-\frac12}$ against the sample index.]

Figure: Leading four eigenvectors of $D^{-\frac12} K D^{-\frac12}$ for MNIST data, RBF kernel (f(t) = exp(−t²/2)).

▶ Important Remark: the eigenvectors are informative BUT far from $D^{\frac12} j_a$!

21 / 113

Applications/Kernel Spectral Clustering 22/113

Model and Assumptions

Gaussian mixture model:
▶ $x_1, \ldots, x_n \in \mathbb{R}^p$,
▶ k classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$,
▶ $x_1, \ldots, x_{n_1} \in \mathcal{C}_1$, ..., $x_{n - n_k + 1}, \ldots, x_n \in \mathcal{C}_k$,
▶ $x_i \sim \mathcal{N}(\mu_{g_i}, C_{g_i})$.

Assumption (Growth Rate)
As $n \to \infty$,
1. Data scaling: $\frac{p}{n} \to c_0 \in (0, \infty)$, $\frac{n_a}{n} \to c_a \in (0, 1)$,
2. Mean scaling: with $\mu \triangleq \sum_{a=1}^k \frac{n_a}{n} \mu_a$ and $\mu_a^\circ \triangleq \mu_a - \mu$, then $\|\mu_a^\circ\| = O(1)$,
3. Covariance scaling: with $C \triangleq \sum_{a=1}^k \frac{n_a}{n} C_a$ and $C_a^\circ \triangleq C_a - C$, then
$$\|C_a\| = O(1), \quad \mathrm{tr}\, C_a^\circ = O(\sqrt{p}), \quad \mathrm{tr}\, C_a^\circ C_b^\circ = O(p).$$

For 2 classes, this is
$$\|\mu_1 - \mu_2\| = O(1), \quad \mathrm{tr}\,(C_1 - C_2) = O(\sqrt{p}), \quad \|C_i\| = O(1), \quad \mathrm{tr}\,([C_1 - C_2]^2) = O(p).$$

Remark [Neyman–Pearson optimality]:
▶ $x \sim \mathcal{N}(\pm\mu, I_p)$ (known $\mu$) is decidable iff $\|\mu\| \geq O(1)$.
▶ $x \sim \mathcal{N}(0, (1 \pm \varepsilon) I_p)$ (known $\varepsilon$) is decidable iff $|\varepsilon| \geq O(p^{-\frac12})$.

22 / 113
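A hypothetical two-class sampler matching this non-trivial growth-rate scaling (the specific values 2.0 and 3 are arbitrary illustrations): $\|\mu_1 - \mu_2\| = O(1)$, $\mathrm{tr}(C_1 - C_2) = O(\sqrt{p})$, $\|C_a\| = O(1)$.

```python
import numpy as np

def sample_mixture(p=512, n=2048, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros(p); mu[0] = 2.0                      # O(1) mean separation
    sigma2 = 1 + 3 / np.sqrt(p)                        # tr(C_1 - C_2) = 3*sqrt(p)
    X1 = rng.standard_normal((n // 2, p)) + mu / 2     # C_1 = I_p
    X2 = np.sqrt(sigma2) * rng.standard_normal((n // 2, p)) - mu / 2   # C_2 = sigma2 * I_p
    return np.vstack([X1, X2]), np.r_[np.zeros(n // 2), np.ones(n // 2)]

X, labels = sample_mixture()
print(X.shape, labels.shape)
```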

Applications/Kernel Spectral Clustering 23/113

Model and Assumptions

Kernel Matrix:

▶ Kernel matrix of interest:
$$K = \left\{ f\!\left(\frac{1}{p} \|x_i - x_j\|^2\right) \right\}_{i,j=1}^n$$
for some sufficiently smooth nonnegative $f$ ($f(\frac{1}{p} x_i^T x_j)$ is simpler).
▶ We study the normalized Laplacian:
$$L = n D^{-\frac12} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-\frac12}$$
with $d = K 1_n$, $D = \mathrm{diag}(d)$ (more stable both theoretically and in practice).

23 / 113
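A minimal sketch of these two definitions (the Gaussian profile for $f$ and the data are arbitrary illustrations):

```python
import numpy as np
from scipy.spatial.distance import cdist

def normalized_laplacian(X, f=lambda t: np.exp(-t / 2)):
    """L = n D^{-1/2} (K - d d^T / (d^T 1_n)) D^{-1/2}, with K_ij = f(||x_i - x_j||^2 / p)."""
    n, p = X.shape
    K = f(cdist(X, X, "sqeuclidean") / p)
    d = K.sum(axis=1)                        # d = K 1_n
    K_centered = K - np.outer(d, d) / d.sum()
    return n * K_centered / np.sqrt(np.outer(d, d))   # D^{-1/2} (.) D^{-1/2} entrywise

X = np.random.default_rng(0).standard_normal((200, 100))
L = normalized_laplacian(X)
eigvals, eigvecs = np.linalg.eigh(L)         # leading eigenvectors used for clustering
print(eigvals[-4:])
```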

Applications/Kernel Spectral Clustering 24/113

Random Matrix Equivalent

▶ Key Remark: Under the growth rate assumptions,
$$\max_{1 \leq i \neq j \leq n} \left| \frac{1}{p} \|x_i - x_j\|^2 - \tau \right| \xrightarrow{a.s.} 0, \quad \text{where } \tau = \frac{2}{p}\, \mathrm{tr}\, C.$$
⇒ Suggests that (up to the diagonal) $K \simeq f(\tau)\, 1_n 1_n^T$!
▶ In fact, the information is hidden in low order fluctuations! From a "matrix-wise" Taylor expansion of $K$:
$$K = \underbrace{f(\tau)\, 1_n 1_n^T}_{O_{\|\cdot\|}(n)} \;+\; \underbrace{\sqrt{n}\, K_1}_{\text{low rank, } O_{\|\cdot\|}(\sqrt{n})} \;+\; \underbrace{K_2}_{\text{informative terms, } O_{\|\cdot\|}(1)}$$
Clearly not the (small dimension) expected behavior.

24 / 113
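A minimal numerical sketch of this distance concentration, assuming a single class with $C = I_p$ (so $\tau = 2$) and $n = p$ (arbitrary choices); the maximal deviation shrinks as $p$ grows:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
for p in (100, 400, 800):
    n = p
    X = rng.standard_normal((n, p))             # one class, C = I_p, so tau = 2
    D2 = cdist(X, X, "sqeuclidean") / p         # (1/p) ||x_i - x_j||^2
    off = D2[~np.eye(n, dtype=bool)]            # off-diagonal entries only
    print(p, "max |d_ij/p - tau| =", np.abs(off - 2.0).max())
```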

Applications/Kernel Spectral Clustering 25/113

Random Matrix Equivalent

Theorem (Random Matrix Equivalent [Couillet, Benaych'2015])
As $n, p \to \infty$, $\| L - \hat{L} \| \xrightarrow{a.s.} 0$, where
$$L = n D^{-\frac12} \left( K - \frac{d d^T}{d^T 1_n} \right) D^{-\frac12}, \quad \text{with } K_{ij} = f\!\left(\frac{1}{p} \|x_i - x_j\|^2\right)$$
$$\hat{L} = -2\, \frac{f'(\tau)}{f(\tau)} \left[ \frac{1}{p} P W^T W P + \frac{1}{p} J B J^T + * \right]$$
and $W = [w_1, \ldots, w_n] \in \mathbb{R}^{p \times n}$ ($x_i = \mu_a + w_i$), $P = I_n - \frac{1}{n} 1_n 1_n^T$,
$$J = [j_1, \ldots, j_k], \quad j_a^T = (0, \ldots, 0, 1_{n_a}^T, 0, \ldots, 0)$$
$$B = M^T M + \left( \frac{5 f'(\tau)}{8 f(\tau)} - \frac{f''(\tau)}{2 f'(\tau)} \right) t t^T - \frac{f''(\tau)}{f'(\tau)}\, T + *.$$
Recall $M = [\mu_1^\circ, \ldots, \mu_k^\circ]$, $t = \big[\frac{1}{\sqrt{p}} \mathrm{tr}\, C_1^\circ, \ldots, \frac{1}{\sqrt{p}} \mathrm{tr}\, C_k^\circ\big]^T$, $T = \big\{\frac{1}{p} \mathrm{tr}\, C_a^\circ C_b^\circ\big\}_{a,b=1}^k$.

Fundamental conclusions:

▶ the asymptotic impact of the kernel is only through $f'(\tau)$ and $f''(\tau)$, that's all!
▶ spectral clustering reads $M^T M$, $t t^T$ and $T$, that's all!

25 / 113

Applications/Kernel Spectral Clustering 26/113

Isolated eigenvalues: Gaussian inputs

[Plot: eigenvalues of $L$ (top) and of $\hat{L}$ (bottom).]

Figure: Eigenvalues of $L$ and $\hat{L}$, k = 3, p = 2048, n = 512, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $[\mu_a]_j = 4\delta_{aj}$, $C_a = (1 + 2(a-1)/\sqrt{p}) I_p$, $f(x) = \exp(-x/2)$.

26 / 113

Applications/Kernel Spectral Clustering 27/113

Theoretical Findings versus MNIST

[Plot: histogram of the eigenvalues of $L$ against the eigenvalues of $\hat{L}$ as if for the Gaussian model.]

Figure: Eigenvalues of $L$ (red) and (equivalent Gaussian model) $\hat{L}$ (white), MNIST data, p = 784, n = 192.

27 / 113

Applications/Kernel Spectral Clustering 28/113

Theoretical Findings versus MNIST

Figure: Leading four eigenvectors of $D^{-\frac12} K D^{-\frac12}$ for MNIST data (red) and theoretical findings (blue).

28 / 113

Applications/Kernel Spectral Clustering 29/113

Theoretical Findings versus MNIST

[Scatter plots: Eigenvector 2 versus Eigenvector 1, and Eigenvector 3 versus Eigenvector 2.]

Figure: 2D representation of the eigenvectors of $L$, for the MNIST dataset. Theoretical means and 1- and 2-standard deviations in blue. Class 1 in red, Class 2 in black, Class 3 in green.

29 / 113

Applications/Kernel Spectral Clustering 30/113

The surprising f ′(τ) = 0 case

[Plot: classification error versus $f'(\tau)$, for p = 512, p = 1024, and theory.]

Figure: Polynomial kernel with $f(\tau) = 4$, $f''(\tau) = 2$, $x_i \sim \mathcal{N}(0, C_a)$, with $C_1 = I_p$, $[C_2]_{i,j} = .4^{|i-j|}$, $c_0 = \frac14$.

▶ Trivial classification when $t = 0$, $M = 0$ and $\|T\| = O(1)$.

30 / 113


Applications/Kernel Spectral Clustering: The case f ′(τ) = 0 32/113

Position of the Problem

Problem: Cluster large data $x_1, \ldots, x_n \in \mathbb{R}^p$ based on "spanned subspaces".

Method:

▶ Still assume $x_1, \ldots, x_n$ belong to k classes $\mathcal{C}_1, \ldots, \mathcal{C}_k$.
▶ Zero-mean Gaussian model for the data: for $x_i \in \mathcal{C}_k$,
$$x_i \sim \mathcal{N}(0, C_k).$$
▶ Performance of $L = n D^{-\frac12} \left( K - \frac{1_n 1_n^T}{1_n^T D 1_n} \right) D^{-\frac12}$, with
$$K = \left\{ f\!\left(\|\bar{x}_i - \bar{x}_j\|^2\right) \right\}_{1 \leq i,j \leq n}, \quad \bar{x} = \frac{x}{\|x\|}$$
in the regime $n, p \to \infty$ (alternatively, we can ask $\frac{1}{p} \mathrm{tr}\, C_i = 1$ for all $1 \leq i \leq k$).

32 / 113

Applications/Kernel Spectral Clustering: The case f ′(τ) = 0 33/113

Model and Reminders

Assumption 1 [Classes]. Vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ i.i.d. from a k-class Gaussian mixture, with $x_i \in \mathcal{C}_k \Leftrightarrow x_i \sim \mathcal{N}(0, C_k)$ (sorted by class for simplicity).

Assumption 2a [Growth Rates]. As $n \to \infty$, for each $a \in \{1, \ldots, k\}$,
1. $\frac{n}{p} \to c_0 \in (0, \infty)$
2. $\frac{n_a}{n} \to c_a \in (0, \infty)$
3. $\frac{1}{p} \mathrm{tr}\, C_a = 1$ and $\mathrm{tr}\, C_a^\circ C_b^\circ = O(p)$, with $C_a^\circ = C_a - C$, $C = \sum_{b=1}^k c_b C_b$.

Theorem (Corollary of Previous Section)
Let $f$ be smooth with $f'(2) \neq 0$. Then, under Assumption 2a,
$$L = n D^{-\frac12} \left( K - \frac{1_n 1_n^T}{1_n^T D 1_n} \right) D^{-\frac12}, \quad \text{with } K = \left\{ f\!\left(\|\bar{x}_i - \bar{x}_j\|^2\right) \right\}_{i,j=1}^n \quad (\bar{x} = x/\|x\|)$$
exhibits a phase transition phenomenon, i.e., the leading eigenvectors of $L$ asymptotically contain structural information about $\mathcal{C}_1, \ldots, \mathcal{C}_k$ if and only if
$$T = \left\{ \frac{1}{p} \mathrm{tr}\, C_a^\circ C_b^\circ \right\}_{a,b=1}^k$$
has sufficiently large eigenvalues (here $M = 0$, $t = 0$).

33 / 113

Applications/Kernel Spectral Clustering: The case f ′(τ) = 0 34/113

The case f ′(2) = 0

Assumption 2b [Growth Rates]. As $n \to \infty$, for each $a \in \{1, \ldots, k\}$,
1. $\frac{n}{p} \to c_0 \in (0, \infty)$
2. $\frac{n_a}{n} \to c_a \in (0, \infty)$
3. $\frac{1}{p} \mathrm{tr}\, C_a = 1$ and $\mathrm{tr}\, C_a^\circ C_b^\circ = O(\sqrt{p})$, with $C_a^\circ = C_a - C$, $C = \sum_{b=1}^k c_b C_b$.
(in this regime, the previous kernels clearly fail)

Remark [Neyman–Pearson optimality]:
▶ if $C_i = I_p \pm E$ with $\|E\| \to 0$, detectability iff $\frac{1}{p} \mathrm{tr}\,(C_1 - C_2)^2 \geq O(p^{-\frac12})$.

Theorem (Random Equivalent for f ′(2) = 0)
Let $f$ be smooth with $f'(2) = 0$ and
$$\tilde{L} \equiv \frac{\sqrt{p}\, f(2)}{2 f''(2)} \left[ L - \frac{f(0) - f(2)}{f(2)}\, P \right], \quad P = I_n - \frac{1}{n} 1_n 1_n^T.$$
Then, under Assumption 2b,
$$\tilde{L} = P \Phi P + \left\{ \frac{1}{\sqrt{p}}\, \mathrm{tr}\,(C_a^\circ C_b^\circ)\, \frac{1_{n_a} 1_{n_b}^T}{p} \right\}_{a,b=1}^k + o_{\|\cdot\|}(1)$$
where $\Phi_{ij} = \delta_{i \neq j}\, \sqrt{p}\, \big[ (\bar{x}_i^T \bar{x}_j)^2 - E[(\bar{x}_i^T \bar{x}_j)^2] \big]$.

34 / 113

Applications/Kernel Spectral Clustering: The case f ′(τ) = 0 35/113

The case f ′(2) = 0

[Plot: eigenvalues of $\tilde{L}$, with the two isolated eigenvalues $\lambda_1(\tilde{L})$, $\lambda_2(\tilde{L})$ to the right of the bulk.]

Figure: Eigenvalues of $\tilde{L}$, p = 1000, n = 2000, k = 3, $c_1 = c_2 = 1/4$, $c_3 = 1/2$, $C_i \propto I_p + (p/8)^{-5/4} W_i W_i^T$, $W_i \in \mathbb{R}^{p \times (p/8)}$ with i.i.d. $\mathcal{N}(0,1)$ entries, $f(t) = \exp(-(t-2)^2)$.

⇒ No longer a Marcenko–Pastur-like bulk, but rather a semi-circle bulk!

35 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = 0 36/113

The case f ′(2) = 0

36 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = 0 37/113

The case f ′(2) = 0

Roadmap. We now need to:

I study the spectrum of Φ

I study the isolated eigenvalues of L (and the phase transition)

I retrieve information from the eigenvectors.

Theorem (Semi-circle law for Φ)
Let µn = (1/n) ∑_{i=1}^n δ_{λi(L̂)}. Then, under Assumption 2b,

µn →a.s. µ

with µ the semi-circle distribution

µ(dt) = (1/(2π c0 ω²)) √((4 c0 ω² − t²)⁺) dt,   ω = lim_{p→∞} √2 (1/p) tr(C²).

37 / 113
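A minimal simulation sketch (not from the slides) of this semi-circle bulk, under the reconstruction ω = √2 (1/p) tr(C²) used above: single-class data (no clusters), so only the bulk of PΦP is visible; the covariance model and sizes are arbitrary, and the empirical off-diagonal mean of (x̊iᵀx̊j)² replaces the exact expectation.

```python
# Sketch: bulk of PΦP versus the predicted semi-circle edges ±2√c0·ω.
import numpy as np

p, n = 400, 800
c0 = n / p
rng = np.random.default_rng(1)

# Any covariance with (1/p) tr C = 1 works; a Toeplitz example (arbitrary choice)
C = 0.4 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
C *= p / np.trace(C)

X = np.linalg.cholesky(C) @ rng.standard_normal((p, n))
Xn = X / np.linalg.norm(X, axis=0)                    # x̊_i = x_i / ||x_i||

G2 = (Xn.T @ Xn) ** 2                                 # (x̊_i^T x̊_j)^2
off = ~np.eye(n, dtype=bool)
Phi = np.sqrt(p) * (G2 - G2[off].mean())              # empirical mean replaces E[(x̊_i^T x̊_j)^2]
np.fill_diagonal(Phi, 0.0)

P = np.eye(n) - np.ones((n, n)) / n
eig = np.linalg.eigvalsh(P @ Phi @ P)

omega = np.sqrt(2) * np.trace(C @ C) / p              # reconstructed ω (assumption)
print("empirical edges:", eig.min(), eig.max())
print("semi-circle edges: ±", 2 * np.sqrt(c0) * omega)
```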


Applications/Kernel Spectral Clustering: The case f′(τ) = 0    38/113

The case f′(2) = 0

[Figure: histogram of the eigenvalues of L̂ with the semi-circle law overlaid; the isolated eigenvalues λ1(L̂) and λ2(L̂) are marked.]

Figure: Eigenvalues of L̂, p = 1000, n = 2000, k = 3, c1 = c2 = 1/4, c3 = 1/2, Ci ∝ Ip + (p/8)^{−5/4} Wi Wiᵀ, Wi ∈ R^{p×(p/8)} with i.i.d. N(0, 1) entries, f(t) = exp(−(t − 2)²).

38 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = 0    39/113

The case f′(2) = 0

Denote now

T ≡ lim_{p→∞} { (√(ca cb)/√p) tr(C°a C°b) }_{a,b=1}^k .

Theorem (Isolated Eigenvalues)
Let ν1 ≥ . . . ≥ νk be the eigenvalues of T. Then, if √c0 |νi| > ω, L̂ has an isolated eigenvalue λi satisfying

λi →a.s. ρi ≡ c0 νi + ω²/νi .

39 / 113
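A small numerical illustration (not from the slides) of this spike formula and of the phase-transition condition √c0 |νi| > ω; c0, ω and the eigenvalues νi below are toy values chosen by hand.

```python
# Sketch: which eigenvalues of T produce an isolated eigenvalue, and where.
import numpy as np

c0, omega = 2.0, 1.5                                  # toy values (assumptions)
nu = np.array([2.4, 1.2, -0.3])                       # toy eigenvalues of T

for v in nu:
    if np.sqrt(c0) * abs(v) > omega:
        rho = c0 * v + omega ** 2 / v                 # predicted spike location ρ_i
        print(f"ν = {v:+.2f}: isolated eigenvalue, ρ ≈ {rho:+.3f}")
    else:
        print(f"ν = {v:+.2f}: inside the bulk (no isolated eigenvalue)")
```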

Applications/Kernel Spectral Clustering: The case f′(τ) = 0    40/113

The case f′(2) = 0

Theorem (Isolated Eigenvectors)
For each isolated eigenpair (λi, ui) of L̂ corresponding to (νi, vi) of T, write

ui = ∑_{a=1}^k αia (ja/√na) + σia wia

with ja = [0ᵀ_{n1}, . . . , 1ᵀ_{na}, . . . , 0ᵀ_{nk}]ᵀ, (wia)ᵀ ja = 0, supp(wia) = supp(ja), ‖wia‖ = 1.
Then, under Assumptions 1–2b,

αia αib →a.s. (1 − (1/c0) ω²/νi²) [vi viᵀ]ab,      (σia)² →a.s. (ca/c0) ω²/νi²

and the fluctuations of ui, uj, i ≠ j, are asymptotically uncorrelated.

40 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = 0    41/113

The case f′(2) = 0

[Figure: entries of the leading two eigenvectors of L̂ versus the data index, with the per-class deterministic levels (1/√na)(αia ± σia) overlaid.]

Figure: Leading two eigenvectors of L̂ (or equivalently of L) versus the deterministic approximations of αia ± σia.

41 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = 0    42/113

The case f′(2) = 0

[Figure: scatter plot of the entries of eigenvector 2 against those of eigenvector 1, one point per datum.]

Figure: Leading two eigenvectors of L̂ (or equivalently of L) plotted against each other, versus the deterministic approximations of αia ± σia.

42 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = 0 43/113

Application to Massive MIMO UE Clustering

43 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = 0    44/113

Massive MIMO UE Clustering

Setting. Massive MIMO cell with
I p antenna elements
I n user equipments (UEs) with channels x1, . . . , xn ∈ Rp
I UEs belonging to solid-angle groups, i.e., E[xi] = 0, E[xi xiᵀ] = Ca ≡ C(Θa)
I T independent channel observations xi(1), . . . , xi(T) for UE i.

Objective. Cluster the users belonging to the same solid-angle group (for scheduling reasons, to avoid pilot contamination).

Algorithm (a simulation sketch follows below).
1. Build the kernel matrix K, then L̂, based on the nT vectors x1(1), . . . , xn(T) (as if there were nT values to cluster).
2. Extract the dominant isolated eigenvectors u1, . . . , uκ.
3. For each i, create ūi = (1/T)(In ⊗ 1Tᵀ) ui, i.e., average the eigenvectors along time.
4. Perform k-class clustering on the vectors ū1, . . . , ūκ.

44 / 113
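A simulation sketch of this pipeline (not the authors' code). Everything below is an illustrative assumption: the toy covariances stand in for the solid-angle models C(Θa), the centered kernel PKP is used in place of the slides' matrix L̂, κ = k − 1 eigenvectors are kept, and scikit-learn's KMeans performs the final clustering step.

```python
# Sketch: kernel on the nT observations -> dominant eigenvectors -> T-averaging -> k-means.
import numpy as np
from sklearn.cluster import KMeans          # assumed available

rng = np.random.default_rng(0)
p, n, T, k = 200, 30, 10, 3
groups = np.repeat(np.arange(k), n // k)    # true group of each UE

# Toy per-group covariances (stand-ins for C(Θ_a))
C = []
for a in range(k):
    W = rng.standard_normal((p, p // 10))
    Ca = np.eye(p) + W @ W.T / np.sqrt(p)
    C.append(Ca * p / np.trace(Ca))

# Step 1: nT channel observations, kernel on normalized vectors, centered kernel
X = np.hstack([np.linalg.cholesky(C[g]) @ rng.standard_normal((p, T)) for g in groups])
Xn = X / np.linalg.norm(X, axis=0)
D2 = 2 - 2 * Xn.T @ Xn                      # ||x̊_i - x̊_j||² for unit-norm vectors
K = np.exp(-(D2 - 2) ** 2)                  # f(t) = exp(-(t-2)²), as in the figures
P = np.eye(n * T) - np.ones((n * T, n * T)) / (n * T)
Ktilde = P @ K @ P

# Step 2: dominant eigenvectors
vals, vecs = np.linalg.eigh(Ktilde)
U = vecs[:, -(k - 1):]                      # keep κ = k-1 eigenvectors (a choice)

# Step 3: average each eigenvector over the T observations of each UE
Ubar = U.reshape(n, T, -1).mean(axis=1)     # (1/T)(I_n ⊗ 1_T^T) u_i

# Step 4: k-class clustering
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Ubar)
print("cluster sizes:", np.bincount(labels), "| true sizes:", np.bincount(groups))
```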


Applications/Kernel Spectral Clustering: The case f′(τ) = 0    45/113

Massive MIMO UE Clustering

[Figure: scatter plots of (eigenvector 1, eigenvector 2) entries, before and after T-averaging.]

Figure: Leading two eigenvectors before (left) and after (right) T-averaging. Setting: p = 400, n = 40, T = 10, k = 3, c1 = c3 = 1/4, c2 = 1/2, angular-spread model with angles −π/30 ± π/20, 0 ± π/20, and π/30 ± π/20. Kernel function f(t) = exp(−(t − 2)²).

45 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = 0    46/113

Massive MIMO UE Clustering

[Figure: correct clustering probability versus T, for k-means and EM with oracle or random initialization, compared to random guessing.]

Figure: Overlap for different T, using k-means or EM started either from the actual centroid solutions (oracle) or randomly.

46 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = 0    47/113

Massive MIMO UE Clustering

[Figure: correct clustering probability versus T for the optimal kernel and the Gaussian kernel, with k-means and EM.]

Figure: Overlap for the optimal kernel f(t) (here f(t) = exp(−(t − 2)²)) and the Gaussian kernel f(t) = exp(−t²), for different T, using k-means or EM.

47 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    48/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

48 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    49/113

Optimal growth rates and optimal kernels

Conclusion of the previous analyses:
I kernel f((1/p)‖xi − xj‖²) with f′(τ) ≠ 0:
  I optimal in ‖µ°a‖ = O(1), (1/p) tr C°a = O(p^{−1/2})
  I suboptimal in (1/p) tr(C°a C°b) = O(1)
  −→ Model type: Marcenko–Pastur + spikes.
I kernel f((1/p)‖xi − xj‖²) with f′(τ) = 0:
  I suboptimal in ‖µ°a‖ = O(1) (kills the means)
  I suboptimal in (1/p) tr(C°a C°b) = O(p^{−1/2})
  −→ Model type: smaller-order semi-circle law + spikes.

Jointly optimal solution:
I evenly weigh the Marcenko–Pastur and semi-circle laws
I the "α–β" kernel (an illustrative construction follows below):

  f′(τ) = α/√p,   (1/2) f″(τ) = β.

49 / 113
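A toy construction sketch (not from the slides) of such an α–β kernel: a quadratic in (t − τ) tuned so that f′(τ) = α/√p and (1/2) f″(τ) = β; the function name, the constant offset and all numerical values are arbitrary.

```python
# Sketch: building a kernel function with prescribed f'(τ) = α/√p and f''(τ)/2 = β.
import numpy as np

def alpha_beta_kernel(alpha, beta, tau, p):
    """Return f with f'(tau) = alpha/sqrt(p) and f''(tau)/2 = beta (plus an arbitrary constant)."""
    def f(t):
        return 1.0 + (alpha / np.sqrt(p)) * (t - tau) + beta * (t - tau) ** 2
    return f

p, tau = 784, 2.0
f = alpha_beta_kernel(alpha=4.0, beta=1.0, tau=tau, p=p)

# Numerical check of the two derivative constraints
h = 1e-5
fp = (f(tau + h) - f(tau - h)) / (2 * h)
fpp = (f(tau + h) - 2 * f(tau) + f(tau - h)) / h ** 2
print("f'(τ)·√p ≈", fp * np.sqrt(p), "(target α = 4)")
print("f''(τ)/2 ≈", fpp / 2, "(target β = 1)")
```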


Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    50/113

New assumption setting

I We now consider a fully optimal growth-rate setting.

Assumption (Optimal Growth Rate)
As n → ∞,
1. Data scaling: p/n → c0 ∈ (0, ∞), na/n → ca ∈ (0, 1),
2. Mean scaling: with µ ≜ ∑_{a=1}^k (na/n) µa and µ°a ≜ µa − µ, then ‖µ°a‖ = O(1),
3. Covariance scaling: with C ≜ ∑_{a=1}^k (na/n) Ca and C°a ≜ Ca − C, then ‖C°a‖ = O(1), tr C°a = O(√p), tr(C°a C°b) = O(√p).

Kernel:
I For technical simplicity, we consider

K̃ = PKP = P { f((1/p) x̊iᵀ x̊j) }_{i,j=1}^n P,   P = In − (1/n) 1n 1nᵀ,

i.e., τ replaced by 0.

50 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    51/113

Main Results

Theorem
As n → ∞,

‖ √p (PKP + (f(0) + τ f′(0)) P) − K̂ ‖ →a.s. 0

with, for α = √p f′(0) = O(1) and β = (1/2) f″(0) = O(1),

K̂ = α P WᵀW P + β P Φ P + U A Uᵀ

A = [ α MᵀM + β T    α Ik ;  α Ik    0 ],    U = [ J/√p ,  P WᵀM ],

Φ/√p = { ((ω̊i)ᵀ ω̊j)² δ_{i≠j} }_{i,j=1}^n − { (tr(C°a C°b)/p²) 1na 1nbᵀ }_{a,b=1}^k .

Role of α, β:
I Weighs the Marcenko–Pastur part against the semi-circle part.

51 / 113


Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    52/113

Limiting eigenvalue distribution

Theorem (Eigenvalue Bulk)
As p → ∞,

νn ≜ (1/n) ∑_{i=1}^n δ_{λi(K̂)} →a.s. ν

with ν having Stieltjes transform m(z) solution of

1/m(z) = −z + (α/p) tr C (Ip + (α m(z)/c0) C)^{−1} − (2β²/c0) ω² m(z)

where ω = lim_{p→∞} (1/p) tr(C²).

52 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    53/113

Limiting eigenvalue distribution

[Figure: four histograms of the eigenvalues of K̂ versus the limiting law, for decreasing α/β.]

Figure: Eigenvalues of K̂ (up to recentering) versus the limiting law, p = 2048, n = 4096, k = 2, n1 = n2, µi = 3δi, f(x) = β(x + α/(2β√p))². (Top left): α = 8, β = 1; (Top right): α = 4, β = 3; (Bottom left): α = 3, β = 4; (Bottom right): α = 1, β = 8.

53 / 113

Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    54/113

Asymptotic performance: MNIST

I MNIST is "means-dominant", but not by much! (A sketch for computing these statistics follows below.)

Datasets               ‖µ1 − µ2‖²   (1/√p) tr(C1 − C2)²   (1/p) tr(C1 − C2)²
MNIST (digits 1, 7)    612.7        71.1                  2.5
MNIST (digits 3, 6)    441.3        39.9                  1.4
MNIST (digits 3, 8)    212.3        23.5                  0.8

[Figure: clustering overlap as a function of α/β for the three digit pairs.]

Figure: Spectral clustering of the MNIST database for varying α/β.

54 / 113
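A sketch (not the authors' code) of how such between-class statistics can be computed from two classes of data given as columns of matrices X1 and X2; the function name and the synthetic data in the usage example are illustrative stand-ins for the MNIST digits.

```python
# Sketch: the three statistics of the table, from two p×n_a data matrices.
import numpy as np

def class_statistics(X1, X2):
    p = X1.shape[0]
    mu1, mu2 = X1.mean(axis=1), X2.mean(axis=1)
    C1 = np.cov(X1, bias=True)               # p×p sample covariances (rows = features)
    C2 = np.cov(X2, bias=True)
    D = C1 - C2
    tr = np.trace(D @ D)
    return {
        "||mu1-mu2||^2": float(np.sum((mu1 - mu2) ** 2)),
        "tr(C1-C2)^2 / sqrt(p)": tr / np.sqrt(p),
        "tr(C1-C2)^2 / p": tr / p,
    }

# Toy usage with synthetic data (MNIST digit images would be loaded here instead)
rng = np.random.default_rng(0)
p = 784
X1 = rng.standard_normal((p, 500)) + 0.1
X2 = 1.1 * rng.standard_normal((p, 500))
print(class_statistics(X1, X2))
```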


Applications/Kernel Spectral Clustering: The case f′(τ) = α/√p    55/113

Asymptotic performance: EEG data

I EEG data are "variance-dominant".

Datasets          ‖µ1 − µ2‖²   (1/√p) tr(C1 − C2)²   (1/p) tr(C1 − C2)²
EEG (sets A, E)   2.4          10.9                  1.1

[Figure: clustering overlap as a function of α/β.]

Figure: Spectral clustering of the EEG database for varying α/β.

55 / 113


Applications/Semi-supervised Learning    56/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

56 / 113

Applications/Semi-supervised Learning    57/113

Problem Statement

Context: similar to clustering:
I Classify x1, . . . , xn ∈ Rp into k classes, with nl labelled and nu unlabelled data.
I Problem statement: assign scores Fia (with di = [K1n]i)

F = argmin_{F ∈ R^{n×k}} ∑_{a=1}^k ∑_{i,j} Kij (Fia di^{α−1} − Fja dj^{α−1})²

such that Fia = δ_{xi ∈ Ca} for all labelled xi.
I Solution (a direct implementation sketch follows below): with F(u) ∈ R^{nu×k}, F(l) ∈ R^{nl×k} the scores of the unlabelled/labelled data,

F(u) = (Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} D(u)^{−α} K(u,l) D(l)^{α−1} F(l)

where we naturally decompose

K = [ K(l,l)  K(l,u) ; K(u,l)  K(u,u) ],   D = [ D(l)  0 ; 0  D(u) ] = diag(K1n).

57 / 113
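A direct implementation sketch of this closed form (not the authors' code); the function name, the Gaussian kernel choice and the synthetic two-class data in the usage example are all illustrative assumptions.

```python
# Sketch: SSL scores F^(u) from a kernel matrix ordered as [labelled | unlabelled].
import numpy as np

def ssl_scores(K, y_l, k, alpha=0.0):
    """K: n×n kernel (labelled first), y_l: labels in {0,...,k-1} of the n_l labelled points."""
    n, n_l = K.shape[0], len(y_l)
    n_u = n - n_l
    F_l = np.eye(k)[y_l]                               # F_ia = δ_{x_i ∈ C_a} for labelled points
    d = K @ np.ones(n)
    D_l, D_u = d[:n_l], d[n_l:]
    K_uu, K_ul = K[n_l:, n_l:], K[n_l:, :n_l]
    A = np.eye(n_u) - (D_u ** -alpha)[:, None] * K_uu * (D_u ** (alpha - 1))[None, :]
    b = (D_u ** -alpha)[:, None] * K_ul * (D_l ** (alpha - 1))[None, :] @ F_l
    return np.linalg.solve(A, b)                       # F^(u) ∈ R^{n_u × k}

# Toy usage: two Gaussian classes (±0.3 mean shift), Gaussian kernel exp(-||xi-xj||²/p)
rng = np.random.default_rng(0)
p, k = 100, 2
X_l = np.hstack([rng.standard_normal((p, 10)) + s for s in (+0.3, -0.3)])   # 20 labelled
X_u = np.hstack([rng.standard_normal((p, 90)) + s for s in (+0.3, -0.3)])   # 180 unlabelled
X = np.hstack([X_l, X_u])
y_l = np.r_[np.zeros(10, int), np.ones(10, int)]
D2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
F_u = ssl_scores(np.exp(-D2 / p), y_l, k, alpha=0.0)
print("predicted class counts among unlabelled data:", np.bincount(F_u.argmax(axis=1)))
```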



Applications/Semi-supervised Learning 58/113

The finite-dimensional intuition: What we expect

Figure: Typical expected performance output

58 / 113


Applications/Semi-supervised Learning    59/113

MNIST Data Example

[Figure: entries of [F(u)]·,1 (zeros), [F(u)]·,2 (ones) and [F(u)]·,3 (twos) versus the data index.]

Figure: Vectors [F(u)]·,a, a = 1, 2, 3, for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, nl/n = 1/16, Gaussian kernel.

59 / 113


Applications/Semi-supervised Learning    60/113

MNIST Data Example

[Figure: entries of the centered vectors [F°(u)]·,1 (zeros), [F°(u)]·,2 (ones) and [F°(u)]·,3 (twos) versus the data index.]

Figure: Centered vectors [F°(u)]·,a = [F(u) − (1/k) F(u) 1k 1kᵀ]·,a, 3-class MNIST data (zeros, ones, twos), α = 0, n = 192, p = 784, nl/n = 1/16, Gaussian kernel.

60 / 113


Applications/Semi-supervised Learning    61/113

Theoretical Findings

Method: assume nl/n → cl ∈ (0, 1).
I We aim at characterizing

F(u) = (Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} D(u)^{−α} K(u,l) D(l)^{α−1} F(l).

I Taylor expansion of K as n, p → ∞:

K(u,u) = f(τ) 1nu 1nuᵀ + O‖·‖(n^{−1/2}),    D(u) = n f(τ) Inu + O(n^{1/2}),

and similarly for K(u,l), D(l).
I So that

(Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} = (Inu − (1/n) 1nu 1nuᵀ + O‖·‖(n^{−1/2}))^{−1}

is easily Taylor expanded.

61 / 113


Applications/Semi-supervised Learning    62/113

Main Results

Results: assuming nl/n → cl ∈ (0, 1), by the previous Taylor expansion,
I To first order,

F(u)·,a = C (nl,a/n) [ v + α ta 1nu/√n ] + O(n^{−1})

with v of order O(1), α ta 1nu/√n of order O(n^{−1/2}), and the O(n^{−1}) remainder carrying the informative terms; here v is an O(1) random vector (entry-wise) and ta = (1/√p) tr C°a.
I Consequences:
  I random non-informative bias v
  I strong impact of nl,a: F(u)·,a must be rescaled by nl,a
  I additional per-class bias α ta 1nu: take α = 0 + β/√p.

62 / 113


Applications/Semi-supervised Learning    63/113

Main Results

As a consequence of the remarks above, we take

α = β/√p

and define

F̂(u)i,a = (np/nl,a) F(u)i,a .

Theorem
For xi ∈ Cb unlabelled,

F̂i,· − Gb → 0,   Gb ∼ N(mb, Σb)

where mb ∈ Rk, Σb ∈ Rk×k are given by

(mb)a = −(2f′(τ)/f(τ)) M°ab + (f″(τ)/f(τ)) t°a t°b + (2f″(τ)/f(τ)) T°ab − (f′(τ)²/f(τ)²) ta tb + β (n/nl)(f′(τ)/f(τ)) ta + Bb

(Σb)a1a2 = (2 tr Cb²/p) (f′(τ)²/f(τ)² − f″(τ)/f(τ))² t°a1 t°a2 + (4 f′(τ)²/f(τ)²) ( [MᵀCbM]a1a2 + δ_{a1}^{a2} (p/nl,a1) Tba1 )

with t, T, M as before, X°a = Xa − ∑_{d=1}^k (nl,d/nl) Xd, and Bb a bias independent of a.

63 / 113


Applications/Semi-supervised Learning    64/113

Main Results

Corollary (Asymptotic Classification Error)
For k = 2 classes and a ≠ b,

P(F̂i,a > F̂i,b | xi ∈ Cb) − Q( ((mb)b − (mb)a) / √([1, −1] Σb [1, −1]ᵀ) ) → 0.

Some consequences (a numerical illustration follows below):
I non-obvious choice of appropriate kernels
I non-obvious choice of the optimal β (it induces a possibly beneficial bias)
I importance of nl versus nu.

64 / 113
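A small numerical sketch (not from the slides) evaluating this error formula with the Gaussian Q-function; the values of mb and Σb below are placeholders rather than the theorem's expressions.

```python
# Sketch: asymptotic misclassification probability Q((m_b)_b - (m_b)_a) / sqrt([1,-1] Σ_b [1,-1]^T)).
import numpy as np
from scipy.stats import norm

m_b = np.array([0.1, 0.4])                 # toy (m_b)_a, a = 1, 2  (class b = 2)
Sigma_b = np.array([[0.05, 0.01],
                    [0.01, 0.04]])         # toy Σ_b

w = np.array([1.0, -1.0])
arg = (m_b[1] - m_b[0]) / np.sqrt(w @ Sigma_b @ w)
print("asymptotic misclassification probability ≈", norm.sf(arg))   # Q(arg) = P(N(0,1) > arg)
```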


Applications/Semi-supervised Learning    65/113

MNIST Data Example

[Figure: probability of correct classification versus α: simulations and theory as if the data were Gaussian.]

Figure: Performance as a function of α, for 3-class MNIST data (zeros, ones, twos), n = 192, p = 784, nl/n = 1/16, Gaussian kernel.

65 / 113


Applications/Semi-supervised Learning    66/113

MNIST Data Example

[Figure: probability of correct classification versus α: simulations and theory as if the data were Gaussian.]

Figure: Performance as a function of α, for 2-class MNIST data (zeros, ones), n = 1568, p = 784, nl/n = 1/16, Gaussian kernel.

66 / 113


Applications/Semi-supervised Learning    67/113

Is semi-supervised learning really semi-supervised?

Reminder: for xi ∈ Cb unlabelled, F̂i,· − Gb → 0, Gb ∼ N(mb, Σb) with

(mb)a = −(2f′(τ)/f(τ)) M°ab + (f″(τ)/f(τ)) t°a t°b + (2f″(τ)/f(τ)) T°ab − (f′(τ)²/f(τ)²) ta tb + β (n/nl)(f′(τ)/f(τ)) ta + Bb

(Σb)a1a2 = (2 tr Cb²/p) (f′(τ)²/f(τ)² − f″(τ)/f(τ))² t°a1 t°a2 + (4 f′(τ)²/f(τ)²) ( [MᵀCbM]a1a2 + δ_{a1}^{a2} (p/nl,a1) Tba1 )

with t, T, M as before, X°a = Xa − ∑_{d=1}^k (nl,d/nl) Xd, and Bb a bias independent of a.

The problem with unlabelled data:
I The result does not depend on nu!
  −→ increasing nu is asymptotically not beneficial.
I Even the best Laplacian regularizer reduces SSL to merely supervised learning.

67 / 113


Applications/Semi-supervised Learning improved    68/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f′(τ) = 0
  Kernel Spectral Clustering: The case f′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

68 / 113

Applications/Semi-supervised Learning improved    69/113

Resurrecting SSL by centering

Reminder:

F = argmin_{F ∈ R^{n×k}} ∑_{a=1}^k ∑_{i,j} Kij (Fia di^{α−1} − Fja dj^{α−1})²  with  F(l)ia = δ_{xi ∈ Ca}

⇔ F(u) = (Inu − D(u)^{−α} K(u,u) D(u)^{α−1})^{−1} D(u)^{−α} K(u,l) D(l)^{α−1} F(l).

Domination of score flattening:
I The finite-dimensional intuition imposes Kij decreasing with ‖xi − xj‖ ⇒ the solutions Fia tend to "flatten".
I Consequence: D(u)^{−α} K(u,u) D(u)^{α−1} ≃ (1/n) 1nu 1nuᵀ and the clustering information vanishes (not so obvious, but it can be shown).

Solution (an implementation sketch follows below):
I Forget the finite-dimensional intuition: "recenter" K to kill the flattening, i.e., use

K̂ = PKP,   P = In − (1/n) 1n 1nᵀ.

69 / 113
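An implementation sketch (not the authors' code) of the resulting centered-kernel scores fu = (α Inu − K̂uu)^{−1} K̂ul fl, anticipating the setting of the next slide; the inner-product kernel, the value of α and the synthetic data are illustrative assumptions.

```python
# Sketch: centered-kernel SSL scores K̂ = PKP, f_u = (α I - K̂_uu)^{-1} K̂_ul f_l.
import numpy as np

def centered_ssl_scores(K, f_l, alpha):
    """K: n×n kernel with labelled points first; f_l: ±1 labels of the n_l labelled points."""
    n, n_l = K.shape[0], len(f_l)
    P = np.eye(n) - np.ones((n, n)) / n
    Kc = P @ K @ P                                          # K̂ = PKP
    K_uu, K_ul = Kc[n_l:, n_l:], Kc[n_l:, :n_l]
    return np.linalg.solve(alpha * np.eye(n - n_l) - K_uu, K_ul @ f_l)

# Toy usage: x_i ~ N(±µ, I_p) as in the setting of the next slide
rng = np.random.default_rng(0)
p, n_l, n_u = 200, 20, 380
mu = np.r_[2.5, np.zeros(p - 1)]
signs = np.r_[np.repeat([1.0, -1.0], n_l // 2), np.repeat([1.0, -1.0], n_u // 2)]
X = np.outer(mu, signs) + rng.standard_normal((p, n_l + n_u))
K = X.T @ X / (n_l + n_u)                                   # simple inner-product kernel (a choice)
f_u = centered_ssl_scores(K, signs[:n_l], alpha=10.0)       # α chosen beyond the spectrum of K̂_uu here
acc = np.mean(np.sign(f_u) == signs[n_l:])
print("unlabelled classification accuracy:", acc)
```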


Applications/Semi-supervised Learning improved    70/113

Theoretical results

Setting:
I k = 2 classes, xi ∼ N(±µ, Ip)
I scores fu = (α Inu − K̂uu)^{−1} K̂ul fl.

Theorem (Asymptotic mean and variance)
As n → ∞,

(j(u)i)ᵀ fu / nui − mi →a.s. 0,     (fu − mi 1nu)ᵀ D(u)i (fu − mi 1nu) / nui − σi² →a.s. 0

where, for i = 1, 2,

mi ≡ −(cl/cu) si ( 1 − [ 1 + (cu c1 c2 ‖µ‖²/c0) δ/(1 + δ) ]^{−1} )

σi² ≡ (si² cl² ci² ‖µ‖² δ²)/(c0²(1 + δ)² − cu c0 δ²) · (1 + (cu c1 c2 ‖µ‖²/c0) δ²/(1 + δ)²)/(1 + (cu c1 c2 ‖µ‖²/c0) δ/(1 + δ))² + (si² cl ci (1 − ci) δ²)/(c0 (1 + δ)² − cu δ²)

with δ defined as

δ ≡ −1/2 + (cu − c0 + sign(α) √((α − α−)(α − α+)))/(2α).

70 / 113


Applications/Semi-supervised Learning improved    71/113

Performance as a function of nu, nl

[Figure: correct classification rate versus cu/cl (blue) and cl/cu (black), for ‖µ‖ = 1, 2, 3, with the Neyman–Pearson benchmark in red.]

Figure: Correct classification rate, at the optimal α, as a function of (i) nu for fixed p/nl = 5 (blue) and (ii) nl for fixed p/nu = 5 (black); c1 = c2 = 1/2; different values of ‖µ‖. Comparison to the optimal Neyman–Pearson performance for known µ (in red).

71 / 113

Applications/Semi-supervised Learning improved    72/113

The spike case or not (1)

Marcenko–Pastur + spike limit:
I the limiting eigenvalue distribution is the Marcenko–Pastur law
I an isolated spike is present iff

‖µ‖² > (1/(c1 c2)) √(c0/cu)

I this determines whether or not an unsupervised spectral clustering solution exists.

[Figure: histogram of the eigenvalues of K̂uu versus the Marcenko–Pastur law, with the isolated spike marked.]

Figure: Eigenvalue distribution of K̂uu versus the (scaled) Marcenko–Pastur law with Stieltjes transform δ, for cu = 9/10, c0 = 1/2. The value ‖µ‖ = 2.5 ensures the presence of a leading isolated eigenvalue (spike).

72 / 113
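A small numerical check (not from the slides) of this condition as reconstructed above, using the parameter values that appear in this and the following figures.

```python
# Sketch: spike-existence check ||µ||² > (1/(c1 c2)) √(c0/cu) for the slides' values.
import numpy as np

c0, cu, c1, c2 = 1/2, 9/10, 1/2, 1/2
threshold = np.sqrt(c0 / cu) / (c1 * c2)
for norm_mu in (1.7, 1.8, 2.5):
    verdict = "spike" if norm_mu ** 2 > threshold else "no spike"
    print(f"||µ|| = {norm_mu}: ||µ||² = {norm_mu**2:.2f} -> {verdict} (threshold {threshold:.2f})")
```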


Applications/Semi-supervised Learning improved    73/113

The spike case or not (2)

[Figure: asymptotic correct classification probability versus α > α+, for ‖µ‖ = 1.7 (below the phase transition) and ‖µ‖ = 1.8 (above), with the level corresponding to δ(α) = −(1 + cu c1 c2 c0^{−1} ‖µ‖²)^{−1} marked.]

Figure: Asymptotic correct classification probability Φ(m1/σ1) as a function of α for cu = 9/10, c0 = 1/2, c1 = 1/2, and two values of ‖µ‖, below and above the phase transition.

73 / 113


Applications/Semi-supervised Learning improved 74/113

SSL: the road from supervised to unsupervised

Figure: Theory (solid) versus practice (dashed; from right to left: n = 400, 1000, 4000): correct classification probability as a function of α, for cu = 9/10, c0 = 1/2, c1 = 1/2, and left: ‖µ‖ = 1.5 (below the phase transition); right: ‖µ‖ = 2.5 (above the phase transition). Different values of n.

74 / 113


Applications/Semi-supervised Learning improved 75/113

Experimental evidence: MNIST

Digits                      (0,8)       (2,7)       (6,9)

nu = 100
Centered kernel             89.5±3.6    89.5±3.4    85.3±5.9
Iterated centered kernel    89.5±3.6    89.5±3.4    85.3±5.9
Laplacian                   75.5±5.6    74.2±5.8    70.0±5.5
Iterated Laplacian          87.2±4.7    86.0±5.2    81.4±6.8
Manifold                    88.0±4.7    88.4±3.9    82.8±6.5

nu = 1000
Centered kernel             92.2±0.9    92.5±0.8    92.6±1.6
Iterated centered kernel    92.3±0.9    92.5±0.8    92.9±1.4
Laplacian                   65.6±4.1    74.4±4.0    69.5±3.7
Iterated Laplacian          92.2±0.9    92.4±0.9    92.0±1.6
Manifold                    91.1±1.7    91.4±1.9    91.4±2.0

Table: Comparison of classification accuracy (%) on MNIST datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.

75 / 113


Applications/Semi-supervised Learning improved 76/113

Experimental evidence: Traffic signs (HOG features)

Class ID                    (2,7)        (9,10)       (11,18)

nu = 100
Centered kernel             79.0±10.4    77.5±9.2     78.5±7.1
Iterated centered kernel    85.3±5.9     89.2±5.6     90.1±6.7
Laplacian                   73.8±9.8     77.3±9.5     78.6±7.2
Iterated Laplacian          83.7±7.2     88.0±6.8     87.1±8.8
Manifold                    77.6±8.9     81.4±10.4    82.3±10.8

nu = 1000
Centered kernel             83.6±2.4     84.6±2.4     88.7±9.4
Iterated centered kernel    84.8±3.8     88.0±5.5     96.4±3.0
Laplacian                   72.7±4.2     88.9±5.7     95.8±3.2
Iterated Laplacian          83.0±5.5     88.2±6.0     92.7±6.1
Manifold                    77.7±5.8     85.0±9.0     90.6±8.1

Table: Comparison of classification accuracy (%) on German Traffic Sign datasets with nl = 10. Computed over 1000 random iterations for nu = 100 and 100 for nu = 1000.

76 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 77/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f ′(τ) = 0
  Kernel Spectral Clustering: The case f ′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

77 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 78/113

Random Feature Maps and Extreme Learning Machines

Context: Random Feature Map
I (large) input x1, . . . , xT ∈ Rp
I random W ∈ Rn×p with rows w1^T, . . . , wn^T
I non-linear activation function σ.

Neural Network Model (extreme learning machine): Ridge-regression learning
I small output y1, . . . , yT ∈ Rd
I ridge-regression output β ∈ Rn×d

78 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 79/113

Random Feature Maps and Extreme Learning Machines

Objectives: evaluate training and testing MSE performance as n, p, T → ∞

I Training MSE:

Etrain = (1/T) ∑_{i=1}^T ‖yi − β^T σ(W xi)‖² = (1/T) ‖Y − β^T Σ‖²_F

with

Σ = σ(WX) = { σ(wi^T xj) }_{1≤i≤n, 1≤j≤T}

β = (1/T) Σ ((1/T) Σ^T Σ + γ I_T)^{-1} Y^T.

I Testing MSE: upon a new pair (X̂, Ŷ) of length T̂,

Etest = (1/T̂) ‖Ŷ − β^T Σ̂‖²_F,

where Σ̂ = σ(WX̂).

79 / 113
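A direct numerical transcription of these definitions, as a minimal sketch on synthetic data (all sizes, the ReLU activation and the linear-threshold labels are illustrative choices, not those of the MNIST experiments reported later):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, T, gamma = 100, 200, 400, 1e-1

# synthetic (X, Y) and a fresh test pair (Xhat, Yhat); output dimension d = 1
w_star = rng.standard_normal(p)
X = rng.standard_normal((p, T));    Y = np.sign(w_star @ X)[None, :]
Xhat = rng.standard_normal((p, T)); Yhat = np.sign(w_star @ Xhat)[None, :]

W = rng.standard_normal((n, p))                      # random first layer
sigma = lambda t: np.maximum(t, 0.0)                 # ReLU activation

Sigma = sigma(W @ X)                                 # n x T random feature matrix
Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
beta = Sigma @ Q @ Y.T / T                           # ridge-regression output layer

E_train = np.linalg.norm(Y - beta.T @ Sigma, 'fro') ** 2 / T
E_test = np.linalg.norm(Yhat - beta.T @ sigma(W @ Xhat), 'fro') ** 2 / T
print(E_train, E_test)
```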


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 80/113

Technical Aspects

Preliminary observations:

I Link to the resolvent of (1/T) Σ^T Σ:

Etrain = (γ²/T) tr Y^T Y Q² = −γ² (∂/∂γ) (1/T) tr Y^T Y Q

where Q = Q(γ) is the resolvent

Q ≡ ((1/T) Σ^T Σ + γ I_T)^{-1}

with Σij = σ(wi^T xj).

Central object: the resolvent E[Q].

80 / 113
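The identity Etrain = (γ²/T) tr Y^T Y Q² follows from Y^T − Σ^T β = γ Q Y^T; a quick numerical check on toy dimensions (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, T, gamma = 30, 50, 80, 0.5
X = rng.standard_normal((p, T)); Y = rng.standard_normal((1, T))
Sigma = np.tanh(rng.standard_normal((n, p)) @ X)

Q = np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T))
beta = Sigma @ Q @ Y.T / T

lhs = np.linalg.norm(Y - beta.T @ Sigma, 'fro') ** 2 / T    # direct training MSE
rhs = gamma ** 2 * np.trace(Y.T @ Y @ Q @ Q) / T            # resolvent expression
print(lhs, rhs)                                             # agree up to round-off
```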


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 81/113

Main Technical Result

Theorem [Asymptotic Equivalent for E[Q]]
For Lipschitz σ, bounded ‖X‖, ‖Y‖, W = f(Z) (entry-wise) with Z standard Gaussian, we have, for all ε > 0,

‖E[Q] − Q̄‖ < C n^{ε−1/2}

for some C > 0, where

Q̄ = ( (n/T) Φ/(1 + δ) + γ I_T )^{-1}

Φ ≡ E[ σ(X^T w) σ(w^T X) ]

with w = f(z), z ∼ N(0, Ip), and δ > 0 the unique positive solution to

δ = (1/T) tr Φ Q̄.

Proof arguments:
I σ(WX) has independent rows but dependent columns
I this breaks the “trace lemma” argument (i.e., (1/p) w^T X A X^T w ≃ (1/p) tr X A X^T)

Concentration of measure: P( |(1/p) σ(w^T X) A σ(X^T w) − (1/p) tr ΦA| > t ) ≤ C e^{−c n min(t, t²)}

81 / 113
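For the linear activation σ(t) = t and w ∼ N(0, Ip), Φ = E[X^T w w^T X] = X^T X is explicit, so the fixed point for δ and the deterministic equivalent Q̄ can be formed directly. The sketch below compares a Monte-Carlo estimate of E[Q] with Q̄; the sizes, γ and the normalization of X are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, T, gamma = 60, 120, 100, 0.1
X = rng.standard_normal((p, T)) / np.sqrt(p)     # bounded-norm data
Phi = X.T @ X                                    # Phi for sigma(t) = t, w ~ N(0, I_p)

# fixed point: delta = (1/T) tr(Phi Qbar), Qbar = (n/T * Phi/(1+delta) + gamma I)^(-1)
delta = 0.0
for _ in range(200):
    Qbar = np.linalg.inv((n / T) * Phi / (1 + delta) + gamma * np.eye(T))
    delta = np.trace(Phi @ Qbar) / T

# Monte-Carlo estimate of E[Q]
EQ = np.zeros((T, T))
n_mc = 200
for _ in range(n_mc):
    Sigma = rng.standard_normal((n, p)) @ X      # sigma(W X) with sigma(t) = t
    EQ += np.linalg.inv(Sigma.T @ Sigma / T + gamma * np.eye(T)) / n_mc

print(np.linalg.norm(EQ - Qbar, 2))              # small, decreasing with p, n, T
```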


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 82/113

Main Technical Result

I Values of Φ(a, b) for w ∼ N(0, Ip):

σ(t)          Φ(a, b)
max(t, 0)     (1/(2π)) ‖a‖‖b‖ ( ∠(a,b) acos(−∠(a,b)) + √(1 − ∠(a,b)²) )
|t|           (2/π) ‖a‖‖b‖ ( ∠(a,b) asin(∠(a,b)) + √(1 − ∠(a,b)²) )
erf(t)        (2/π) asin( 2 a^T b / √((1 + 2‖a‖²)(1 + 2‖b‖²)) )
1_{t>0}       1/2 − (1/(2π)) acos(∠(a,b))
sign(t)       1 − (2/π) acos(∠(a,b))
cos(t)        exp(−(‖a‖² + ‖b‖²)/2) cosh(a^T b)

where ∠(a, b) ≡ a^T b / (‖a‖‖b‖).

I Value of Φ(a, b) for wi i.i.d. with E[wi^k] = mk (m1 = 0), σ(t) = ζ2 t² + ζ1 t + ζ0:

Φ(a, b) = ζ2² [ m2² (2(a^T b)² + ‖a‖²‖b‖²) + (m4 − 3 m2²) (a²)^T (b²) ]
          + ζ1² m2 a^T b
          + ζ2 ζ1 m3 [ (a²)^T b + a^T (b²) ]
          + ζ2 ζ0 m2 [ ‖a‖² + ‖b‖² ]
          + ζ0²

where (a²) ≡ [a1², . . . , ap²]^T.

82 / 113
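Two of the closed-form entries (max(t, 0) and erf) can be checked against a Monte-Carlo estimate of E[σ(a^T w) σ(w^T b)] for w ∼ N(0, Ip); the vectors a and b below are arbitrary test points and purely illustrative.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)
p, n_mc = 50, 200_000
a = rng.standard_normal(p) / np.sqrt(p)
b = rng.standard_normal(p) / np.sqrt(p)
ang = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))        # angle(a, b)
W = rng.standard_normal((n_mc, p))                           # Monte-Carlo draws of w

# sigma(t) = max(t, 0)
mc_relu = np.mean(np.maximum(W @ a, 0) * np.maximum(W @ b, 0))
th_relu = (np.linalg.norm(a) * np.linalg.norm(b) / (2 * np.pi)
           * (ang * np.arccos(-ang) + np.sqrt(1 - ang ** 2)))

# sigma(t) = erf(t)
mc_erf = np.mean(erf(W @ a) * erf(W @ b))
th_erf = (2 / np.pi) * np.arcsin(2 * (a @ b)
          / np.sqrt((1 + 2 * a @ a) * (1 + 2 * b @ b)))

print(mc_relu, th_relu)
print(mc_erf, th_erf)
```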


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 83/113

Main Results

Theorem [Asymptotic Etrain]
For all ε > 0,

n^{1/2−ε} (Etrain − Ētrain) → 0

almost surely, where

Etrain = (1/T) ‖Y^T − Σ^T β‖²_F = (γ²/T) tr Y^T Y Q²

Ētrain = (γ²/T) tr Y^T Y Q̄ [ ((1/n) tr Ψ Q̄²) / (1 − (1/n) tr (Ψ Q̄)²) · Ψ + I_T ] Q̄

with Ψ ≡ (n/T) Φ/(1 + δ).

83 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 84/113

Main Results

I Letting X̂ ∈ R^{p×T̂}, Ŷ ∈ R^{d×T̂} satisfy “similar properties” as (X, Y),

Claim [Asymptotic Etest]
For all ε > 0,

n^{1/2−ε} (Etest − Ētest) → 0

almost surely, where

Etest = (1/T̂) ‖Ŷ^T − Σ̂^T β‖²_F

Ētest = (1/T̂) ‖Ŷ^T − Ψ^T_{XX̂} Q̄ Y^T‖²_F
        + ((1/n) tr Y^T Y Q̄ Ψ Q̄) / (1 − (1/n) tr (Ψ Q̄)²) × [ (1/T̂) tr Ψ_{X̂X̂} − (1/T̂) tr (I_T + γ Q̄)(Ψ_{XX̂} Ψ_{X̂X} Q̄) ]

with Ψ_{AB} = (n/T) Φ_{AB}/(1 + δ), Φ_{AB} = E[σ(A^T w) σ(w^T B)].

84 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 85/113

Simulations on MNIST: Lipschitz σ(·)

Figure: Neural network performance for Lipschitz continuous σ(·) (σ(t) = t, |t|, erf(t), max(t, 0)): training and testing MSE Etrain, Etest and their asymptotic equivalents, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.

85 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 86/113

Simulations on MNIST: non Lipschitz σ(·)

Figure: Neural network performance for σ(·) either discontinuous or non-Lipschitz (σ(t) = sign(t), 1_{t>0}, 1 − t²/2): training and testing MSE Etrain, Etest and their asymptotic equivalents, as a function of γ, for 2-class MNIST data (sevens, nines), n = 512, T = T̂ = 1024, p = 784.

86 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 87/113

Deeper investigation on Φ

Statistical Assumptions on X
I Gaussian mixture model

xi ∈ Ca ⇔ xi ∼ N( (1/√p) µa, (1/p) Ca ).

I Growth rate: ‖µa‖ = O(1), (1/√p) tr Ca = O(1).

Theorem
As p, T → ∞, for all σ(·) given in the next table,

‖PΦP − P Φ̄ P‖ → 0 almost surely

with

Φ̄ ≡ d1 (Ω + M J^T/√p)^T (Ω + M J^T/√p) + d2 U B U^T + d0 I_T

U ≡ [ J/√p, φ ],   B ≡ [[ t t^T + 2T, t ], [ t^T, 1 ]]

and d0, d1, d2 given in the next table (φi = ‖wi‖² − E[‖wi‖²] for xi = (1/√p) µa + wi).

87 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 88/113

Deeper investigation on Φ

Figure: Coefficients di in Φ for different σ(·).

σ(t)                 d0                                          d1                    d2
t                    0                                           1                     0
ReLU(t)              (1/4 − 1/(2π)) τ                            1/4                   1/(8πτ)
|t|                  (1 − 2/π) τ                                 0                     1/(2πτ)
LReLU(t)             ((π−2)/(4π)) (ς+ + ς−)² τ                   (1/4) (ς+ − ς−)²      (1/(8πτ)) (ς+ + ς−)²
1_{t>0}              1/4 − 1/(2π)                                1/(2πτ)               0
sign(t)              1 − 2/π                                     2/(πτ)                0
ς2 t² + ς1 t + ς0    2 τ² ς2²                                    ς1²                   ς2²
cos(t)               1/2 + e^{−2τ}/2 − e^{−τ}                    0                     e^{−τ}/4
sin(t)               1/2 − e^{−2τ}/2 − τ e^{−τ}                  e^{−τ}                0
erf(t)               (2/π)( arccos(2τ/(2τ+1)) − 2τ/(2τ+1) )      (4/π) 1/(2τ+1)        0
exp(−t²/2)           1/√(2τ+1) − 1/(τ+1)                         0                     1/(4(τ+1)³)

where
I ReLU(t) = max(t, 0)
I LReLU(t) = ς+ max(t, 0) + ς− max(−t, 0).

88 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 89/113

Deeper investigation on Φ

Three groups of functions σ(·) emerge:
I “means-oriented”: d2 = 0
I “covariance-oriented”: d1 = 0
I “balanced”: d1, d2 ≠ 0

Case of the Leaky–ReLU
I σ(t) = ς+ max(t, 0) + ς− max(−t, 0)

Figure: Eigenvectors 1 and 2 of PΦP, for classes N(µ1, C1), N(µ1, C2), N(µ2, C1), N(µ2, C2); left panel: ς+ = ς− = 1, right panel: ς+ = 1, ς− = 0.

89 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 90/113

Deeper investigation on Φ: Simulation results

Table: Clustering accuracies for different σ(t) on the MNIST dataset (n = 32).

                  σ(t)           T = 32     T = 64     T = 128

mean-oriented     t              85.31%     88.94%     87.30%
                  1_{t>0}        86.00%     82.94%     85.56%
                  sign(t)        81.94%     83.34%     85.22%
                  sin(t)         85.31%     87.81%     87.50%
                  erf(t)         86.50%     87.28%     86.59%

cov-oriented      |t|            62.81%     60.41%     57.81%
                  cos(t)         62.50%     59.56%     57.72%
                  exp(−t²/2)     64.00%     60.44%     58.67%

balanced          ReLU(t)        82.87%     85.72%     82.27%

90 / 113


Applications/Random Feature Maps, Extreme Learning Machines, and Neural Networks 91/113

Deeper investigation on Φ: Simulation results

Table: Clustering accuracies for different σ(t) on the epileptic EEG dataset (n = 32).

                  σ(t)           T = 32     T = 64     T = 128

mean-oriented     t              71.81%     70.31%     69.58%
                  1_{t>0}        65.19%     65.87%     63.47%
                  sign(t)        67.13%     64.63%     63.03%
                  sin(t)         71.94%     70.34%     68.22%
                  erf(t)         69.44%     70.59%     67.70%

cov-oriented      |t|            99.69%     99.69%     99.50%
                  cos(t)         99.00%     99.38%     99.36%
                  exp(−t²/2)     99.81%     99.81%     99.77%

balanced          ReLU(t)        84.50%     87.91%     90.97%

91 / 113


Applications/Community Detection on Graphs 92/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f ′(τ) = 0
  Kernel Spectral Clustering: The case f ′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

92 / 113


Applications/Community Detection on Graphs 93/113

System Setting

Undirected graph with n nodes, m edges:
I “intrinsic” average connectivity q1, . . . , qn ∼ µ i.i.d.
I k classes C1, . . . , Ck, independent of the qi, of (large) sizes n1, . . . , nk, with preferential attachment Cab between Ca and Cb
I edge probability for nodes i ∈ Cgi:

P(i ∼ j) = qi qj Cgigj.

I adjacency matrix A with

Aij ∼ Bernoulli(qi qj Cgigj)

93 / 113
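A sketch of how such a graph can be sampled (degree-corrected stochastic block model); the bi-modal law for q and the affinity matrix C below are toy choices echoing the later examples, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 1000, 3
q = rng.choice([0.1, 0.5], size=n, p=[0.75, 0.25])   # intrinsic connectivities q_i ~ mu
g = rng.integers(k, size=n)                           # class labels g_i
M = 8 * (np.eye(k) - np.ones((k, k)) / k)             # O(1) class-affinity fluctuations
C = 1 + M / np.sqrt(n)                                # C_ab = 1 + M_ab / sqrt(n)

P = np.outer(q, q) * C[np.ix_(g, g)]                  # P(i ~ j) = q_i q_j C_{g_i g_j}
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                        # undirected, no self-loops
d = A.sum(axis=1)                                     # degree vector
print(A.shape, d.mean())
```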


Applications/Community Detection on Graphs 94/113

Limitations of Classical Methods

I 3 classes with µ bi-modal (µ = (3/4) δ0.1 + (1/4) δ0.5)

(Modularity A − dd^T/(2m))        (Bethe Hessian D − rA)

94 / 113


Applications/Community Detection on Graphs 95/113

Proposed Regularized Modularity Approach

Recall: P(i ∼ j) = qi qj Cgigj.

Dense Regime Assumptions: Non-trivial regime when, ∀ a, b, as n → ∞,

Cab = 1 + Mab/√n,   Mab = O(1).

Community information is weak but highly redundant.

Considered Matrix:

Lα = (2m)^α (1/√n) D^{−α} [ A − dd^T/(2m) ] D^{−α}.

95 / 113
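A direct construction of Lα from an adjacency matrix (a sketch; A can be generated as in the previous block, and the routine assumes no isolated nodes so that D^{−α} is well defined):

```python
import numpy as np

def L_alpha(A, alpha):
    """Regularized modularity matrix
    L_alpha = (2m)^alpha / sqrt(n) * D^(-alpha) [A - d d^T/(2m)] D^(-alpha)."""
    n = A.shape[0]
    d = A.sum(axis=1)                      # degrees (assumed > 0)
    two_m = d.sum()                        # 2m
    D_pow = np.diag(d ** (-alpha))
    B = A - np.outer(d, d) / two_m         # modularity matrix
    return (two_m ** alpha / np.sqrt(n)) * D_pow @ B @ D_pow

# usage: the top eigenvectors of L_alpha (alpha near alpha_opt) feed k-means for
# community recovery, e.g. vals, vecs = np.linalg.eigh(L_alpha(A, 0.5))
```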


Applications/Community Detection on Graphs 96/113

Asymptotic Equivalence

Theorem (Limiting Random Matrix Equivalent)
As n → ∞, ‖Lα − L̄α‖ → 0 almost surely, where

Lα = (2m)^α (1/√n) D^{−α} [ A − dd^T/(2m) ] D^{−α}

L̄α = (1/√n) Dq^{−α} X Dq^{−α} + U Λ U^T

with Dq = diag(qi), X a zero-mean random matrix with variance profile,

U = [ Dq^{1−α} J/√n,  Dq^{−α} X 1n ],   rank k + 1

Λ = [[ (Ik − 1k c^T) M (Ik − c 1k^T),  −1k ], [ 1k^T,  0 ]]

and J = [j1, . . . , jk], ja = [0, . . . , 0, 1^T_{na}, 0, . . . , 0]^T ∈ Rn.

Consequences:
I isolated eigenvalues beyond the phase transition ⇔ λ(M) > “spectrum edge”
  Optimal choice αopt of α from the study of the limiting spectrum.
I eigenvectors correlated to Dq^{1−α} J
  Necessary regularization by D^{α−1}.

96 / 113


Applications/Community Detection on Graphs 97/113

Eigenvalue Spectrum

Figure: Eigenvalues of L1 (n = 2000) versus the limiting law; the isolated spikes lie outside the bulk support edges −Sα, Sα. 3 classes, c1 = c2 = 0.3, c3 = 0.4, µ = (1/2) δ0.4 + (1/2) δ0.9, M = 4 [ 3 −1 −1 ; −1 3 −1 ; −1 −1 3 ].

97 / 113


Applications/Community Detection on Graphs 98/113

Phase Transition

Theorem (Phase Transition)
Isolated eigenvalue λi(Lα) if |λi(M̄)| > τα, M̄ = (D(c) − cc^T) M, where

τα = lim_{x↓Sα+} −1/gα(x),   phase transition threshold

with [Sα−, Sα+] the limiting eigenvalue support of Lα and gα(x) (|x| > Sα+) solution of

fα(x) = ∫ q^{1−2α} / ( −x − q^{1−2α} fα(x) + q^{2−2α} gα(x) ) µ(dq)

gα(x) = ∫ q^{2−2α} / ( −x − q^{1−2α} fα(x) + q^{2−2α} gα(x) ) µ(dq).

In this case, λi(Lα) → (gα)^{−1}( −1/λi(M̄) ) almost surely.

Clustering possible when λi(M̄) > min_α τα:
I “Optimal” αopt ≡ argmin_α τα.
I From q̂i ≡ di/√(d^T 1n) → qi almost surely, µ ≃ µ̂ ≡ (1/n) ∑_{i=1}^n δ_{q̂i}, and thus:
  Consistent estimator α̂opt of αopt.

98 / 113


Applications/Community Detection on Graphs 99/113

Simulated Performance Results (2 masses of qi)

(Modularity A − dd^T/(2m))    (Bethe Hessian D − rA)

(Proposed, α = 1)    (Proposed, αopt)

Figure: 3 classes, µ = (3/4) δ0.1 + (1/4) δ0.5, c1 = c2 = 1/4, c3 = 1/2, M = 100 I3.

99 / 113


Applications/Community Detection on Graphs 100/113

Simulated Performance Results (2 masses for qi)

Figure: Overlap performance as a function of ∆, for α = 0, α = 1/2, α = 1, α = αopt and the Bethe Hessian, with the phase transition point marked; n = 3000, K = 3, ci = 1/3, µ = (3/4) δ_{q(1)} + (1/4) δ_{q(2)} with q(1) = 0.1 and q(2) = 0.5, M = ∆ I3, for ∆ ∈ [5, 50]. Here αopt = 0.07.

100 / 113


Applications/Community Detection on Graphs 101/113

Simulated Performance Results (2 masses for qi)

Figure: Overlap performance as a function of q(2) (with q(1) = 0.1), for α = 0, α = 0.5, α = 1, α = αopt and the Bethe Hessian; n = 3000, K = 3, µ = (3/4) δ_{q(1)} + (1/4) δ_{q(2)} with q(1) = 0.1 and q(2) ∈ [0.1, 0.9], M = 10 (2 I3 − 1_3 1_3^T), ci = 1/3.

101 / 113


Applications/Community Detection on Graphs 102/113

Real Graph Example: PolBlogs (n = 1490, two classes)

(L0: λmax ≃ 1.75)    (L1: λmax ≃ 483)

Algorithms      Overlap    Modularity
αopt (≃ 0)      0.897      0.4246
α = 0.5         0.035      ≃ 0
α = 1           0.040      ≃ 0
BH              0.304      0.2723

102 / 113


Perspectives/ 103/113

Outline

Basics of Random Matrix Theory
  Motivation: Large Sample Covariance Matrices
  Spiked Models

Applications
  Reminder on Spectral Clustering Methods
  Kernel Spectral Clustering
  Kernel Spectral Clustering: The case f ′(τ) = 0
  Kernel Spectral Clustering: The case f ′(τ) = α/√p
  Semi-supervised Learning
  Semi-supervised Learning improved
  Random Feature Maps, Extreme Learning Machines, and Neural Networks
  Community Detection on Graphs

Perspectives

103 / 113


Perspectives/ 104/113

Summary of Results and Perspectives I

Random Neural Networks.

4 Extreme learning machines (one-layer random NN)

4 Linear echo-state networks (ESN)

. Logistic regression and classification error in extreme learning machines (ELM)

. Further random feature maps characterization

. Generalized random NN (multiple layers, multiple activations)

. Random convolutional networks for image processing

­ Non-linear ESN

Deep Neural Networks (DNN).

. Backpropagation in NN (σ(WX) for random X, backprop. on W )

­ Statistical physics-inspired approaches (spin-glass models, Hamiltonian-based models)

­ Non-linear ESN

DNN performance of physics-realistic models (4th-order Hamiltonian, locality)

104 / 113


Perspectives/ 105/113

Summary of Results and Perspectives II

References.

H. W. Lin, M. Tegmark, “Why does deep and cheap learning work so well?”, arXiv:1608.08225v2, 2016.

C. Williams, “Computation with infinite neural networks”, Neural Computation, 10(5), 1203-1216, 1998.

H. Jaeger, “Short term memory in echo state networks”, GMD-Forschungszentrum Informationstechnik, 2001.

G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, “Extreme learning machine: theory and applications”, Neurocomputing, 70(1), 489-501, 2006.

N. El Karoui, “Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond”, The Annals of Applied Probability, 19(6), 2362-2405, 2009.

C. Louart, Z. Liao, R. Couillet, “A Random Matrix Approach to Neural Networks”, (submitted to) Annals of Applied Probability, 2017.

R. Couillet, G. Wainrib, H. Sevi, H. Tiomoko Ali, “The asymptotic performance of linear echo state neural networks”, Journal of Machine Learning Research, vol. 17, no. 178, pp. 1-35, 2016.

A. Choromanska et al., “The Loss Surfaces of Multilayer Networks”, AISTATS, 2015.

A. Rahimi, B. Recht, “Random Features for Large-Scale Kernel Machines”, NIPS, vol. 3, no. 4, 2007.

105 / 113


Perspectives/ 106/113

Summary of Results and Perspectives I

Kernel methods.

4 Spectral clustering

4 Subspace spectral clustering (f ′(τ) = 0)

. Spectral clustering with outer product kernel f(xTy)

4 Semi-supervised learning, kernel approaches.

4 Least square support vector machines (LS-SVM).

. Support vector machines (SVM).

­ Kernel matrices based on Kendall τ , Spearman ρ.

Applications.

4 Massive MIMO user subspace clustering (patent proposed)

­ Kernel correlation matrices for biostats, heterogeneous datasets.

­ Kernel PCA.

­ Kendall τ in biostats.

References.

N. El Karoui, “The spectrum of kernel random matrices”, The Annals of Statistics, 38(1), 1-50, 2010.

106 / 113


Perspectives/ 107/113

Summary of Results and Perspectives II

R. Couillet, F. Benaych-Georges, “Kernel Spectral Clustering of Large Dimensional Data”, Electronic Journal of Statistics, vol. 10, no. 1, pp. 1393-1454, 2016.

R. Couillet, A. Kammoun, “Random Matrix Improved Subspace Clustering”, Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2016.

Z. Liao, R. Couillet, “A Large Dimensional Analysis of Least Squares Support Vector Machines”, (submitted to) Journal of Machine Learning Research, 2017.

X. Mai, R. Couillet, “The counterintuitive mechanism of graph-based semi-supervised learning in the big data regime”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'17), New Orleans, USA, 2017.

107 / 113


Perspectives/ 108/113

Summary of Results and Perspectives I

Community detection.

4 Heterogeneous dense network clustering.

. Semi-supervised clustering.

­ Sparse network extensions.

­ Beyond community detection (hub detection).

Applications.

4 Improved methods for community detection.

. Applications to distributed optimization (network diffusion, graph signal processing).

References.

H. Tiomoko Ali, R. Couillet, “Spectral community detection in heterogeneous large networks”, (submitted to) Journal of Multivariate Analysis, 2016.

F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, P. Zhang, “Spectral redemption in clustering sparse networks”, Proceedings of the National Academy of Sciences, 110(52), 20935-20940, 2013.

C. Bordenave, M. Lelarge, L. Massoulie, “Non-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs”, Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pp. 1347-1357, 2015.

A. Saade, F. Krzakala, L. Zdeborova, “Spectral clustering of graphs with the Bethe Hessian”, Advances in Neural Information Processing Systems, pp. 406-414, 2014.

108 / 113


Perspectives/ 109/113

Summary of Results and Perspectives I

Robust statistics.

4 Tyler, Maronna (and regularized) estimators

4 Elliptical data setting, deterministic outlier setting

4 Central limit theorem extensions

­ Joint mean and covariance robust estimation

­ Robust regression (preliminary works exist already, using strikingly different approaches)

Applications.

4 Statistical finance (portfolio estimation)

4 Localisation in array processing (robust GMUSIC)

4 Detectors in space time array processing

­ Correlation matrices in biostatistics, human science datasets, etc.

References.

R. Couillet, F. Pascal, J. W. Silverstein, “Robust Estimates of Covariance Matrices in the Large Dimensional Regime”, IEEE Transactions on Information Theory, vol. 60, no. 11, pp. 7269-7278, 2014.

109 / 113


Perspectives/ 110/113

Summary of Results and Perspectives II

R. Couillet, F. Pascal, J. W. Silverstein, “The Random Matrix Regime of Maronna's M-estimator with elliptically distributed samples”, Elsevier Journal of Multivariate Analysis, vol. 139, pp. 56-78, 2015.

T. Zhang, X. Cheng, A. Singer, “Marchenko-Pastur Law for Tyler's and Maronna's M-estimators”, arXiv:1401.3424, 2014.

R. Couillet, M. McKay, “Large Dimensional Analysis and Optimization of Robust Shrinkage Covariance Matrix Estimators”, Elsevier Journal of Multivariate Analysis, vol. 131, pp. 99-120, 2014.

D. Morales-Jimenez, R. Couillet, M. McKay, “Large Dimensional Analysis of Robust M-Estimators of Covariance with Outliers”, IEEE Transactions on Signal Processing, vol. 63, no. 21, pp. 5784-5797, 2015.

L. Yang, R. Couillet, M. McKay, “A Robust Statistics Approach to Minimum Variance Portfolio Optimization”, IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6684-6697, 2015.

R. Couillet, “Robust spiked random matrices and a robust G-MUSIC estimator”, Elsevier Journal of Multivariate Analysis, vol. 140, pp. 139-161, 2015.

A. Kammoun, R. Couillet, F. Pascal, M.-S. Alouini, “Optimal Design of the Adaptive Normalized Matched Filter Detector”, (submitted to) IEEE Transactions on Information Theory, 2016, arXiv preprint 1504.01252.

110 / 113


Perspectives/ 111/113

Summary of Results and Perspectives III

R. Couillet, A. Kammoun, F. Pascal, “Second order statistics of robust estimators of scatter. Application to GLRT detection for elliptical signals”, Elsevier Journal of Multivariate Analysis, vol. 143, pp. 249-274, 2016.

D. Donoho, A. Montanari, “High dimensional robust m-estimation: Asymptotic variance via approximate message passing”, Probability Theory and Related Fields, 1-35, 2013.

N. El Karoui, “Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results”, arXiv preprint arXiv:1311.2445, 2013.

111 / 113


Perspectives/ 112/113

Summary of Results and Perspectives I

Other works and ideas.

4 Spike random matrix sparse PCA

. Non-linear shrinkage methods

. Sparse kernel PCA

. Random signal processing on graph methods.

. Random matrix analysis of diffusion networks performance.

Applications.

4 Spike factor models in portfolio optimization

. Non-linear shrinkage in portfolio optimization, biostats

References.

R. Couillet, M. McKay, “Optimal block-sparse PCA for high dimensional correlated samples”, (submitted to) Journal of Multivariate Analysis, 2016.

J. Bun, J. P. Bouchaud, M. Potters, “On the overlaps between eigenvectors of correlated random matrices”, arXiv preprint arXiv:1603.04364, 2016.

O. Ledoit, M. Wolf, “Nonlinear shrinkage estimation of large-dimensional covariance matrices”, 2011.

112 / 113


Perspectives/ 113/113

The End

Thank you.

113 / 113