TRANSCRIPT
Non-Gaussian Component Analysis
Abhishek Shetty
Joint Work with Navin Goyal
Microsoft Research India
December 20, 2017
Motivation
▶ A problem that arises naturally in the data sciences is looking for meaningful low-dimensional structure in high-dimensional probability spaces.
▶ One such problem is the following: let X ∈ R^d be a random variable and let A ∈ R^{n×d} be a linear transformation, with n ≥ d. Consider an independent random variable η representing noise and let Y = AX + η. We would like to be able to study X.
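As a concrete illustration, here is a minimal sketch of sampling from this model in Python. The uniform latent distribution and the particular dimensions are assumptions made purely for this example; the model itself does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 3, 5000  # ambient dimension, latent dimension, sample count

# Illustrative non-Gaussian latent signal: uniform on [-sqrt(3), sqrt(3)],
# which has mean 0 and variance 1.
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(m, d))

A = rng.standard_normal((n, d))    # the (unknown) linear map A in R^{n x d}
eta = rng.standard_normal((m, n))  # independent Gaussian noise

Y = X @ A.T + eta                  # observed samples Y = AX + eta
print(Y.shape)                     # (5000, 10): only Y is observed
```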
Gaussian as a Model for Noise I
▶ What is a reasonable model for the noise? That, of course, depends on the problem at hand. But a generally reasonable model for noise is the Normal, or Gaussian, distribution.
▶ The standard Gaussian in R is a random variable with density given by

  φ(x) = (1/√(2π)) e^(−x²/2)

▶ Why is the Gaussian a natural model for noise?
▶ One answer is the central limit theorem.

Theorem
Let X_1, …, X_n be IID random variables with mean 0 and variance 1. Then

  Z_n = (X_1 + ⋯ + X_n)/√n → N(0, 1) weakly.
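A quick empirical check of the theorem (not from the talk): centered exponential summands, which are individually far from Gaussian, pass a normality test once summed and normalized. The exponential is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 1000, 20000  # summands per sample, number of samples of Z_n

# X_i: IID with mean 0 and variance 1, here centered Exponential(1) variables.
X = rng.exponential(1.0, size=(m, n)) - 1.0

Z_n = X.sum(axis=1) / np.sqrt(n)  # Z_n = (X_1 + ... + X_n) / sqrt(n)

# Compare against N(0, 1); the KS statistic should be small for large n.
print(stats.kstest(Z_n, "norm"))
```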
Gaussian as a Model for Noise II
▶ But why is the Gaussian the point of convergence? Any sufficiently complete answer would essentially prove the CLT.
▶ As a first check, though, note that any point of convergence must be a fixed point of the process defining Z_n. Indeed, the Gaussian is the unique such fixed point (among distributions of finite variance). That is, for independent Y_i ∼ N(0, 1), we have

  (Y_1 + ⋯ + Y_n)/√n ∼ N(0, 1)

▶ Thus, under reasonable conditions on the noise (finite variance, and the noise being in equilibrium), the Gaussian is the unique distribution that models the behavior.
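The fixed-point identity is exact in law; this small check (an illustration, not a proof) just confirms the first two moments of the normalized sum of standard Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 100_000

Y = rng.standard_normal((m, n))  # Y_i ~ N(0, 1), independent
Z = Y.sum(axis=1) / np.sqrt(n)   # the normalized sum

print(Z.mean(), Z.var())         # close to 0 and 1; Z is exactly N(0, 1) in law
```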
Problem Statement
Definition
We say that a random variable X ∈ R^n is isotropic if EX = 0 and EXX^T = I_n.

▶ Equivalently, the projection of the random variable onto every direction on the sphere has mean zero and unit variance.

Definition (Tan-Vershynin '17, Blanchard et al. '06)
We say that a random variable X ∈ R^n follows the isotropic NGCA model if X = (X̃, Z) ∈ V ⊕ V^⊥, where X̃ is the non-Gaussian component in V, Z ∼ N(0, I), and X is isotropic.

▶ Aim: to approximate V from samples of X.
▶ Note: the problem described earlier does not itself impose the isotropy condition, but isotropy can be achieved easily by whitening the data (a sketch follows below).
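As referenced in the note above, a minimal whitening sketch: center the data and apply the inverse square root of the empirical covariance. The eigendecomposition route is one standard choice, not anything specific to the talk.

```python
import numpy as np

def whiten(Y):
    """Center Y and map it by Cov(Y)^{-1/2} so the result is isotropic."""
    Yc = Y - Y.mean(axis=0)
    cov = np.cov(Yc, rowvar=False)
    # Symmetric inverse square root via the eigendecomposition of cov
    # (assumes cov has full rank).
    w, U = np.linalg.eigh(cov)
    return Yc @ (U @ np.diag(w ** -0.5) @ U.T)

# After whitening, the empirical covariance is (close to) the identity.
rng = np.random.default_rng(0)
Y = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))
print(np.round(np.cov(whiten(Y), rowvar=False), 2))
```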
Projection Pursuit
▶ A natural approach to the problem is a local, projection-based one, i.e., considering the random variables ⟨X, a⟩ for a ∈ S^{n−1}. One advantage is that these are now random variables in R, for which estimation is considerably simpler.
▶ To measure progress, we need a "contrast" function measuring distance from the Gaussian.
▶ Finding a non-Gaussian direction then reduces to maximizing the contrast function over the sphere, which we aim to achieve through first-order methods. (A sketch with a simple stand-in contrast appears below.)
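A sketch of the projection idea. The talk's contrast is entropy-based; here excess kurtosis is assumed as a simpler stand-in, and random search replaces first-order steps, purely for illustration.

```python
import numpy as np

def contrast(t):
    """Absolute excess kurtosis of 1-D samples: zero for a Gaussian,
    so larger values suggest a more non-Gaussian projection."""
    t = (t - t.mean()) / t.std()
    return abs(np.mean(t ** 4) - 3.0)

def most_non_gaussian_direction(X, trials=200, seed=0):
    """Search random unit vectors a and keep the one maximizing the
    contrast of the projection <X, a>."""
    rng = np.random.default_rng(seed)
    best_a, best_val = None, -np.inf
    for _ in range(trials):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)   # a lies on the sphere S^{n-1}
        val = contrast(X @ a)    # 1-D projection: easy to estimate
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val
```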
Entropy as a Measure of "Gaussianness" I

▶ To find the non-Gaussian subspace, we need a measure of how far a given distribution is from Gaussian. Entropy is one natural candidate.
▶ The Shannon entropy functional in the discrete case:

  H(P) = −∑_x p(x) log p(x)

Entropy tries to capture the amount of randomness in a distribution. For example, a fair coin has 1 bit of entropy, while a biased coin has less than 1 bit. Entropy roughly corresponds to the average number of bits needed to encode samples from the distribution.
▶ A natural (maybe not) generalization to random variables supported on R is the differential entropy

  h(X) = −∫ p(x) log p(x) dx
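A small numerical companion to these definitions: the coin entropies mentioned above, plus the closed-form differential entropy of a Gaussian. The formula h(N(0, σ²)) = ½ log(2πeσ²) is standard, not from the slides.

```python
import numpy as np

def shannon_entropy(p):
    """H(P) = -sum_x p(x) log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))    # biased coin: ~0.469 bits

# Differential entropy of N(0, sigma^2), in nats: 0.5 * log(2*pi*e*sigma^2).
sigma = 1.0
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))   # ~1.4189
```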
Entropy as a Measure of "Gaussianness" II

▶ Differential entropy is not as nicely behaved as its discrete counterpart (it can be negative and is not invariant under scaling), but it still shares several key properties, such as tensorization and translation invariance.
▶ Entropy is of enormous significance in probability theory, information theory, physics, and computer science.
▶ Another important property of the Gaussian distribution is that, among random variables with a given variance σ², the Gaussian N(0, σ²) maximizes the differential entropy. (A short derivation follows this list.)
▶ The fact above also points us towards an interpretation of central limit theorems as versions of the Second Law of Thermodynamics. This view has in fact been fruitful in finding bounds on convergence rates for the CLT.
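For completeness, the standard derivation of the maximum-entropy property via the Kullback-Leibler divergence (a textbook argument, not taken from the slides):

```latex
% Let X have density p, mean 0 and variance sigma^2, and let phi denote
% the N(0, sigma^2) density. Non-negativity of the KL divergence gives:
\begin{align*}
0 \;\le\; D(p \,\|\, \varphi)
  &= -h(X) - \int p(x)\,\log \varphi(x)\,dx \\
  &= -h(X) + \tfrac{1}{2}\log(2\pi\sigma^2)
     + \frac{\mathbb{E}\,X^2}{2\sigma^2}
  \;=\; -h(X) + \tfrac{1}{2}\log\bigl(2\pi e\,\sigma^2\bigr).
\end{align*}
% Hence h(X) <= (1/2) log(2 pi e sigma^2) = h(N(0, sigma^2)),
% with equality iff p is the Gaussian density.
```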
Quick Overview of the Algorithm
▶ Pick a point at random from the sphere.
▶ Project the random variable onto that direction.
▶ Estimate the entropy and its gradient, and take a gradient step that decreases the entropy. Repeat until the entropy is small (since the Gaussian maximizes entropy at fixed variance, low projected entropy signals a non-Gaussian direction).
▶ Eliminate the direction found and repeat in the lower-dimensional space. (A rough sketch of the loop follows the theorem below.)

Theorem (informal)
There exists an algorithm that, given access to samples from a random variable in the isotropic NGCA model, outputs a subspace that approximates the non-Gaussian subspace. The running time is polynomial in the dimension (and the parameters).
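One plausible reading of this loop as code, assuming a crude histogram plug-in entropy estimator and a finite-difference gradient; both are illustrative stand-ins for the estimators analyzed in the talk, and the elimination (deflation) step is omitted.

```python
import numpy as np

def entropy_1d(t, bins=64):
    """Plug-in estimate of the differential entropy of 1-D samples t,
    using a histogram density (a crude but simple estimator)."""
    p, edges = np.histogram(t, bins=bins, density=True)
    w = np.diff(edges)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]) * w[mask])

def descend_entropy(X, steps=200, lr=0.05, eps=1e-3, seed=0):
    """Decrease the entropy of the projection <X, a> over the sphere,
    using a finite-difference gradient (a sketch, not the talk's method)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    a = rng.standard_normal(n)
    a /= np.linalg.norm(a)                   # random starting direction
    for _ in range(steps):
        base = entropy_1d(X @ a)
        g = np.zeros(n)
        for i in range(n):                   # finite-difference gradient
            e = np.zeros(n)
            e[i] = eps
            g[i] = (entropy_1d(X @ (a + e)) - base) / eps
        a -= lr * g                          # step that decreases entropy
        a /= np.linalg.norm(a)               # project back onto the sphere
    return a
```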
Connections
▶ The approach taken brings up several interesting connections to probability and other branches of mathematics.
▶ Even in the simple two-dimensional case, the behavior of entropy in this problem is related to the (adjoint) Ornstein-Uhlenbeck process, which in turn relates to several important inequalities, such as the hypercontractivity inequalities and the logarithmic Sobolev inequalities.
▶ The gradient flow of entropy is a problem studied in various contexts, with interesting connections to optimal transport and dynamical systems.
Further Directions
▶ Finding other characterizations of the Gaussian is of mathematical and algorithmic interest.
▶ Relaxing the assumptions on the random variable.
Thank You
Questions?