TRANSCRIPT
Non-Gaussian Component Analysis
Abhishek Shetty
Joint Work with Navin Goyal
Microsoft Research India
December 20, 2017
Motivation
▶ A problem that arises naturally in the data sciences is looking for meaningful low-dimensional structure in high-dimensional probability spaces.
▶ One such problem is the following: let X ∈ R^d be a random variable and let A ∈ R^{n×d} be a linear transformation, with n ≥ d. Consider an independent random variable η representing noise and let Y = AX + η. We would like to be able to study X.
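As a concrete illustration, here is a minimal sketch of sampling from this model in Python. The uniform latent distribution and the particular dimensions are assumptions made purely for this example; the model itself does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10, 3, 5000  # ambient dimension, latent dimension, sample count

# Illustrative non-Gaussian latent signal: uniform on [-sqrt(3), sqrt(3)],
# which has mean 0 and variance 1.
X = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(m, d))

A = rng.standard_normal((n, d))    # the (unknown) linear map A in R^{n x d}
eta = rng.standard_normal((m, n))  # independent Gaussian noise

Y = X @ A.T + eta                  # observed samples Y = AX + eta
print(Y.shape)                     # (5000, 10): only Y is observed
```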
Gaussian as a Model for Noise I
▶ What is a reasonable model for the noise? That, of course, depends on the problem at hand. But a generally reasonable model for noise is the Normal, or Gaussian, distribution.
▶ The standard Gaussian in R is a random variable with density given by

  φ(x) = (1/√(2π)) e^(−x²/2)

▶ Why is the Gaussian a natural model for noise?
▶ One answer is the central limit theorem.

Theorem
Let X_1, …, X_n be IID random variables with mean 0 and variance 1. Then

  Z_n = (X_1 + ⋯ + X_n)/√n → N(0, 1) weakly.
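A quick empirical check of the theorem (not from the talk): centered exponential summands, which are individually far from Gaussian, pass a normality test once summed and normalized. The exponential is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 1000, 20000  # summands per sample, number of samples of Z_n

# X_i: IID with mean 0 and variance 1, here centered Exponential(1) variables.
X = rng.exponential(1.0, size=(m, n)) - 1.0

Z_n = X.sum(axis=1) / np.sqrt(n)  # Z_n = (X_1 + ... + X_n) / sqrt(n)

# Compare against N(0, 1); the KS statistic should be small for large n.
print(stats.kstest(Z_n, "norm"))
```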
Gaussian as a Model for Noise II
▶ But why is the Gaussian the point of convergence? Any sufficiently complete answer would essentially prove the CLT.
▶ As a first check, though, note that any point of convergence must be a fixed point of the process defining Z_n. Indeed, the Gaussian is the unique such fixed point (among distributions of finite variance). That is, for independent Y_i ∼ N(0, 1), we have

  (Y_1 + ⋯ + Y_n)/√n ∼ N(0, 1)

▶ Thus, under reasonable conditions on the noise (finite variance, and the noise being in equilibrium), the Gaussian is the unique distribution that models the behavior.
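The fixed-point identity is exact in law; this small check (an illustration, not a proof) just confirms the first two moments of the normalized sum of standard Gaussians.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 100_000

Y = rng.standard_normal((m, n))  # Y_i ~ N(0, 1), independent
Z = Y.sum(axis=1) / np.sqrt(n)   # the normalized sum

print(Z.mean(), Z.var())         # close to 0 and 1; Z is exactly N(0, 1) in law
```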
Problem Statement
Definition
We say that a random variable X ∈ R^n is isotropic if EX = 0 and EXX^T = I_n.

▶ Equivalently, the projection of the random variable onto every direction on the sphere has mean zero and unit variance.

Definition (Tan-Vershynin '17, Blanchard et al. '06)
We say that a random variable X ∈ R^n follows the isotropic NGCA model if X = (X̃, Z) ∈ V ⊕ V^⊥, where X̃ is the non-Gaussian component in V, Z ∼ N(0, I), and X is isotropic.

▶ Aim: to approximate V from samples of X.
▶ Note: the problem described earlier does not itself impose the isotropy condition, but isotropy can be achieved easily by whitening the data (a sketch follows below).
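As referenced in the note above, a minimal whitening sketch: center the data and apply the inverse square root of the empirical covariance. The eigendecomposition route is one standard choice, not anything specific to the talk.

```python
import numpy as np

def whiten(Y):
    """Center Y and map it by Cov(Y)^{-1/2} so the result is isotropic."""
    Yc = Y - Y.mean(axis=0)
    cov = np.cov(Yc, rowvar=False)
    # Symmetric inverse square root via the eigendecomposition of cov
    # (assumes cov has full rank).
    w, U = np.linalg.eigh(cov)
    return Yc @ (U @ np.diag(w ** -0.5) @ U.T)

# After whitening, the empirical covariance is (close to) the identity.
rng = np.random.default_rng(0)
Y = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))
print(np.round(np.cov(whiten(Y), rowvar=False), 2))
```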
Projection Pursuit
▶ A natural approach to the problem is a local, projection-based one, i.e., considering the random variables ⟨X, a⟩ for a ∈ S^{n−1}. One advantage is that these are now random variables in R, for which estimation is considerably simpler.
▶ To measure progress, we need a "contrast" function measuring distance from the Gaussian.
▶ Finding a non-Gaussian direction then reduces to maximizing the contrast function over the sphere, which we aim to achieve through first-order methods. (A sketch with a simple stand-in contrast appears below.)
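A sketch of the projection idea. The talk's contrast is entropy-based; here excess kurtosis is assumed as a simpler stand-in, and random search replaces first-order steps, purely for illustration.

```python
import numpy as np

def contrast(t):
    """Absolute excess kurtosis of 1-D samples: zero for a Gaussian,
    so larger values suggest a more non-Gaussian projection."""
    t = (t - t.mean()) / t.std()
    return abs(np.mean(t ** 4) - 3.0)

def most_non_gaussian_direction(X, trials=200, seed=0):
    """Search random unit vectors a and keep the one maximizing the
    contrast of the projection <X, a>."""
    rng = np.random.default_rng(seed)
    best_a, best_val = None, -np.inf
    for _ in range(trials):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)   # a lies on the sphere S^{n-1}
        val = contrast(X @ a)    # 1-D projection: easy to estimate
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val
```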
Entropy as a Measure of "Gaussianness" I

▶ To find the non-Gaussian subspace, we need a measure of how far a given distribution is from Gaussian. Entropy is one natural candidate.
▶ The Shannon entropy functional in the discrete case:

  H(P) = −∑_x p(x) log p(x)

Entropy tries to capture the amount of randomness in a distribution. For example, a fair coin has 1 bit of entropy, while a biased coin has less than 1 bit. Entropy roughly corresponds to the average number of bits needed to encode samples from the distribution.
▶ A natural (maybe not) generalization to random variables supported on R is the differential entropy

  h(X) = −∫ p(x) log p(x) dx
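A small numerical companion to these definitions: the coin entropies mentioned above, plus the closed-form differential entropy of a Gaussian. The formula h(N(0, σ²)) = ½ log(2πeσ²) is standard, not from the slides.

```python
import numpy as np

def shannon_entropy(p):
    """H(P) = -sum_x p(x) log2 p(x), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))    # biased coin: ~0.469 bits

# Differential entropy of N(0, sigma^2), in nats: 0.5 * log(2*pi*e*sigma^2).
sigma = 1.0
print(0.5 * np.log(2 * np.pi * np.e * sigma ** 2))   # ~1.4189
```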
Entropy as a Measure of "Gaussianness" II

▶ Differential entropy is not as nicely behaved as its discrete counterpart (it can be negative and is not invariant under scaling), but it still shares several key properties, such as tensorization and translation invariance.
▶ Entropy is of enormous significance in probability theory, information theory, physics, and computer science.
▶ Another important property of the Gaussian distribution is that, among random variables with a given variance σ², the Gaussian N(0, σ²) maximizes the differential entropy. (A short derivation follows this list.)
▶ The fact above also points us towards an interpretation of central limit theorems as versions of the Second Law of Thermodynamics. This view has in fact been fruitful in finding bounds on convergence rates for the CLT.
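For completeness, the standard derivation of the maximum-entropy property via the Kullback-Leibler divergence (a textbook argument, not taken from the slides):

```latex
% Let X have density p, mean 0 and variance sigma^2, and let phi denote
% the N(0, sigma^2) density. Non-negativity of the KL divergence gives:
\begin{align*}
0 \;\le\; D(p \,\|\, \varphi)
  &= -h(X) - \int p(x)\,\log \varphi(x)\,dx \\
  &= -h(X) + \tfrac{1}{2}\log(2\pi\sigma^2)
     + \frac{\mathbb{E}\,X^2}{2\sigma^2}
  \;=\; -h(X) + \tfrac{1}{2}\log\bigl(2\pi e\,\sigma^2\bigr).
\end{align*}
% Hence h(X) <= (1/2) log(2 pi e sigma^2) = h(N(0, sigma^2)),
% with equality iff p is the Gaussian density.
```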
Quick Overview of the Algorithm
▶ Pick a point at random from the sphere.
▶ Project the random variable onto that direction.
▶ Estimate the entropy and its gradient, and take a gradient step that decreases the entropy. Repeat until the entropy is small (since the Gaussian maximizes entropy at fixed variance, low projected entropy signals a non-Gaussian direction).
▶ Eliminate the direction found and repeat in the lower-dimensional space. (A rough sketch of the loop follows the theorem below.)

Theorem (informal)
There exists an algorithm that, given access to samples from a random variable in the isotropic NGCA model, outputs a subspace that approximates the non-Gaussian subspace. The running time is polynomial in the dimension (and the parameters).
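One plausible reading of this loop as code, assuming a crude histogram plug-in entropy estimator and a finite-difference gradient; both are illustrative stand-ins for the estimators analyzed in the talk, and the elimination (deflation) step is omitted.

```python
import numpy as np

def entropy_1d(t, bins=64):
    """Plug-in estimate of the differential entropy of 1-D samples t,
    using a histogram density (a crude but simple estimator)."""
    p, edges = np.histogram(t, bins=bins, density=True)
    w = np.diff(edges)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]) * w[mask])

def descend_entropy(X, steps=200, lr=0.05, eps=1e-3, seed=0):
    """Decrease the entropy of the projection <X, a> over the sphere,
    using a finite-difference gradient (a sketch, not the talk's method)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    a = rng.standard_normal(n)
    a /= np.linalg.norm(a)                   # random starting direction
    for _ in range(steps):
        base = entropy_1d(X @ a)
        g = np.zeros(n)
        for i in range(n):                   # finite-difference gradient
            e = np.zeros(n)
            e[i] = eps
            g[i] = (entropy_1d(X @ (a + e)) - base) / eps
        a -= lr * g                          # step that decreases entropy
        a /= np.linalg.norm(a)               # project back onto the sphere
    return a
```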
Connections
▶ The approach taken brings up several interesting connections to probability and other branches of mathematics.
▶ Even in the simple two-dimensional case, the behavior of entropy in this problem is related to the (adjoint) Ornstein-Uhlenbeck process, which in turn relates to several important inequalities, such as the hypercontractivity inequalities and the logarithmic Sobolev inequalities.
▶ The gradient flow of entropy is a problem studied in various contexts, with interesting connections to optimal transport and dynamical systems.
Further Directions
▶ Finding other characterizations of the Gaussian is of mathematical and algorithmic interest.
▶ Relaxing the assumptions on the random variable.
Thank You
Questions?