Independent Component Analysis (ICA) Method Review


Page 1: Independent component analysis

Independent Component Analysis (ICA)

Method Review

Page 2: Independent component analysis

INDEPENDENT COMPONENT ANALYSIS

DATA

Imagine that you are a weaver, and you have a loom of colorful strings. Each string represents a unique pattern in the data. With actual data, each of these strings would be a vector of numbers, and the way they combine can be described by a linear equation. In the image above, the strings are neat and well organized.

Page 3: Independent component analysis

MIXED UP DATA

INDEPENDENCE

INDEPENDENT COMPONENT ANALYSIS

Unfortunately, when we collect data in the real world, it does not come to us neat and organized. Our unique strings get mixed up with other strings and with random signals such as noise. In our example above, a monkey has come along and mixed up our strings. How do we untangle them?

Page 4: Independent component analysis

HOW TO UNMIX?

INDEPENDENT COMPONENT ANALYSIS

We might know something special about each string, maybe a feature like color, and unmix manually; however, if we are dealing with a huge dataset and don't have a clue about any special features, we are powerless. This is where ICA comes in. We start with our mixed data and assume that 1) we have mixed-up data (our loom) that is 2) composed of independent signals.

Page 5: Independent component analysis

INDEPENDENT COMPONENT ANALYSIS

[Figure: X = A × s. The mixed strings X (observed data) result from the "monkey madness" A (the mixing matrix) applied to the original strings s (the original data).]

We start with this mixed-up data, X, and we know that it was generated by the monkey applying some sequence of movements (the "monkey madness") to the original strings, s. We call this series of transformations the mixing matrix, A. This matrix consists of vectors of numbers that, when multiplied with s, produce the observed data X.
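To make this concrete, here is a minimal NumPy sketch of the generative model X = A s; the two source signals and the mixing matrix are made-up illustrative values, not from the slides:

```python
import numpy as np

t = np.linspace(0, 8, 1000)

# Two independent, non-Gaussian "strings" (source signals), one per row of s.
s = np.vstack([np.sin(2 * t),               # a smooth periodic string
               np.sign(np.sin(3 * t))])     # a square-wave string

# The "monkey madness": an arbitrary 2x2 mixing matrix A.
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

# Each row of X is now a blend of both sources (the tangled loom).
X = A @ s
```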

Page 6: Independent component analysis

INDEPENDENT COMPONENT ANALYSIS

s = A^{-1} X

To solve this problem and recover our original strings from the mixed ones, we just need to solve this equation for s. We know X, so we just need to figure out the inverse of A. This inverse is normally referred to as W, the un-mixing matrix. We are going to choose the numbers in this matrix so as to maximize the probability of our data.

[Figure: s = W × X. The un-mixing matrix W applied to the mixed strings X (observed data) recovers the original strings s (original data).]
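Continuing the sketch from Page 5: if A were known we could invert it directly, and when it is not, an off-the-shelf estimator such as scikit-learn's FastICA recovers the components from X alone (only up to the permutation and scaling ambiguities listed in the caveats later in this deck):

```python
import numpy as np
from sklearn.decomposition import FastICA

# If we happened to know A (from the sketch on Page 5), unmixing is literal:
W = np.linalg.inv(A)
s_recovered = W @ X          # matches s up to numerical error

# In practice A is unknown, so we estimate the unmixing from X alone.
ica = FastICA(n_components=2, random_state=0)
s_estimated = ica.fit_transform(X.T)   # FastICA expects (n_samples, n_features)
```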

Page 7: Independent component analysis

INDEPENDENT COMPONENT ANALYSIS

s = A^{-1} X = W X

What we basically do is model the CDF of each signal as the sigmoid function, because it increases from 0 to 1. The derivative of the sigmoid is then the density function, and we iteratively maximize the resulting likelihood until convergence to find the weights of this inverse matrix (details in the next slides!).
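Concretely, with the sigmoid as the CDF of each source, the density is its derivative:

g(s) = \frac{1}{1 + e^{-s}}, \qquad p_s(s) = g'(s) = g(s)\,\bigl(1 - g(s)\bigr)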

[Figure: the same un-mixing diagram as on Page 6: s = W × X.]

Page 8: Independent component analysis

Independent Component Analysis: How to find the weights with Maximum Likelihood Estimation?

Suppose that the distribution of each source s_i is given by a density p_s, and that the joint distribution of the sources s is given by:
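The formula on the slide was an image; reconstructed from the cited CS229 notes, independence means the joint density factorizes:

p(s) = \prod_{i=1}^{n} p_s(s_i)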

This implies the following density on x = As = W^{-1} s:
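Again reconstructing the slide's image from the CS229 notes, with w_i^T denoting the i-th row of W = A^{-1}:

p(x) = \prod_{i=1}^{n} p_s\!\left(w_i^T x\right) \cdot |W|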

All that remains is to specify a density for the individual sources, p_s. Equivalently, since the density is the derivative of the CDF, we can specify a CDF. It can't be Gaussian, so how about the sigmoid? (It increases from 0 to 1.)

CS229 Notes, Andrew Ng, 2012

Page 9: Independent component analysis

Independent Component Analysis: How to find the weights with Maximum Likelihood Estimation? (continued)

So we model the CDF of each independent signal with the sigmoid, and to get the probability of the signal at any particular time-point we take the derivative of the CDF (the PDF). To maximize this probability (to find our data), we want to make it as big as possible. The square matrix W is the parameter of our model, so given a training set of observations x^{(i)}, we write down the log likelihood and maximize it in terms of W. A useful matrix-calculus fact gives the gradient of log |W|, and together these yield a "one at a time" (stochastic gradient ascent) rule; this is how we update our weights until convergence:
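The slide's equations were images; reconstructed from the cited CS229 notes, the log likelihood, the matrix-calculus fact, and the resulting update are:

\ell(W) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \log g'\!\left(w_j^T x^{(i)}\right) + \log |W| \right)

\nabla_W |W| = |W| \left(W^{-1}\right)^T

W := W + \alpha \left( \begin{bmatrix} 1 - 2g(w_1^T x^{(i)}) \\ \vdots \\ 1 - 2g(w_n^T x^{(i)}) \end{bmatrix} x^{(i)T} + \left(W^T\right)^{-1} \right)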

CS229 Notes, Andrew Ng, 2012
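As a sketch only: a minimal NumPy implementation of this update rule (function and variable names are mine, assuming sigmoid source CDFs as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_mle(X, alpha=0.001, n_iters=50, seed=0):
    """Maximum-likelihood ICA; X has one mixed observation x^(i) per row."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # initial unmixing guess
    for _ in range(n_iters):
        for x in X:                       # "one at a time": stochastic ascent
            g = sigmoid(W @ x)            # g(w_j^T x) for every row w_j of W
            grad = np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T)
            W += alpha * grad             # ascend the log likelihood
    return W                              # recovered sources: X @ W.T
```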

Page 10: Independent component analysis

FastICA Modification: “ICA with Reference” is a modification of FastICA

CS229 Notes, Andrew Ng, 2012

Negative entropy (negentropy) is used to measure non-Gaussianity, and thereby independence, in the formula below. The projected variable w^T x is compared against a Gaussian variable through a non-quadratic contrast function G, and we maximize J(y) subject to ||w||^2 = 1. If we choose the contrast whose derivative is g(u) = u^3 (the kurtosis-based choice), the update takes the form shown below. "Inspired" by this form of the update, we can impose an additional constraint that incorporates prior information about the components, so that the estimate no longer maximizes independence alone but is also close to the reference, r.
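The slide's formulas were images. The standard FastICA quantities they appear to describe, in the usual notation (whitened x, standard Gaussian ν, non-quadratic contrast G with derivative g), are:

J(y) \propto \left[ E\{G(y)\} - E\{G(\nu)\} \right]^2, \qquad y = w^T x, \quad \|w\|^2 = 1

w^{+} = E\{x\, g(w^T x)\} - E\{g'(w^T x)\}\, w, \qquad w \leftarrow w^{+} / \|w^{+}\|

With g(u) = u^3, the update reduces to the kurtosis-based form w^{+} = E\{x (w^T x)^3\} - 3w.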

Page 11: Independent component analysis

ICA CAVEATS

- Permutation of the original sources is ambiguous, but this doesn't matter for most applications.

- The data are assumed to be non-Gaussian. If the data are Gaussian, there is an arbitrary rotational component in the mixing matrix that cannot be determined from the data, so we cannot recover the original sources.

- There is no way to recover the scaling of the weights. If a single column of matrix A were scaled by a factor of 2 and the corresponding source were scaled by a factor of ½, then there is again no way, given only the x^{(i)}'s, to determine that this had happened. (See the worked identity below.)
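The scaling ambiguity follows directly from the model: writing x = As = \sum_j a_j s_j, we have

\sum_j a_j s_j = \sum_j (2 a_j)\left(\tfrac{1}{2} s_j\right),

so the rescaled pair generates exactly the same observations x.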

Page 12: Independent component analysis

Why can’t the data be Gaussian?

“Suppose we observe some x = As, where A is our mixing matrix [and the sources are s ~ N(0, I)]. The distribution of x will also be Gaussian, with zero mean and covariance E[x x^T] = E[A s s^T A^T] = A A^T. Now, let R be an arbitrary orthogonal (less formally, a rotation/reflection) matrix, so that R R^T = R^T R = I, and let A' = AR. Then if the data had been mixed according to A' instead of A, we would have instead observed x' = A's. The distribution of x' is also Gaussian, with zero mean and covariance E[x' (x')^T] = E[A' s s^T (A')^T] = E[A R s s^T (A R)^T] = A R R^T A^T = A A^T. Hence, whether the mixing matrix is A or A', we would observe data from a N(0, A A^T) distribution. Thus, there is no way to tell if the sources were mixed using A or A'. So, there is an arbitrary rotational component in the mixing matrix that cannot be determined from the data, and we cannot recover the original sources.”

CS229 Notes, Andrew Ng, 2012
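A quick numerical illustration of this argument (the matrix A and rotation angle are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.5],
              [0.3, 2.0]])
theta = 0.7                                       # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal: R @ R.T = I

s = rng.standard_normal((2, 100_000))             # Gaussian sources ~ N(0, I)
X = A @ s                                         # mixed with A
X_prime = (A @ R) @ s                             # mixed with A' = A R

print(np.cov(X))        # ~ A A^T
print(np.cov(X_prime))  # ~ A R R^T A^T = A A^T, indistinguishable from above
```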