
Page 1: Self-Organizing Maps

Self-Organizing Maps

• Projection of p-dimensional observations onto a two- (or one-) dimensional grid space

• Constrained version of K-means clustering

– Prototypes lie in a one- or two-dimensional manifold (constrained topological map; Teuvo Kohonen, 1993)

• K prototypes arranged on a rectangular or hexagonal grid

• Each prototype is indexed by an integer pair ℓj ∈ Q1 × Q2, where Q1 = {1, …, q1} and Q2 = {1, …, q2}

(so K = q1 × q2)

• High-dimensional observations projected to the two-dimensional coordinate system

Page 2: [figure]
Page 3: Self-Organizing Maps

SOM Algorithm

• Prototypes mj, j = 1, …, K, are initialized

• Each observation xi is processed one at a time to find the closest prototype mj in Euclidean distance in the p-dimensional space

• All neighbors of mj, say mk, move toward xi as mk ← mk + α(xi − mk)

• Neighbors are all mk such that the distance between mj and mk is smaller than a threshold r (the neighborhood includes mj itself)

– Distance is defined on the grid Q1 × Q2, not in the p-dimensional space

• SOM performance depends on the learning rate α and the threshold r

– Typically, α is decreased from 1 to 0 and r from R (a predefined value) to 1 over, say, 3000 iterations
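A minimal sketch of this update loop, assuming a rectangular grid, prototypes initialized from randomly chosen observations, and the linear decay schedules described above (the function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def train_som(X, q1, q2, n_iter=3000, R=3.0, seed=0):
    """Online SOM on a rectangular q1 x q2 grid (K = q1*q2 prototypes).

    alpha decays 1 -> 0 and the grid radius r decays R -> 1, as on the slide.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # integer grid coordinates l_j in Q1 x Q2, flattened to K rows
    grid = np.array([(a, b) for a in range(1, q1 + 1)
                            for b in range(1, q2 + 1)], dtype=float)
    # initialize prototypes m_j from randomly chosen observations (assumes n >= K)
    M = X[rng.choice(n, q1 * q2, replace=False)].astype(float)

    for t in range(n_iter):
        alpha = 1.0 - t / n_iter                 # learning rate: 1 -> 0
        r = R - (R - 1.0) * t / n_iter           # neighborhood radius: R -> 1
        x = X[rng.integers(n)]                   # process one observation at a time
        j = ((M - x) ** 2).sum(axis=1).argmin()  # closest prototype in R^p
        # neighbors: distance on the grid (not in R^p) below r; includes m_j itself
        nb = np.linalg.norm(grid - grid[j], axis=1) <= r
        M[nb] += alpha * (x - M[nb])             # m_k <- m_k + alpha (x_i - m_k)
    return M, grid
```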

Page 4: [figure]
Page 5: [figure]
Page 6: Self-Organizing Maps

SOM properties

• If r is small enough, each neighborhood contains only one point, so the spatial connection between prototypes is lost and the algorithm converges to a local minimum of K-means clustering

• To check whether the constraint is reasonable, compute and compare the reconstruction error Σ‖x − m‖² for both methods (SOM's is bigger, but the two should be similar)
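A sketch of that check, usable with the prototypes returned by the `train_som` sketch above or with K-means centroids (the helper name is illustrative):

```python
import numpy as np

def reconstruction_error(X, M):
    # sum_i ||x_i - m_j(i)||^2, charging each observation to its closest
    # prototype; comparable across SOM prototypes and K-means centroids
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # n x K squared distances
    return d2.min(axis=1).sum()
```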

Page 7: [figure]
Page 8: Self-Organizing Maps

Tamayo et al. (1999; GeneCluster)

• Self-organizing maps (SOM) on microarray data

– Hematopoietic cell lines (HL60, U937, Jurkat, and NB4): 4×3 SOM

– Yeast data in Eisen et al. reanalyzed with a 6×5 SOM

Page 9: [figure]
Page 10: Self-Organizing Maps

Principal Component Analysis

• Data xi, i = 1, …, n, are from the p-dimensional space (n ≥ p)

– Data matrix: X (n × p)

• Singular value decomposition X = U Λ Vᵀ, where

– Λ is a non-negative diagonal matrix whose decreasing diagonal entries are the singular values λi,

– U (n × p) has orthonormal columns (uiᵀuj = 1 if i = j, 0 if i ≠ j), and

– V (p × p) is an orthogonal matrix

• The principal components are the columns of XV (= UΛ)

– X and Λ have the same rank, with at most p non-zero singular values
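A minimal numeric sketch of this decomposition, assuming a toy centered data matrix (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy stand-in for an n x p data matrix
X = X - X.mean(axis=0)                 # column-center the data

U, lam, Vt = np.linalg.svd(X, full_matrices=False)  # X = U Lambda V^T; lam decreasing
PC = X @ Vt.T                          # principal components: the columns of XV
assert np.allclose(PC, U * lam)        # XV = U Lambda: column j of U scaled by lambda_j
```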

Page 11: [figure]
Page 12: [figure]
Page 13: Self-Organizing Maps

PCA properties

• The first column of XV (= UΛ) is the 1st principal component, which represents the direction with the largest variance (the first singular value represents its magnitude)

• The second column is for the second-largest variance, uncorrelated with the first, and so on

• The first q columns, q < p, of XV are the linear projection of X into q dimensions with the largest variance

• Let X̃ = U Λq Vᵀ, where Λq is the diagonal matrix Λ with only the first q diagonals kept non-zero; X̃ is the best possible rank-q approximation of X
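A small numeric check of this rank-q statement (the Eckart–Young result), reusing the same toy setup as the snippet above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)
U, lam, Vt = np.linalg.svd(X, full_matrices=False)

q = 2
X_q = (U[:, :q] * lam[:q]) @ Vt[:q]    # U Lambda_q V^T: only q non-zero singular values
# X_q is the best rank-q least-squares approximation of X; its squared error
# equals the sum of the discarded squared singular values
assert np.isclose(((X - X_q) ** 2).sum(), (lam[q:] ** 2).sum())
```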

Page 14: Self-Organizing Maps

Traditional PCA

• Variance–covariance matrix S computed from the data X (n × p), with columns centered

– Eigenvalue decomposition: S = C D Cᵀ, with C an orthogonal matrix

– (n − 1)S = XᵀX = (U Λ Vᵀ)ᵀ (U Λ Vᵀ) = V Λ Uᵀ U Λ Vᵀ = V Λ² Vᵀ

– Thus, D = Λ²/(n − 1) and C = V
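A sketch verifying this equivalence numerically, under the same toy setup as the earlier snippets (note that eigenvector columns match V only up to sign):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)                    # centered, so (n-1)S = X^T X
n = X.shape[0]

S = X.T @ X / (n - 1)                     # sample variance-covariance matrix
evals, C = np.linalg.eigh(S)              # S = C D C^T (eigh returns ascending order)
evals, C = evals[::-1], C[:, ::-1]        # flip to decreasing, to match Lambda

U, lam, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(evals, lam ** 2 / (n - 1))   # D = Lambda^2 / (n - 1)
assert np.allclose(np.abs(C), np.abs(Vt.T))     # C = V up to column signs
```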

Page 15: Self-Organizing Maps

Principal Curves and Surfaces

• Let f(λ) be a parameterized smooth curve in the p-dimensional space

• For a data point x, let λf(x) define the closest point on the curve to x

• Then f(λ) is the principal curve for the random vector X if f(λ) = E[X | λf(X) = λ]

• Thus, f(λ) is the average of all data points that project to it

Page 16: [figure]
Page 17: Self-Organizing Maps

Algorithm for finding the principal curve

• Let f(λ) have coordinates f(λ) = [f1(λ), f2(λ), …, fp(λ)], where the random vector X = [X1, X2, …, Xp]

• Then iterate the following alternating steps until convergence:

– (a) fj(λ) ← E[Xj | λf(X) = λ], j = 1, …, p

– (b) λf(x) ← argminλ ‖x − f(λ)‖²

• The solution is the principal curve for the distribution of X
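A minimal sketch of this alternation, assuming a Gaussian kernel smoother for the conditional-expectation step (a), a first-principal-component start, and a grid search for the projection step (b); bandwidth and grid size are illustrative choices:

```python
import numpy as np

def smooth(lam, y, grid, h):
    # kernel estimate of E[y | lambda] on a grid (stands in for any scatterplot smoother)
    w = np.exp(-0.5 * ((grid[:, None] - lam[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

def principal_curve(X, n_iter=10):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]                        # start lambda from first-PC projections
    for _ in range(n_iter):
        grid = np.linspace(lam.min(), lam.max(), 200)
        h = 0.2 * lam.std()                 # illustrative bandwidth choice
        # step (a): f_j(lambda) <- E[X_j | lambda_f(X) = lambda], coordinate-wise
        f = np.stack([smooth(lam, Xc[:, j], grid, h)
                      for j in range(Xc.shape[1])], axis=1)
        # step (b): lambda_f(x) <- argmin_lambda ||x - f(lambda)||^2 over the grid
        d2 = ((Xc[:, None, :] - f[None, :, :]) ** 2).sum(axis=2)
        lam = grid[d2.argmin(axis=1)]
    return lam, f
```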

Page 18: [figure]
Page 19: Self-Organizing Maps

Multidimensional scaling (MDS)

• Observations x1, x2, …, xn in the p-dimensional space with all pairwise distances (or dissimilarity measures) dij

• MDS tries to preserve the structure of the original pair-wise distances as much as possible

• Then, seek the vectors z1, z2, …, zn in the k-dimensional space (k << p) by minimizing the "stress function"

– SD(z1, …, zn) = [Σi≠j (dij − ‖zi − zj‖)²]^(1/2)

– Kruskal–Shepard scaling (or least-squares scaling)

– A gradient descent algorithm is used to find the solution
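A gradient-descent sketch for this least-squares criterion, descending on the squared stress (which has the same minimizers; the constant factor in the gradient is absorbed into the learning rate, and all names are illustrative):

```python
import numpy as np

def kruskal_shepard_mds(D, k=2, n_iter=1000, lr=0.005, seed=0):
    """Gradient descent on sum_{i != j} (d_ij - ||z_i - z_j||)^2."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Z = rng.normal(scale=1e-2, size=(n, k))      # random low-dimensional start
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]     # z_i - z_j, shape (n, n, k)
        dist = np.linalg.norm(diff, axis=2)      # ||z_i - z_j||
        np.fill_diagonal(dist, 1.0)              # dummy value; diff is 0 there anyway
        # gradient w.r.t. z_i (up to a constant factor absorbed into lr):
        # -sum_j (d_ij - ||z_i - z_j||) * (z_i - z_j) / ||z_i - z_j||
        grad = (-2 * ((D - dist) / dist)[:, :, None] * diff).sum(axis=1)
        Z -= lr * grad
    return Z

# usage: recover a 2-D configuration from Euclidean pairwise distances
X = np.random.default_rng(1).normal(size=(30, 10))
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
Z = kruskal_shepard_mds(D)
```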

Page 20: Self-Organizing Maps

Other MDS approaches

• Sammon mapping minimizes

– Σi≠j (dij − ‖zi − zj‖)² / dij

• Classical scaling is based on a similarity measure sij

– Often the centered inner product sij = ⟨xi − x̄, xj − x̄⟩ is used

– Then, minimize Σi,j (sij − ⟨zi − z̄, zj − z̄⟩)²
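A sketch of classical scaling from a distance matrix, assuming the similarities are the centered inner products above; when D is Euclidean, double-centering −D²/2 recovers them exactly, and the top-k eigenvectors give the embedding (the function name is illustrative):

```python
import numpy as np

def classical_scaling(D, k=2):
    """Embed via the centered inner-product matrix recovered from distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # s_ij = <x_i - xbar, x_j - xbar> for Euclidean D
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:k]        # indices of the largest k eigenvalues
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))
```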