
CZ5225: Modeling and Simulation in Biology

Lecture 5: Clustering Analysis for Microarray Data III

Prof. Chen Yu Zong

Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS

Self-Organizing Maps

• Based on the work of Kohonen on learning/memory in the human brain

• As with k-means, the number of clusters needs to be specified

• Moreover, a topology also needs to be specified – a 2D grid that gives the geometric relationships between the clusters (i.e., which clusters should be near to or distant from each other)

• The algorithm learns a mapping from the high dimensional space of the data points onto the points of the 2D grid (there is one grid point for each cluster)

Self-Organizing Maps

• Creates a map in which similar patterns are plotted next to each other

• Data visualization technique that reduces n dimensions and displays similarities

• More complex than k-means or hierarchical clustering, but more meaningful

• Neural network technique – inspired by the brain

Self-Organizing Maps (SOM)

• Each unit of the SOM has a weighted connection to all inputs

• As the algorithm progresses, neighboring units are grouped by similarity

[Figure: the input layer is fully connected to the output layer of SOM units]

Biological Motivation

Nearby areas of the cortex correspond to related brain functions

Brain's Self-Organization

The brain maps the external multidimensional representation of the world into a similar 1- or 2-dimensional internal representation.

That is, the brain processes the external signals in a topology-preserving way.

Mimicking the way the brain learns, our system should be able to do the same thing.

A Self-Organized Map

Data: vectors X^T = (X1, ..., Xd) from d-dimensional space.

Grid of nodes, with local processor (called neuron) in each node.

Local processor # j has d adaptive parameters W(j).

Goal: change W(j) parameters to recover data clusters in X space.

SOM Network

• Unsupervised learning neural network

• Projects high-dimensional input data onto two-dimensional output map

• Preserves the topology of the input data

• Visualizes structures and clusters of the data

[Figure: an input layer with components 1-5 is fully connected to output units i and c through the weights wi1 ... wi5 and wc1 ... wc5]

SOM Algorithm

- The input vector is represented by the scalar signals x1 to xn: x = (x1, ..., xn).

- Every unit i in the competitive layer has a weight vector associated with it, represented by the variable parameters wi1 to win: wi = (wi1, ..., win).

- We compute the total input to each neurode by taking the weighted sum of the input signals:

  si = Σj wij xj,  j = 1, ..., n

- Every weight vector may be regarded as a kind of image that is matched or compared against the corresponding input vector; our aim is to devise adaptive processes in which the weights of all units converge to such values that every unit i becomes sensitive to a particular region of the input domain.

SOM Algorithm

- Geometrically, the weighted sum is simply a dot (scalar) product of the input vector and the weight vector:

  si = x · wi = x1 wi1 + ... + xn win
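As a small illustration (not from the slides), this matching step for all units at once is a single matrix-vector product; the array sizes below are arbitrary examples:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])   # one input vector, n = 4
W = np.random.rand(12, 4)            # one weight vector per unit (12 units here)

s = W @ x                            # s[i] = x . w_i = x1*wi1 + ... + xn*win
best = int(np.argmax(s))             # unit whose weight vector responds most strongly to x
print(best, s[best])
```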

SOM Algorithm

[Figure: a data array of input vectors is fed to a 2-D map of nodes (a 3x4 SOM); each node of the 3x4 SOM holds its own weight vector, and the map self-organizes]

Find the winner:  c = arg min_i || xk - mi(t) ||

Update the weights:  mi(t+1) = mi(t) + α(t) hci(t) [ xk - mi(t) ]

(α(t) is the learning rate and hci(t) the neighborhood function centered on the winner c.)

SOM Algorithm

• Learning Algorithm (a code sketch follows below)

1. Initialize the weights wj

2. Find the winning node:
   i(x) = arg min_j || x(n) - wj(n) ||

3. Update the weights of the winner and its neighbors:
   wj(n+1) = wj(n) + η(n) h_j,i(x)(n) [ x(n) - wj(n) ]

4. Reduce the neighborhood size and the learning rate η(n)

5. Go to 2
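A minimal NumPy sketch of this loop, assuming a rectangular grid and a Gaussian neighborhood that shrinks over time (the name train_som, the default grid size, and the decay schedules are illustrative choices, not part of the original slides):

```python
import numpy as np

def train_som(data, grid_rows=3, grid_cols=4, n_iter=2000,
              lr0=0.5, sigma0=1.5, seed=0):
    """Minimal SOM sketch: returns weights of shape (grid_rows, grid_cols, dim)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Step 1: initialize the weights randomly within the range of the data
    w = rng.uniform(data.min(), data.max(), size=(grid_rows, grid_cols, dim))
    # Grid coordinates of every unit, used by the neighborhood function
    grid = np.stack(np.meshgrid(np.arange(grid_rows), np.arange(grid_cols),
                                indexing="ij"), axis=-1).astype(float)
    for t in range(n_iter):                            # step 5: keep going back to step 2
        x = data[rng.integers(n)]                      # present one input vector
        # Step 2: find the winning node, arg min_j || x - w_j ||
        dist = np.linalg.norm(w - x, axis=-1)
        winner = np.unravel_index(np.argmin(dist), dist.shape)
        # Step 4 (schedules): the learning rate decays and the neighborhood shrinks
        lr = lr0 * (1.0 - t / n_iter)
        sigma = sigma0 * (1.0 - t / n_iter) + 0.01
        # Step 3: update the winner and its grid neighbors
        d2 = ((grid - np.array(winner)) ** 2).sum(axis=-1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))           # Gaussian neighborhood function
        w += lr * h[..., None] * (x - w)
    return w
```

The winner is pulled strongly toward the presented vector, while units that are nearby on the grid are pulled less; this is what makes the map topology-preserving.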

SOM Training Process

[Figure: x marks the data points in the N-dimensional data space, o marks the positions of the neuron weight vectors; the 2-D grid of neurons is stretched so that its weights point to points in the N-dimensional data space]

Nearest neighbor vectors are clustered into the same node

Concept of SOM

[Figure: the input space (input layer) is mapped onto a reduced feature space (map layer); the cluster centers (code vectors) in the input space and the place of these code vectors in the reduced space are shown]

Clustering and ordering of the cluster centers in a two-dimensional grid.

Concept of SOM

[Figure: an example map with labeled samples (Ba, Mn, Sr, Mg, SA3)]

The map can be used for visualization, for classification, or for clustering.

SOM Architecture

• The input is connected with each neuron of a lattice.
• The topology of the lattice allows one to define a neighborhood structure on the neurons, like those illustrated below.

[Figure: a 2D topology with two possible neighborhoods, and a 1D topology with a small neighborhood]

Self-Organizing Maps (SOMs)

Idea: Place genes onto a grid so that genes with similar patterns of expression are placed on nearby squares.

[Figure: example expression patterns a, b, c, d are placed onto grid squares A, B, C, D]


Self-Organizing Maps (SOMs)

Example expression data (16 genes, 6 time points):

          a_1hr  a_2hr  a_3hr  b_1hr  b_2hr  b_3hr
Gene 1      1      2      4      5      7      9
Gene 2      2      3      7      7      6      3
Gene 3      4      4      5      5      4      4
Gene 4      3      4      3      4      3      3
Gene 5      1      2      3      4      5      6
Gene 6      8      7      7      6      5      3
Gene 7      4      4      4      4      5      4
Gene 8      5      6      5      4      3      2
Gene 9      3      3      1      3      6      8
Gene 10     2      4      8      5      4      2
Gene 11     1      5      6      9      8      7
Gene 12     1      3      5      8      8      6
Gene 13     4      3      3      4      5      6
Gene 14     9      7      5      3      2      1
Gene 15     1      2      2      3      4      4
Gene 16     1      2      5      7      8      9

[Figure: the 16 genes are placed one by one onto a grid of nine nodes labeled A-I, shown at several stages of training]

Self-Organizing Maps (SOMs)

Final placement of the 16 genes on the grid of nodes A-I:

• Genes 1, 16, and 5
• Genes 6 and 14
• Genes 9 and 13
• Genes 4, 7, and 2
• Gene 3
• Gene 15
• Gene 8
• Gene 10
• Genes 11 and 12
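For illustration only, the sketch from the algorithm slide can be run on this small table; assuming the grid of nodes A-I is a 3 x 3 map, a run might look like the following (the exact groupings depend on the random initialization, so they will not necessarily match the placement shown on the slide):

```python
import numpy as np

# The 16 x 6 expression table from the earlier slide (rows: Gene 1 ... Gene 16)
expr = np.array([
    [1, 2, 4, 5, 7, 9], [2, 3, 7, 7, 6, 3], [4, 4, 5, 5, 4, 4], [3, 4, 3, 4, 3, 3],
    [1, 2, 3, 4, 5, 6], [8, 7, 7, 6, 5, 3], [4, 4, 4, 4, 5, 4], [5, 6, 5, 4, 3, 2],
    [3, 3, 1, 3, 6, 8], [2, 4, 8, 5, 4, 2], [1, 5, 6, 9, 8, 7], [1, 3, 5, 8, 8, 6],
    [4, 3, 3, 4, 5, 6], [9, 7, 5, 3, 2, 1], [1, 2, 2, 3, 4, 4], [1, 2, 5, 7, 8, 9],
], dtype=float)

w = train_som(expr, grid_rows=3, grid_cols=3, n_iter=5000)   # sketch defined earlier
for g, x in enumerate(expr, start=1):
    d = np.linalg.norm(w - x, axis=-1)                       # distance to every node
    r, c = np.unravel_index(np.argmin(d), d.shape)           # nearest node on the grid
    print(f"Gene {g} -> node {'ABCDEFGHI'[r * 3 + c]}")      # label the nodes A..I row by row
```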

Self-Organizing Maps

• Suppose we have an r x s grid with each grid point associated with a cluster mean μ(1,1), ..., μ(r,s)

• SOM algorithm moves the cluster means around in the high dimensional space, maintaining the topology specified by the 2D grid (think of a rubber sheet)

• A data point is put into the cluster with the closest mean

• The effect is that nearby data points tend to map to nearby clusters (grid points)
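A vectorized sketch of this assignment step (the helper name assign_to_grid and the row/column layout are illustrative; it reuses the weights produced by the train_som sketch above):

```python
import numpy as np

def assign_to_grid(data, w):
    """Map each data point to the (row, col) of the grid mean closest to it."""
    # Distances from every point to every grid mean: shape (n_points, rows, cols)
    d = np.linalg.norm(data[:, None, None, :] - w[None, :, :, :], axis=-1)
    flat = d.reshape(len(data), -1).argmin(axis=1)
    rows, cols = w.shape[0], w.shape[1]
    return np.stack(np.unravel_index(flat, (rows, cols)), axis=1)   # (n_points, 2)
```

Because the means stay tied to the 2D grid while they move through the high-dimensional space, points that land in neighboring cells tend to be similar, which is the "rubber sheet" effect described above.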

A Simple Example of a Self-Organizing Map

[Figure: a 4 x 3 SOM in which the mean of each cluster is displayed]

SOM Applied to Microarray Analysis

• Consider clustering 10,000 genes

• Each gene was measured in 4 experiments
  – Input vectors are 4-dimensional
  – The initial data set is 10,000 patterns, each described by a 4D vector

• Each of the 10,000 genes is chosen one at a time to train the SOM

SOM Applied to Microarray Analysis

• The pattern (unit) found to be closest to the current gene, as determined by the weight vectors, is selected as the winner

• The winner's weight vector is then modified to become more similar to the current gene, based on the learning rate (α(t) in the earlier update rule)

• The winner then pulls its neighbors closer to the current gene, causing a smaller change in their weights

• This process continues for all 10,000 genes

• The process is repeated while, over time, the learning rate is reduced to zero (see the sketch below)
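Putting the sketches together for this example, the whole procedure might look like the following; the 6 x 5 grid size follows the yeast example on the next slides, and the random matrix is only a stand-in for the real 10,000 x 4 expression data:

```python
import numpy as np

# Stand-in for the real data: 10,000 genes measured in 4 experiments
expr = np.random.rand(10000, 4)

# Normalize each gene to mean 0 and SD 1 across its measurements (as in the yeast example)
expr = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

# Train a 6 x 5 SOM (30 clusters) and assign every gene to its nearest grid node
w = train_som(expr, grid_rows=6, grid_cols=5, n_iter=50000)   # sketch from the algorithm slide
cells = assign_to_grid(expr, w)                               # sketch from the previous slide

# Count how many genes fall into each of the 30 clusters
counts = np.zeros((6, 5), dtype=int)
for r, c in cells:
    counts[r, c] += 1
print(counts)
```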

SOM Applied to Microarray Analysis of Yeast

• Yeast Cell Cycle SOM. www.pnas.org/cgi/content/full/96/6/2907

• (a) 6 × 5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on y-axis and time points on x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior. (b) Cluster 29 detail. Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in late G1. Normalized expression pattern of 30 genes nearest the centroid are shown. (c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to G1, S, G2 and M phases of the cell cycle, are shown.

SOM Applied to Microarray Analysis of Yeast

• Reduce the data set to 828 genes
• Cluster the data into 30 clusters using a SOFM

Each cluster is represented by its average (centroid) pattern.

Data within a cluster show the same behavior, and neighboring clusters exhibit similar behavior.

A SOFM Example With Yeast

Benefits of SOM

• SOM contains the set of features extracted from the input patterns (reduces dimensions)

• SOM yields a set of clusters

• A gene will always be more similar to the genes in its immediate neighborhood than to genes further away

Problems of SOM

• The algorithm is complicated and has a lot of parameters (such as the “learning rate”); these settings will affect the results

• The idea of a topology in high-dimensional gene expression spaces is not exactly obvious
  – How do we know what topologies are appropriate?
  – In practice people often choose nearly square grids for no particularly good reason

• As with k-means, we still have to worry about how many clusters to specify…

Comparison of SOM and K-means

• K-means is a simple yet effective algorithm for clustering data

• Self-organizing maps are slightly more computationally expensive than K-means, but they also capture the spatial relationships between the clusters (see the sketch below)
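A small comparison sketch, assuming scikit-learn is available and reusing the SOM sketches from the earlier slides (X is only a stand-in for an expression matrix): K-means returns unordered cluster labels, while the SOM additionally places each cluster at a grid position.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(2000, 4)                      # stand-in for an expression matrix

# K-means: 30 cluster labels with no spatial arrangement between the clusters
km_labels = KMeans(n_clusters=30, n_init=10, random_state=0).fit_predict(X)

# SOM (sketches from the earlier slides): 30 clusters laid out on a 6 x 5 grid,
# so every point also gets a (row, col) position and neighboring cells are similar
som_cells = assign_to_grid(X, train_som(X, grid_rows=6, grid_cols=5, n_iter=20000))
```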

Other Clustering Algorithms

• Clustering is a very popular method of microarray analysis and also a well-established statistical technique; there is a huge amount of literature out there

• Many variations on k-means, including algorithms in which clusters can be split and merged or that allow for soft assignments (multiple clusters can contribute)

• Semi-supervised clustering methods, in which some examples are assigned by hand to clusters and then other membership information is inferred