Post on 22-Dec-2015

Tutorial 8

Clustering

Clustering

• General Methods
  – Unsupervised Clustering
    • Hierarchical clustering
    • K-means clustering
• Expression data
  – GEO
  – UCSC
  – ArrayExpress
• Tools
  – EPCLUST
  – MeV

Microarray - Reminder

Expression Data Matrix

• Each column represents all the gene expression levels from a single experiment.

• Each row represents the expression of a gene across all experiments.

        Exp1  Exp2  Exp3  Exp4  Exp5  Exp6
Gene 1  -1.2  -2.1  -3    -1.5   1.8   2.9
Gene 2   2.7   0.2  -1.1   1.6  -2.2  -1.7
Gene 3  -2.5   1.5  -0.1  -1.1  -1     0.1
Gene 4   2.9   2.6   2.5  -2.3  -0.1  -2.3
Gene 5   0.1   2.6   2.2   2.7  -2.1
Gene 6  -2.9  -1.9  -2.4  -0.1  -1.9   2.9

Expression Data Matrix

Each element is a log ratio: log2(T/R).

T - the gene expression level in the test sample

R - the gene expression level in the reference sample
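As a minimal sketch of the definition above, the log ratio of one matrix element can be computed as follows. The function name `log_ratio` and the intensity values are illustrative assumptions, not part of the tutorial data.

```python
import math


def log_ratio(test, reference):
    """Return log2(T/R) for one gene in one experiment."""
    if test <= 0 or reference <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(test / reference)


# Hypothetical expression levels, chosen only to illustrate the sign.
print(log_ratio(800.0, 100.0))  # T > R gives a positive value (3.0)
print(log_ratio(100.0, 100.0))  # T = R gives zero (0.0)
print(log_ratio(50.0, 200.0))   # T < R gives a negative value (-2.0)
```

A convenient property of the log2 scale is its symmetry: an eight-fold increase gives 3 and an eight-fold decrease gives -3.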

Microarray Data Matrix

Black indicates a log ratio of zero, i.e. T ≈ R

Green indicates a negative log ratio, i.e. T < R

Red indicates a positive log ratio, i.e. T > R

Grey indicates missing data
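The colour convention can be written as a small lookup. This is only a sketch: the function name `cell_color` and the `tolerance` parameter (how close to zero counts as T ≈ R) are assumptions, and `None` is used here to stand for a missing value. Real heat maps usually also vary the colour intensity with the size of the log ratio, which this sketch ignores.

```python
def cell_color(log_ratio, tolerance=0.05):
    """Map one matrix element to the colour used in the data matrix image."""
    if log_ratio is None:
        return "grey"   # missing data
    if abs(log_ratio) <= tolerance:
        return "black"  # T is approximately equal to R
    return "red" if log_ratio > 0 else "green"


row = [-1.2, 0.0, 2.9, None]  # illustrative values
print([cell_color(value) for value in row])
# ['green', 'black', 'red', 'grey']
```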

Microarray Data: Different representations

[Figure: log ratio plotted against experiment (Exp), with regions marked T<R and T>R]

A real example

~500 genes, 3 knockdown conditions

Too complicated to analyze without “help”

Microarray Data: Clusters

How to determine the similarity between two genes? (for clustering)

Patrik D'haeseleer, How does gene expression clustering work?, Nature Biotechnology 23, 1499-1501 (2005), http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
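The slides pose the question without naming a measure. As an illustration, the sketch below uses two measures that are widely used for expression profiles: Euclidean distance and a distance based on Pearson correlation. Choosing these two is an assumption made here. The profiles are Gene 1 and Gene 2 from the expression data matrix; `scaled` is a made-up profile with the same shape as Gene 1 but twice the amplitude.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to the magnitude of the values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Correlation of two profiles; depends on their shape, not their scale."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pearson_distance(a, b):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated."""
    return 1.0 - pearson_correlation(a, b)


gene1 = [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9]
gene2 = [2.7, 0.2, -1.1, 1.6, -2.2, -1.7]
scaled = [2 * v for v in gene1]  # hypothetical: same pattern, larger changes

print(euclidean_distance(gene1, gene2), pearson_distance(gene1, gene2))
print(euclidean_distance(gene1, scaled), pearson_distance(gene1, scaled))
```

The second line of output shows why the choice matters: `scaled` is far from Gene 1 in Euclidean terms, yet its Pearson distance is essentially zero because the two profiles rise and fall together.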

Unsupervised Clustering

Hierarchical Clustering

Genes with similar expression patterns are grouped together and are connected by a series of branches (a dendrogram).

[Figure: dendrogram; leaf labels 1 6 3 5 2 4]

Leaves (shapes in our case) represent genes and the length of the paths between leaves represents the distances between genes.

Hierarchical Clustering

If we want a certain number of clusters, we need to cut the tree at the level that yields that number (in this case, four).

Hierarchical clustering finds an entire hierarchy of clusters.
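A minimal from-scratch sketch of the bottom-up (agglomerative) form of this idea follows. Several choices are assumptions made for illustration and are not stated in the slides: Euclidean distance, average linkage, and the function names. Stopping the merging once `n_clusters` groups remain is equivalent to cutting the tree at the level that has that many branches. The data are the complete rows of the expression data matrix (Gene 5 is left out because one of its values is missing).

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(data, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the clusters and the merge history (what a dendrogram draws).
    """
    if not 1 <= n_clusters <= len(data):
        raise ValueError("n_clusters must be between 1 and the number of items")
    clusters = [[name] for name in data]  # start: every gene is its own cluster
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], data)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


genes = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    "Gene 3": [-2.5, 1.5, -0.1, -1.1, -1.0, 0.1],
    "Gene 4": [2.9, 2.6, 2.5, -2.3, -0.1, -2.3],
    "Gene 6": [-2.9, -1.9, -2.4, -0.1, -1.9, 2.9],
}

clusters, merges = agglomerate(genes, n_clusters=2)
print(clusters)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

In practice a library routine would normally be used instead (SciPy, for example, provides `scipy.cluster.hierarchy.linkage` and `fcluster`); the loop above is only meant to make the merging steps visible.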

Hierarchical clustering result

Five clusters

K-means Clustering

An algorithm that partitions the data into K groups.

K=4

How does it work?

The algorithm iteratively divides the genes into K groups and recalculates the center of each group. The result is a set of K groups in which every gene is assigned to its nearest center; this is a local optimum that can depend on the initial centers.

1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).

2. k clusters are created by associating every observation with the nearest mean.

3. The centroid of each of the k clusters becomes the new mean.

4. Steps 2 and 3 are repeated until convergence has been reached.
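The four steps can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the code used by any of the tools in this tutorial: Euclidean distance, a fixed random `seed` for reproducibility, a `max_iter` safety limit, and keeping the old mean when a cluster becomes empty are all choices made here.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k initial means at random from the data set.
    means = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, means[c])) for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes its new mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = centroid(members)
    return assignment, means


# Complete rows of the expression data matrix (Genes 1, 2, 3, 4 and 6).
profiles = [
    [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    [-2.5, 1.5, -0.1, -1.1, -1.0, 0.1],
    [2.9, 2.6, 2.5, -2.3, -0.1, -2.3],
    [-2.9, -1.9, -2.4, -0.1, -1.9, 2.9],
]
assignment, means = k_means(profiles, k=2)
print(assignment)
```

Because the starting means are random, different seeds can give different groupings on real data; running the algorithm several times and comparing the results is a common precaution.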

Different types of clustering – different results

How to search for expression profiles

• GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/

• Human genome browser: http://genome.ucsc.edu/

• ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

Datasets - suitable for analysis with GEO tools

Expression profiles by gene

Microarray experiments

Probe sets

Groups of related microarray experiments

Searching for expression profiles in the GEO

Download dataset

Clustering

Statistical analysis

Clustering analysis

The expression distribution for different lines in the cluster

Searching for expression profiles in the Human Genome browser.

Keratin 10 is highly expressed in skin

http://www.ebi.ac.uk/arrayexpress/

ArrayExpress

What can we do with all the expression profiles?

Clusters!

How?

EPCLUST

http://www.bioinf.ebc.ee/EP/EP/EPCLUST/

Edit the input matrix: Transpose, Normalize, Randomize

Hierarchical clustering

K-means clustering

In the input matrix, each column should represent a gene and each row should represent an experiment (or individual).
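To make the three matrix edits concrete, here is a sketch of what such operations could look like. The exact definitions EPCLUST uses are not given in the slides, so the behaviour below is an assumption: "normalize" is taken to mean scaling each row to mean 0 and standard deviation 1, and "randomize" to mean shuffling the values within each row. All function names are hypothetical.

```python
import random
import statistics


def transpose(matrix):
    """Swap rows and columns, e.g. genes-by-experiments to experiments-by-genes."""
    return [list(column) for column in zip(*matrix)]


def normalize_rows(matrix):
    """Scale every row to mean 0 and standard deviation 1 (assumed definition)."""
    result = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.pstdev(row)
        result.append([(v - mean) / sd if sd else 0.0 for v in row])
    return result


def randomize_rows(matrix, seed=0):
    """Shuffle the values inside each row (assumed definition)."""
    rng = random.Random(seed)
    return [rng.sample(row, len(row)) for row in matrix]


# Rows are genes, columns are experiments (Genes 1 and 2 of the data matrix).
by_gene = [
    [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
]
by_experiment = transpose(by_gene)  # now each row is an experiment
print(len(by_experiment), "rows x", len(by_experiment[0]), "columns")
print(normalize_rows(by_gene)[0])
```

Transposing is the step that matters for the orientation requirement above: a matrix stored with one gene per row becomes one experiment per row.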

Clusters

Data

Graphical representation of the cluster

Samples found in cluster

10 clusters, as requested

http://www.tm4.org/mev/

MultiExperiment Viewer (MeV)