
Tutorial 8

Clustering

Clustering

• General Methods
  – Unsupervised Clustering
    • Hierarchical clustering
    • K-means clustering

• Expression data
  – GEO
  – UCSC
  – ArrayExpress

• Tools
  – EPCLUST
  – MeV

Microarray - Reminder

Expression Data Matrix

• Each column represents all the gene expression levels from a single experiment.

• Each row represents the expression of a gene across all experiments.

Exp1 Exp2 Exp3 Exp4 Exp5 Exp6
Gene 1 -1.2 -2.1 -3 -1.5 1.8 2.9
Gene 2 2.7 0.2 -1.1 1.6 -2.2 -1.7
Gene 3 -2.5 1.5 -0.1 -1.1 -1 0.1
Gene 4 2.9 2.6 2.5 -2.3 -0.1 -2.3
Gene 5 0.1 2.6 2.2 2.7 -2.1
Gene 6 -2.9 -1.9 -2.4 -0.1 -1.9 2.9

Expression Data Matrix

Each element is a log ratio: log2(T/R).

T - the gene expression level in the testing sample

R - the gene expression level in the reference sample
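The log ratio can be sketched in a few lines of Python. This is only an illustration: the function name `log_ratio` and the intensity values are hypothetical and do not come from the tutorial's data.

```python
import math

def log_ratio(test_level, reference_level):
    """Return log2(T/R) for positive expression levels T and R."""
    if test_level <= 0 or reference_level <= 0:
        raise ValueError("expression levels must be positive")
    return math.log2(test_level / reference_level)

# Hypothetical intensities, chosen only for illustration.
print(log_ratio(800.0, 100.0))  # T is 8 times R   -> 3.0
print(log_ratio(100.0, 100.0))  # T equals R       -> 0.0
print(log_ratio(50.0, 200.0))   # T is 1/4 of R    -> -2.0
```

Using a log ratio makes up- and down-regulation symmetric: a fourfold increase gives +2 and a fourfold decrease gives -2.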

Microarray Data Matrix

• Black indicates a log ratio of zero, i.e. T = R

• Green indicates a negative log ratio, i.e. T < R

• Red indicates a positive log ratio, i.e. T > R

• Grey indicates missing data
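The color legend can be written as a small lookup. This is a minimal sketch: the function name `heatmap_color` and the use of `None` for a missing value are assumptions made for illustration, and a real heat map would typically also vary the shade with the size of the ratio.

```python
def heatmap_color(value):
    """Map a log ratio (None = missing data) to the legend's color name."""
    if value is None:
        return "grey"    # missing data
    if value > 0:
        return "red"     # T > R
    if value < 0:
        return "green"   # T < R
    return "black"       # T = R, log ratio of zero

# One hypothetical row with a missing measurement.
row = [0.1, 2.6, None, -2.1, 0.0]
print([heatmap_color(v) for v in row])
# -> ['red', 'red', 'grey', 'green', 'black']
```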

Microarray Data: Different representations

A real example

~500 genes, 3 knockdown conditions

Too complicated to analyze without “help”

Microarray Data: Clusters

How to determine the similarity between two genes? (for clustering)

Patrik D'haeseleer, How does gene expression clustering work?, Nature Biotechnology 23, 1499-1501 (2005), http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
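Two widely used answers to this question are Euclidean distance and Pearson correlation. The slides do not say which measure they use, so the sketch below is an assumption offered for illustration; the function names are hypothetical, and the two profiles are Gene 1 and Gene 2 from the expression matrix above.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Correlation of two profiles: +1 same shape, -1 opposite shape."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

gene1 = [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9]
gene2 = [2.7, 0.2, -1.1, 1.6, -2.2, -1.7]

print(euclidean_distance(gene1, gene2))
print(pearson_correlation(gene1, gene2))  # negative: the profiles move in opposite directions
```

Euclidean distance is sensitive to the magnitude of the values, while correlation compares only the shape of the profiles, so the two measures can group the same genes differently.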

Unsupervised Clustering

Hierarchical Clustering

Genes with similar expression patterns are grouped together and are connected by a series of branches (a dendrogram).

Leaves (shapes in our case) represent genes and the length of the paths between leaves represents the distances between genes.

Hierarchical Clustering

If we want a certain number of clusters, we need to cut the tree at the level that yields that number (in this case, four).

Hierarchical clustering finds an entire hierarchy of clusters.
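A bottom-up (agglomerative) version of this idea can be sketched as follows. The slides do not specify a linkage rule or distance, so average linkage with Euclidean distance is an assumption here; the function names and the five gene profiles are hypothetical. Stopping the merging when a given number of clusters remains corresponds to cutting the tree at that level.

```python
import math
from itertools import combinations

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    distances = [math.dist(profiles[i], profiles[j])
                 for i in cluster_a for j in cluster_b]
    return sum(distances) / len(distances)

def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    if not 1 <= n_clusters <= len(profiles):
        raise ValueError("n_clusters must be between 1 and the number of genes")
    clusters = [[i] for i in range(len(profiles))]  # every gene starts alone
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], profiles),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters

# Hypothetical log-ratio profiles (not taken from the tutorial's data).
names = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
profiles = [
    [2.0, 2.1, -1.0],
    [2.2, 1.9, -1.1],
    [-2.0, -1.8, 1.0],
    [-2.1, -2.0, 1.2],
    [0.0, 3.0, 3.0],
]

for cluster in agglomerate(profiles, 3):
    print([names[i] for i in cluster])
```

Recording the order of the merges and the distance at each merge would give the full dendrogram; this sketch keeps only the clusters at the chosen cut.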

Hierarchical clustering result

Five clusters

K-means Clustering

An algorithm to classify the data into K groups.

How does it work?

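The algorithm iteratively divides the genes into K groups and recalculates the center of each group. The result is a partition into K clusters in which each gene belongs to the cluster with the nearest center. The partition is a local optimum and can depend on the initial centers.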

1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).

2. k clusters are created by associating every observation with the nearest mean.

3. The centroid of each of the k clusters becomes the new mean.

4. Steps 2 and 3 are repeated until convergence has been reached.
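The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the sample points are all assumptions, and Euclidean distance is assumed as the notion of "nearest".

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Return (assignment, means) after running the four k-means steps."""
    rng = random.Random(seed)
    # Step 1: pick k initial means at random from the data set.
    means = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c]))
            for p in points
        ]
        # Step 4: stop once the assignment no longer changes (convergence).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes the new mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous mean
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means

# Hypothetical two-dimensional observations forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, means = kmeans(points, k=2)
print(assignment)
print(means)
```

Because the starting means are random, different seeds can label the clusters differently, and on less clearly separated data they can also produce different partitions.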

Different types of clustering – different results

How to search for expression profiles

• GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/

• Human genome browser: http://genome.ucsc.edu/

• ArrayExpress: http://www.ebi.ac.uk/arrayexpress/

Datasets - suitable for analysis with GEO tools

Expression profiles by gene

Microarray experiments

Probe sets

Groups of related microarray experiments

Searching for expression profiles in the GEO

Download dataset

Clustering

Statistical analysis

Clustering analysis

The expression distribution for different lines in the cluster

Searching for expression profiles in the Human Genome browser.

Keratin 10 is highly expressed in skin

http://www.ebi.ac.uk/arrayexpress/

ArrayExpress

What can we do with all the expression profiles?

Clusters!

EPCLUST

http://www.bioinf.ebc.ee/EP/EP/EPCLUST/

Edit the input matrix: Transpose, Normalize, Randomize

Hierarchical clustering

K-means clustering

In the input matrix, each column should represent a gene and each row should represent an experiment (or individual).
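If a matrix is stored the other way around (genes in rows, experiments in columns, as in the expression data matrix earlier), it has to be transposed first. The sketch below shows a transpose and one possible normalization. The slides do not describe how EPCLUST normalizes, so the z-score used here (mean 0, standard deviation 1 per row) is an assumption, and both function names are hypothetical.

```python
import statistics

def transpose(matrix):
    """Swap rows and columns of a rectangular matrix."""
    return [list(column) for column in zip(*matrix)]

def normalize_rows(matrix):
    """Rescale each row to mean 0 and standard deviation 1 (z-score)."""
    normalized = []
    for row in matrix:
        mean = statistics.fmean(row)
        spread = statistics.pstdev(row)
        # A constant row has zero spread; map it to all zeros.
        normalized.append([(v - mean) / spread if spread > 0 else 0.0
                           for v in row])
    return normalized

# Two genes (rows) measured in three experiments (columns).
genes_by_experiments = [
    [-1.2, -2.1, -3.0],
    [2.7, 0.2, -1.1],
]
experiments_by_genes = transpose(genes_by_experiments)
print(experiments_by_genes)
# -> [[-1.2, 2.7], [-2.1, 0.2], [-3.0, -1.1]]
print(normalize_rows(genes_by_experiments))
```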

Clusters

Graphical representation of the cluster

Samples found in cluster

10 clusters, as requested

http://www.tm4.org/mev/

MultiExperiment Viewer (MeV)
