Bioinformatics : Transcriptomics II (BioInformatique : Transcriptomique II), ULP 2010

Wolfgang Raffelsberger, Olivier Poch

Laboratoire de Bio-Informatique et Génomique Intégrative, IGBMC, Strasbourg


Summary : Principal Steps in Transcriptomics (DNA microarrays)

© Wolfgang Raffelsberger, IGBMC, 2008

Biological Question
→ mRNA, QC
→ hybridize labelled cDNA on microarray
→ fluorescence imaging
→ (Summarize,) Normalize
→ Filter informative / non-informative genes
→ Clustering ; Statistical inference testing

Quality Control

Steps and Options in Data Analysis

Biological Question

Generation of Gene Raw Expression Data
  A     B
  50    100
  70    40
  ...   …

Data Preprocessing : Eliminate Bias
Normalization (Scaling) : linear, non-linear (LOWESS)

Inferential Statistics : assign confidence to discovery of regulated genes
  Parametric Tests : t-Test, ANOVA
  Non-Parametric Tests : Wilcoxon, Mann-Whitney

Descriptive Statistics : exploratory techniques
  Define Similarity Measure : Pearson coefficient of correlation, Euclidean distance
  Unsupervised (Discovery) : Clustering (hierarchical, k-means, SOM), PCA, Mult.Dim.Sca. (MDS)
  Supervised : Classification : support vector machines (SVM), linear discriminant analysis (LDA), decision trees

Nicolas Wicker

Statistical Tests for Differential Gene Expression: Nicolas Wicker

• Gaussian distribution ? → Kolmogorov test

Gaussian :
▫ t-Test with correction
  - Bonferroni (for n times H0 hypothesis) : extremely stringent !
  - FDR : step-up procedure (Benjamini & Hochberg 1995) : less stringent
▫ Multiple Groups : ANOVA, F-statistics

non-Gaussian :
▫ Wilcoxon (-Mann-Whitney)
▫ Permutation Tests (Significance Analysis of Microarrays) : 2 versions
  - Efron (2000) : weak control of PFER
  - Tusher (2001) : strong control of PFER

• Adjusted p-Values (Westfall & Young 1993)
  Test level (size) not to be determined in advance
  Step-down procedure, less conservative than Bonferroni, Holm or Hochberg
• Neighborhood analysis (Golub et al) : weak control of FWER
• SVM, Bayesian approaches

Error rates :
PFER : Per-Family Error Rate
PCER : Per-Comparison Error Rate
FWER : Family-Wise Error Rate
FDR : False Discovery Rate
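The difference in stringency between the Bonferroni correction and the Benjamini & Hochberg step-up procedure can be illustrated with a minimal sketch. The five p-values are hypothetical and the function names are chosen for this example only; real analyses would normally rely on an established statistics package.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests (capped at 1)."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment (controls the FDR)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes (illustration only).
pvals = [0.001, 0.008, 0.012, 0.041, 0.20]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

n_bonf = sum(p < 0.05 for p in bonf)  # genes retained after Bonferroni
n_bh = sum(p < 0.05 for p in bh)      # genes retained after Benjamini-Hochberg
print(n_bonf, n_bh)
```

With these toy values the FDR procedure retains one gene more than Bonferroni at the 0.05 level, which is the sense in which it is "less stringent".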

Myth
“Microarray investigations are unstructured data-mining adventures without clear objectives”

Solution
• Good microarray studies have clear objectives, avoiding gene-specific mechanistic hypotheses
• Design and Analysis methods tailored to study objectives

Data Filtering

• Numerous algorithms require data reduction : CPU and RAM usage increases roughly exponentially with the number of genes processed
• Non-DE (not differentially expressed) genes decrease the statistical sensitivity of detection
• However : the mass of constant genes contains information about the noisiness of the system, background ...
• Filtering to remove poor and irrelevant data

Filters : low-intensity cutoff, genes with low inter-sample variability, replicate var. trimming, intensity-dependent z-score cutoffs, Flags or Calls, dedicated tools & packages
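Two of the filters listed on this slide (low-intensity cutoff and low inter-sample variability) can be sketched as follows. The gene names, the log2 intensities and both cutoff values are hypothetical; suitable thresholds depend on the platform and on the noise level of the experiment.

```python
from statistics import mean, pstdev


def filter_genes(expr, min_intensity, min_sd):
    """Keep genes whose mean intensity and inter-sample SD both reach the cutoffs."""
    kept = {}
    for gene, values in expr.items():
        if mean(values) >= min_intensity and pstdev(values) >= min_sd:
            kept[gene] = values
    return kept


# Hypothetical log2 intensities across four samples.
expr = {
    "geneA": [6.1, 6.0, 9.8, 10.1],     # expressed and variable
    "geneB": [3.0, 3.2, 2.9, 3.1],      # close to background
    "geneC": [11.0, 11.1, 10.9, 11.0],  # expressed but constant
}

informative = filter_genes(expr, min_intensity=5.0, min_sd=0.5)
print(sorted(informative))
```

Only geneA passes both cutoffs here. The constant genes are removed from the test set, but as noted above they may still be worth keeping aside to estimate noise and background.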

Steps and Options in Data Analysis

Biological Question

Generation of Gene Raw Expression Data
  A     B
  50    100
  70    40
  ...   …

Data Preprocessing : Eliminate Bias, Filtering
Normalization (Scaling) : linear, non-linear (LOWESS)

Inferential Statistics : Parametric Tests, Non-Parametric Tests

Descriptive Statistics : Define Similarity Measure ; Unsupervised Clustering ; Supervised : Classification

Statistical Tests and Clustering Pipeline

platforms & tools :
• raw data processing
• filtering
• Statistical Tests : differential expression, outlier values
• Clustering : differential expression
• Functional Enrichment of Groups / Clusters

What is Clustering ?

Basic Definition of “Cluster”
“1) close group of similar things growing together
2) close group of people, animals, stars, ... etc.”
Oxford Encyclopedic Dictionary

• Arrange a population of elements based on their properties

Bioinformatics Context
• Arrange genes in a way where similarly behaving genes get placed close/next to each other
  - gene expression patterns
  - gene location on a genome
  - based on functional domains or protein-interactions

Hierarchical Cluster Analysis

Data : Measure all distances between individual points

Descending (divisive)
Initially : 1 class
Create new class for most distant (i+1)

Ascending (agglomerative)
Initially : n classes
Combine the closest to one class (i-1)

Build Dendrogram by branching off classes

[Figure : dendrogram with leaf order 5 1 2 3 4, equivalent to leaf order 2 1 5 3 4]
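The ascending procedure (start with n classes, repeatedly combine the two closest) can be sketched in a few lines. This is a naive illustration, assuming Euclidean distance and single linkage; the five points are hypothetical, and the recorded merge history is the information a dendrogram would draw.

```python
import math


def agglomerative(points, n_clusters):
    """Ascending clustering: one class per point at the start, then merge the closest pair."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # merge history: (members_a, members_b, distance)

    def single_link(a, b):
        # Smallest distance between a member of a and a member of b.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_link(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges


# Hypothetical 2-D observations.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
```

The search over all cluster pairs at every step makes this sketch slow for large data sets, which is one reason why filtering before clustering is common.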

Hierarchical Cluster Analysis

Metrics for measuring distances : Euclidean distance, Correlation (Pearson), Spearman rank, …

Linkage (how the cluster-to-cluster distance is derived from the members of one cluster and the members of the other cluster) :

• Single Linkage : minimum distance between a member of one cluster and a member of the other cluster

• Complete Linkage : maximum distance between a member of one cluster and a member of the other cluster

• Average Linkage : average of all distances between members of one cluster and members of the other cluster

• Centroid Linkage : distance between the centroids (mean points) of the two clusters
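The four linkage rules can be compared on a small example. The sketch below assumes Euclidean distance and uses two hypothetical clusters of two points each; the function names are illustrative.

```python
import math


def pairwise(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p in a for q in b]


def centroid(cluster):
    """Mean point of a cluster."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))


def single_linkage(a, b):
    return min(pairwise(a, b))


def complete_linkage(a, b):
    return max(pairwise(a, b))


def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)


def centroid_linkage(a, b):
    return math.dist(centroid(a), centroid(b))


# Two hypothetical clusters.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]

for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, round(rule(a, b), 3))
```

On this example the average linkage distance lies between the single and complete values, while the centroid distance happens to coincide with the single linkage distance; in general the four rules give different merge orders.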


Hierarchical Clustering : Average Linkage

Different metrics available for measuring distance : Euclidean distance, Correlation (Pearson), Spearman rank, …
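The choice of metric matters because the three measures answer different questions. The following sketch uses only the Python standard library and three hypothetical expression profiles: gene1 and gene2 have the same shape at different amplitude, gene3 rises monotonically but not linearly.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks starting at 1; tied values receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical profiles over four conditions.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [10.0, 20.0, 30.0, 40.0]
gene3 = [1.0, 2.0, 4.0, 8.0]

print(euclidean(gene1, gene2), pearson(gene1, gene2))
print(pearson(gene1, gene3), spearman(gene1, gene3))
```

Euclidean distance separates gene1 and gene2 strongly although their Pearson correlation is 1, and the rank-based measure treats gene1 and gene3 as identical in behaviour while Pearson does not.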

[Figure : time-course samples labelled 1 h, 6 h, 72 h]

Self-Organizing Maps (SOMs) or Topological Maps (Kohonen)

Partitioning Method : divide into k classes and distribute (optimized) following a predefined criterion (grid)

n observations, K classes : $K^n / K!$ partitions

Example for a 3 x 2 geometry :
[Figure : data points, initial grid (rectangular), node migration (hypothetical), final nodes 1 to 6 ; wr, modified from Tamayo et al, 1999]
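The partition count quoted on this slide can be checked numerically. The sketch below compares it with the exact number of ways to split n observations into K non-empty classes (the Stirling number of the second kind, named here for reference; the slide itself does not use that term). The function names are illustrative.

```python
from math import comb, factorial


def stirling2(n, k):
    """Exact number of ways to split n observations into k non-empty classes."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)


def approx_partitions(n, k):
    """The estimate K^n / K! given on the slide."""
    return k ** n / factorial(k)


print(stirling2(10, 3), approx_partitions(10, 3))
print(approx_partitions(100, 5))
```

For 10 observations and 3 classes the exact count is 9330 against an estimate of 9841.5, so the formula is a good approximation once n is much larger than K. Already at 100 observations and 5 classes the number exceeds 10^67, which is why partitioning methods optimize iteratively instead of enumerating all partitions.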

Graphical Cluster Profile Representation

• Cluster Profile (option Box-Plot)

• all individual genes

[Figure : Cho, 1998 ; Yeast cell cycle ; n = 3000 ; k-Means]
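The k-Means partitioning named on this slide can be sketched with the usual alternation of an assignment step and an update step. This is a simplified illustration, not the procedure used for the yeast data: the first k points serve as starting centres (real implementations usually pick random or optimized starts), and the six two-condition profiles are hypothetical.

```python
import math


def kmeans(points, k, n_iter=100):
    """Basic k-means; the first k points are used as starting centres (a simplification)."""
    centres = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centres


# Hypothetical expression profiles (two conditions per gene).
profiles = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centres = kmeans(profiles, k=2)
print(labels)
```

The mean profile of each class (the centre) corresponds to the "Cluster Profile" view, and the members of each class to the "all individual genes" view.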

More Options in Clustering

• Determination of number of clusters
  - not all data must be forced in clusters !
  - apply stop-rule during clustering
  - Principal Component Analysis to determine number of clusters

• Clustering methods capable of analyzing :
  - non Gaussian distributions
  - clustering qualitative ‘data’ (e.g. domains, literature-citations … )

Principal Component Analysis : Example

[Figure : scatter plot of x2 against x1 with corr (x1, x2) = 0.90, and the rotated axes y1 and y2]

- Search the axis along which the data show the highest dispersion (variance)

- Define this as (First) Principal Component (y1 captures the major fraction (~90%) of the dispersion)

- Take 2nd axis orthogonal (i.e. 0 correlation)

Results of PCA : small set of orthogonal (uncorrelated) variables containing most of the variance
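For two variables the principal components can be written in closed form from the 2 x 2 covariance matrix. The sketch below assumes both variables are standardized (unit variance), so that the covariance equals the correlation of 0.90 from the example; under that assumption the first component carries (1 + 0.90) / 2 = 95% of the total variance, of the same order as the rough figure quoted on the slide.

```python
import math


def pca_2d(var1, var2, cov12):
    """Eigen-decomposition of the covariance matrix [[var1, cov12], [cov12, var2]]."""
    trace = var1 + var2
    det = var1 * var2 - cov12 ** 2
    root = math.sqrt(trace ** 2 / 4 - det)
    lam1 = trace / 2 + root  # variance along the first principal component
    lam2 = trace / 2 - root  # variance along the second (orthogonal) component
    # Direction of the first component, as an angle from the x1 axis.
    angle = 0.5 * math.atan2(2 * cov12, var1 - var2)
    return lam1, lam2, angle


# Assumption: standardized x1 and x2 with correlation 0.90.
lam1, lam2, angle = pca_2d(1.0, 1.0, 0.90)
explained = lam1 / (lam1 + lam2)
print(round(explained, 3), round(math.degrees(angle), 1))
```

The first axis lies at 45 degrees between x1 and x2, and the second component, orthogonal to it, carries the remaining 5% of the variance.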

Comparison : Array-data and real-time PCR

[Figure : x-fold gene induction measured by array and by real-time PCR for Myc, Bfl1, Lsp1, TGM2, Elan and G1P3 ; samples : NB4, untreated, A ; NB4, untreated, B ; NB4, 18h atRA, C]

12 additional genes showed no changes by both methods

Raffelsberger et al, Genomics 2002

Accuracy and Precision

[Figure : two examples, "good precision, low accuracy" and "low precision, good accuracy"]

Precision : reproducibility
  check by replicates
  poor precision results from : poor technique

Accuracy : correctness
  check by different method
  poor accuracy results from : procedural or equipment flaws

Web Resources : General Information

Brown Lab Guide : http://cmgm.stanford.edu/pbrown/
  Microarrays protocols and arrayer construction.

DeRisi Lab : http://www.microarrays.org/
  Software : ArrayMaker, ScanAlyze, Cluster, Treeview

SMD guide : http://genome-www5.stanford.edu/
  Stanford Microarray Database, links

NIH microarray project : http://research.nhgri.nih.gov/microarray/main.html
  Protocols, software (ArraySuite)

EBI : http://www.ebi.ac.uk/microarray/
  Protocols, software

TIGR : http://www.tigr.org/tdb/microarray/
  Protocols, software

IGBMC : http://www-microarrays.u-strasbg.fr/
  General information, IGBMC related details

gene-chips : http://www.gene-chips.com/
  Software ArrayTrack (integrated software system for managing, mining, and visualizing microarray gene expression data)

Davidson College Genomics : http://www.bio.davidson.edu/
  General information, flash animation

Web Resources : Data Analysis Tools

Eisen Lab : Michael Eisen's suite for image quantitation and data analysis (Scanalyze, Cluster, TreeView).

Expression Profiler : Online clustering and analysis tools (EBI)

GenEx : Database, repository and analysis tools (NCGR)

MAExplorer : MicroArray Explorer for data mining Gene Expression

TM4 : http://www.tm4.org/
  TIGR Microarray Data Analysis System (MIDAS) : data quality filtering and normalization tool for data processing (various normalizations, filters, and transformations), analysis pipeline

MAXD : Downloadable data warehouse and visualisation for expression data

BioInformatics TU Graz : http://genome.tugraz.at/
  Analysis tools, Genesis

Jexpress : Java tools for gene expression data analysis


Steps and Options in Data Analysis


Biological Question, Analysis, Interpretation, Verification …


Transcriptomics in Biology: Jan-Oct 2009

Transcriptomics in Biology : 01 Sept - 06 Oct. 2008 !!!

Molecular pathways : induced/non induced by mutation, drug, disease, infection…
Cell cycle, apoptosis, antigen presentation, immune response, tolerance to hypoxia, TF pathways (NR, Homeobox, Map kinase pathways…), bacterial growth, immunology & malaria, hypoxia & cancer, prefrontal & alcoholism, virus & macrophage survival, biological clocks & drugs, superoxide dismutase & brain, neuroprotection, neo-vascularisation, plant resistance to virus or fungi

Molecular Diagnosis : disease/control cells, population
Cancer : prostate, gastric, lung, breast, colorectal, lymphoma, ovarian, brain
Pathways : Map kinase, cell channel
Susceptibility & Risk evaluation : obesity, cancer, autism, psoriasis

Progression/Prognostic : affected/control, time series, population
Particular genotype implication in progression/survival : cancer (metastasis), disease (autism, schizophrenia, Alzheimer, Hodgkin)
Susceptibility & Risk evaluation : cancer, alcoholism, graft rejection, allograft rejection, blood cell transplantation, retroviral integration, neuroplasticity (cocaine)

Development : time series
Organs : brain, liver, hepatocyte, spermatogenesis, kidney
Biological clocks, molecular repertoire of cell types

Evolution : various species
Human/Chimp, human/mouse, response to exogenous DNA !, microarray & species origin !

Behaviour, Ecology, Technology : population, cells
Aggressiveness, mating, senescence, ecosystem (bacteria + archaea + plants), tissue disruption & RNA level, benchmark comparison (yeast, tissue, single cell)

Transcriptomics in Biology

Microarrays coupled to :

° Biochemistry : enzymatic activities
° Proteomics
° Interactomics
° FISH
° Immunohistochemistry : prostate
° Genotyping : autism
° Transcription Cell Arrest
° Comparative Genomic Hybridisation (CGH)


Ioannidis et al: Repeatability of published microarray gene expression analyses

18 articles (using microarrays, Nature Genetics 2005-2006)

Evaluated by 2 independent teams

Number   Outcome                Comments
2        repeatable
6        partially repeatable   Data Annotation
10       not repeatable         Data NOT Available

16 declared that data are publicly available
13 had GEO number

18 articles (Nature Genetics, 2005-2006)

16 declared that data are publicly available

13 had GEO number

8 Unprocessed raw data available

4 Sufficient detail on data processing documented (others : limited or ambiguous)

2 Not possible to know array platform !

2 Not possible to associate which array to which sample

2 Minor differences in results from independent analysts

5 Discrepancies in results from independent analysts

1 Major discrepancies in results from independent analysts

Reproducibility : good within a laboratory, generally poor between platforms and across laboratories.

Major lab-effect (in particular with two-color platforms)
Overall agreement : approx. 30-40 % [33% - 81%]

Standardisation (MIAME, Toxicogenomics Research Consortium)
RNA expression data comparison : 7 laboratories, 2 standard RNA samples, using 12 microarray platforms. 2 standard microarray types (one spotted, one commercial) were used by all labs.

Reproducibility increased markedly : with standardized protocols for all steps (RNA labeling, hybridization, microarray processing, data acquisition and data normalization).

Reproducibility highest : when the analysis is based on biological themes defined by enriched Gene Ontology (GO) categories.

Need : guidelines for publishing microarray data through the Minimum Information About Microarray Experiments (MIAME) standards

Minimum Information About Microarray Experiments - MIAME

developed by the MGED working group

• Specify minimum information that must be reported with gene expression monitoring experiments

• Ensure interpretability of results

• Facilitate potential verification

• Facilitate establishing repositories

• Enable unique data exchange format

• Many scientific journals require data submission following MIAME standards

Web Resources : Public Databases

GEO : http://www.ncbi.nlm.nih.gov/geo/
  (NCBI) Gene expression data repository & online resources

ArrayExpress : http://www.ebi.ac.uk/arrayexpress
  EBI database & analysis tools

Chip DB : http://young39.wi.mit.edu/chipdbpublic
  Whitehead Institute array database

ExpressDB
  Harvard, specialized on yeast (and E. coli) data

RAD
  RNA Abundance Database

Stanford Microarray Database : http://dnachip.org
  Stanford Database & online tools

MAdb : http://www.cbil.upenn.edu/rad2/servlet
  National Cancer Institute

GXD : http://informatics/jax.org/
  Jackson Laboratory (specialized on mouse)


http://www.ncbi.nlm.nih.gov/projects/geo/

GEO http://www.ncbi.nlm.nih.gov/geo

Query / Data Sets : 1574 → GDS1574 record
Endotoxin effect on leukocytes: time course (HG-U133A) [Homo sapiens]

Try Clustering (Hierarchical or k-Means)

Statistical testing : Compare 0h to 2h (or 4h)

Can you find inflammation related genes ?
Stat1-6, MAP kinases, Interleukins, NFκB, ILR1&2, TLR1-5, …
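The comparison of two time points for a single gene can be sketched with a Welch t statistic and an exact permutation p-value. The values below are hypothetical log2 intensities, not data from GDS1574, and the function names are illustrative; the permutation approach avoids assuming a Gaussian distribution.

```python
import math
from itertools import combinations
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic (unequal variances) for two groups."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))


def permutation_pvalue(a, b):
    """Two-sided exact permutation p-value based on |t| over all group relabelings."""
    pooled = a + b
    observed = abs(welch_t(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that the mirrored split counts as equally extreme.
        if abs(welch_t(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical log2 intensities of one gene, four replicates per time point.
t0h = [7.1, 6.9, 7.3, 7.0]
t2h = [9.0, 9.4, 8.8, 9.1]

t_stat = welch_t(t2h, t0h)
p_value = permutation_pvalue(t2h, t0h)
print(round(t_stat, 2), round(p_value, 4))
```

With four replicates per group there are only 70 relabelings, so the smallest attainable two-sided p-value is 2/70 (about 0.029). After testing thousands of genes such a value would not survive a multiple-testing correction, which illustrates why replicate numbers matter.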


http://www.ebi.ac.uk/arrayexpress

ArrayExpress Conceptual Overview

Experiment : name, type, …
Array : type, elements, …
Sample : origin, prep, …
Hybridization : protocol, …
Expression Value : pos., fluoresc. intens.
Reference : e.g. publication, www
Sample-Ontology : e.g. organism, taxonomy
Nucl & Prot DB : SwissProt, gb, …


Stanford Microarray Database


Analyze printed arrays on Stanford Microarray Database

Open a web browser and go to :

http://genome-www5.stanford.edu/


Stanford data base 1


Analyze printed arrays on Stanford Microarray Database

Try to test some parameters that may be useful for filtering out non-informative data (signals too close to background, …)

Are ratios dependent on signal intensity ? (this would indicate a possible need for additional lowess normalization…)

Show effect of normalization on histogram of data
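Whether ratios depend on signal intensity can be checked by comparing the average log ratio of weak and strong spots. The sketch below uses the common M (log2 ratio) and A (mean log2 intensity) quantities; the two-channel intensities are hypothetical and the helper names are illustrative.

```python
import math


def ma_values(ch1, ch2):
    """Per spot: A = mean log2 intensity of both channels, M = log2 ratio ch2/ch1."""
    return [
        (0.5 * (math.log2(r) + math.log2(g)), math.log2(r / g))
        for g, r in zip(ch1, ch2)
    ]


def mean_m_by_intensity(ma, n_bins=2):
    """Average M within groups of spots ordered by increasing A."""
    ordered = sorted(ma)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])
    return [sum(m for _, m in b) / len(b) for b in bins]


# Hypothetical channel intensities for six spots.
ch1 = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]
ch2 = [150.0, 280.0, 480.0, 880.0, 1600.0, 3200.0]

low_m, high_m = mean_m_by_intensity(ma_values(ch1, ch2))
print(round(low_m, 3), round(high_m, 3))
```

In this toy data set the weak spots show a clearly positive average log ratio while the strong spots are close to zero, the kind of intensity-dependent bias that a lowess normalization is meant to remove.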

Try to plot :
• all data
• effect of Filter 1
• effect of Filter 2
• combine Filter 1&2

no filtering : n=37013
Filtering cor>0.6 : n=10537
Filtering ch1 S/N : n=13868
Filtering cor + ch1&2 S/N : n=6886


use as default …


this may take a few minutes … next : Proceed to Clustering

this may take a few minutes … next : Proceed to Gene Filtering
… filters to 412 seq_IDs and 20 arrays


you may try as default …

[Figure : time-course samples labelled 1 h, 6 h, 72 h]

Practical exercise (TP) : Stanford DB

SMD Sub-Cluster List 1

SMD_ID                   UniGene     GeneID     Genbank
H3009A06                 Mm.29778    Armet      BG063580
1600000K21               Mm.29778    Armet      AV033958
H3130B10                 Mm.227583   Atp2a2     BG074044
2010000C19               Mm.238973   Atp5b      AV061729
H3056G04                 Mm.260325   Bst2       BG067655
UI-M-AO0-acg-a-04-0-UI   Mm.260325   Bst2       AI841611
H3098E07                 Mm.277665   Calb1      BG071405
H3097C12                 Mm.257765   Clic4      BG071310
UI-M-AO0-acf-h-12-0-UI   Mm.298875   Clta       AI838788
1500040H21               Mm.41614    Cmtm5      AV033409
1810022N06               Mm.292567   Creld2     AV052673
H3023F06                 Mm.296789   Entpd7     BG064768
H3060C06                 Mm.88793    Fga        BG067969
H3154E07                 Mm.3982     Gas6       BG076011
L0079H05                 Mm.371563   H3f3b      AW551507
H3013D08                 Mm.371563   H3f3b      BG063923
2610017D13               Mm.371563   H3f3b      AV112938
1500015A02               Mm.371563   H3f3b      AV030583
C0124D04                 Mm.371563   H3f3b      AW539780
UI-M-AM0-adv-c-03-0-UI   Mm.371563   H3f3b      AI841334
IMAGE:699319             Mm.1351     Hoxc4      AA245472
H3032A08                 Mm.330160   Hspa5      BG065517
H3015H11                 Mm.259672   Ibrdc3     BG064140
2310031I05               Mm.34871    Id2        AV087554
H3118H06                 Mm.34871    Id2        BG073126
H3098G05                 Mm.34871    Id2        BG071421
3200002K22               Mm.110      Id3        AV171079
2310022H23               Mm.24769    Ifi47      AV086656
H3129G05                 Mm.253335   Ifrg15     BG074005
H3038A06                 Mm.253335   Ifrg15     BG066021
H3135C12                 Mm.211654   Ifrg15     BG074451
H3025A01                 Mm.253335   Ifrg15     BG064887
H3157D02                 Mm.33902    Igtp       BG076245
1110065O20               Mm.33902    Igtp       AV016524
H3050H06                 Mm.105218   Irf1       BG067127
IMAGE:555853             Mm.285      Kdr        AI325028
1810059C02               Mm.3152     Lgals3bp   AV059520
H3108B10                 Mm.294753   Litaf      BG072227
1700010C22               Mm.244890   Lrrc6      AV039925
H3027D05                 Mm.788      Ly6e       BG065103
H3143C12                 Mm.1597     Mov10      BG075103
J0806F10                 Mm.270331   Msi2       AU040439
1810074C23               Mm.192991   Mt1        AV061461
H3020C02                 Mm.192991   Mt1        BG064480
2900001J17               Mm.192991   Mt1        AV149953
H3013D11                 Mm.147226   Mt2        BG063925
2310006N23               Mm.147226   Mt2        AV084363
H3107C06                 Mm.213003   Myd88      BG072146
H3008E07                 Mm.250418   Ogfr       BG063542
1020010I05               Mm.49074    Parp9      AV006271
J1021F01                 Mm.17932    Pnp        AU042511
H3116A06                 Mm.27769    Pnrc1      BG072875
H3114E02                 Mm.18347    Psmd7      BG072746
H3059F12                 Mm.15793    Psme2      BG067921
IMAGE:466539             Mm.262      Rhoc       AI324259

Igtp : interferon gamma induced GTPase

Armet : arginine-rich, mutated in early stage tumors

H3f3b : H3 histone, family 3B (H3.3B)

Hoxc4 : homeo box C4

ID2 : inhibitor of DNA binding 2, dominant negative helix-loop

Ifrg15 : interferon alpha responsive gene

Irf1 : interferon regulatory factor 1

Mt1 : metallothionein 1

Mt2 : metallothionein 2

Myd88 : myeloid differentiation primary response gene 88

Pnp : purine-nucleoside phosphorylase

Bst2 : bone marrow stromal cell antigen 2

Creld2 : cysteine-rich with EGF-like domains 2

Gas6 : growth arrest specific 6

Psme2 : proteasome (prosome, macropain) 28 subunit, beta


Data - Mining ?


Finding Gene Interactions

• “dimensionality reduction” : find groups of related genes

Coexpression clustering

Mining Promoter elements

Gene Ontology or other knowledge-bases

Connections to public databases (Genbank, Omim, PDB, patents,...)

Metabolic pathways

Interactome networks

Literature & co-citations

Relations to linkage data

Gene Ontology (GO)
http://www.geneontology.org/

Aim : Establishment of a standardized vocabulary with hierarchical organization to describe the role of genes and proteins.

Advantages :
Standardization
→ clear and distinctive definition of functions
→ comparative genomics at functional level
Brief and unique descriptions, exploitable manually or by programs

3 GO-Ontologies :
• Molecular Function : transcription factor, DNA polymerase, ...
• Biological Processes : mitosis, replication, ...
• Cellular Component : localization, complexes, ...

DAVID : Clustering of Enriched Gene Ontology (GO) terms
http://david.abcc.ncifcrf.gov/
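Enrichment of a GO term in a gene cluster is commonly assessed with a hypergeometric (one-sided Fisher) test. The sketch below is a generic illustration of that calculation, not the specific scoring used by DAVID; all counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_total, n_annotated, n_cluster, n_hits):
    """P(X >= n_hits) when n_cluster genes are drawn from n_total genes,
    of which n_annotated carry the GO term (hypergeometric upper tail)."""
    upper = min(n_annotated, n_cluster)
    favourable = sum(
        comb(n_annotated, i) * comb(n_total - n_annotated, n_cluster - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / comb(n_total, n_cluster)


# Small check that can be verified by hand: 10 genes, 4 annotated,
# cluster of 3 genes containing at least 2 annotated ones.
p_small = enrichment_pvalue(10, 4, 3, 2)

# Hypothetical array-scale case: 10000 genes, 200 annotated with a term,
# cluster of 55 genes containing 8 annotated ones.
p_cluster = enrichment_pvalue(10000, 200, 55, 8)
print(round(p_small, 4), p_cluster)
```

The small case gives (36 + 4) / 120 = 1/3. In the larger case about one annotated gene would be expected by chance, so eight hits yield a very small p-value; since many GO terms are tested at once, these p-values again call for a multiple-testing correction.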

Gene Networks : Based on KEGG
http://www.genome.jp/kegg/

PathwayStudio
http://www.ariadnegenomics.com/products

Page 65: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Gene Networks : STRING (http://string.embl.de/)

Developed at EMBL, SIB and UniZH

Snel B, et al (2000) NAR : STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene.

von Mering C, et al (2007) NAR : STRING 7 - recent developments in the integration and prediction of protein interactions.

Page 66: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Gene Networks based on Ingenuity Pathway Maps (IPA)

Manually mined & curated database of interactions

http://www.ingenuity.com

Thiersch et al, 2008

Page 67: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

More Pathway Programs

Grid (www.dbgrid.org/ )

Mint (mint.bio.uniroma2.it/mint )

Cytoscape (http://www.cytoscape.org)

HiMap (http://www.himap.org/)

GenMAPP (www.genmapp.org)

Bind (www.bind.ca)

BioCarta (www.biocarta.com)

BPS: Biochemical Pathway Simulator(http://www.brc.dcs.gla.ac.uk/projects/bps/links.html)

BioPax (http://cbio.mskcc.org/tools/all.html)

Page 68: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

BIOINFORMATIC ANALYSIS OF GENE EXPRESSION IN THE DEGENERATING RETINA OF THE rd1 MOUSE, A MODEL OF RETINITIS PIGMENTOSA

Thierry Léveillard, José Sahel
Laboratoire de Physiopathologie Cellulaire et Moléculaire de la Rétine, Inserm U592, Université Pierre et Marie Curie

Frédéric Chalmel, Olivier Poch
Département de Biologie et Génomique Structurales, Groupe de Bioinformatique et Génomique Intégratives

Christian Lavedan, George Lambrou
Novartis Institutes for Biomedical Research

Page 69: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

The retina

[Figure: section of the retina with the direction of incoming light. Labels: photoreceptors, bipolar cells, ganglion cells, pigment epithelium, inner membranes. Source: http://webvision.med.utah.edu/]

2 types of photoreceptors :

Rods (H: 95 %) : peripheral and night vision

Cones (H: 5 %) : precise, colour and daytime vision

Page 70: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Cone densities in human retina

Page 71: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Normal human retina Retinitis pigmentosa

Eye fundus examination

Page 72: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Prevalence of retinal degenerations

• Retinitis Pigmentosa (RP) :
Monogenic
40,000 patients in France

• Age-related Macular Degeneration (AMD) :
Multifactorial and polygenic
2 million patients in France
Incidence : 10% at 65, 25% at 75, 60% at 90

Page 73: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Retinitis pigmentosa

Monogenic disease (158 known loci); 30,000 to 40,000 patients in France

[Figure: normal and affected human retina]

First symptoms :
- Adaptation to darkness
- Peripheral visual field

Loss of rods

15 to 30 years

Secondary loss of cones

Complete blindness

Page 74: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

The phototransduction cascade

Page 75: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

The rd1 mouse model (and human recessive RP)

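[Figure: phototransduction cascade in light and in dark. Labels: GC, α, γ, βm, GMP, cGMP. A question mark links the cascade to apoptosis.]

Apoptosis (Chang et al, Neuron, 1993; Portera-Caillaux et al, PNAS, 1993)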

Page 76: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

The rd1 mouse : a good model for studying human retinitis pigmentosa

One mutation in βPDE → increased [cGMP]

Abrupt loss of rods from post-natal day (PN) 10 to 35

Progressive loss of cones (3 months)

[Figure: number of photoreceptors (PR) in a retinal section (y-axis, 0 to 6) versus post-natal weeks (x-axis: 1, 3, 5, 7, 11, 15, 19), with one curve for rods and one for cones. Retinal sections of wild-type and rd1 mice at PN5, PN8, PN9, PN15 and PN35.]

The rd1 mouse : a very good model that mimics the human disease

Page 77: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Cones need rods to survive

Rod therapies

Cone therapies

Page 78: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Developing therapeutic strategies for retinal degenerations

A model : the rd1 mouse

Mutation → loss of rods → loss of cones → BLINDNESS

Differential analyses :
• Transcriptome
• Proteome

Functional screens :
• Cloning of cone viability factors
• Identification of regeneration signals
• Identification of the ligand of a nuclear receptor (PNR)

Transgenic models :
• Mouse
• Drosophila

Identification of the molecular mechanisms of degeneration and of the endogenous mechanisms of neuroprotection

Therapeutic solutions

Page 79: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Method to study rd1 : Affymetrix DNA chips

[Figure: cRNA hybridized to oligonucleotide probes. A probe pair consists of a Perfect Match (PM) and a MisMatch (MM) probe; a probe set groups the probe pairs of one transcript.]

Perfect Match : TGAACTAGCTCGTACCGCTACGGAA

MisMatch : TGAACTAGCTCGXACCGCTACGGAA

Interrogation of 30,000 mRNA transcripts

Each probe cell contains millions of copies of a specific oligonucleotide probe (25-mer)

Each transcript is represented by multiple (20) probe pairs
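The relation between the two probes of a pair can be sketched as follows. The slide marks the altered base of the MisMatch probe with an X at position 13; the sketch assumes the usual Affymetrix convention that this central base is replaced by its complement. The function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(perfect_match):
    """Build the MM probe by complementing the central base of a 25-mer PM probe."""
    if len(perfect_match) != 25:
        raise ValueError("expected a 25-mer probe")
    middle = 12  # 0-based index of the 13th base
    swapped = COMPLEMENT[perfect_match[middle]]
    return perfect_match[:middle] + swapped + perfect_match[middle + 1:]

pm = "TGAACTAGCTCGTACCGCTACGGAA"  # Perfect Match sequence from the slide
mm = mismatch_probe(pm)
```

The MM probe differs from the PM probe at exactly one position, so its signal serves as an estimate of non-specific hybridization for that probe pair.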

Page 80: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

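Mouse Affymetrix DNA chips

wt strain and rd1 strain, each sampled at 11 time points from birth to rod loss : 1d, 4d, 5d, 8d, 9d, 10d, 11d, 12d, 14d, 15d, 35d

36700 clones (Affymetrix probe sets), each with 2 x 11 expression measurements

[Figure: expression level of Rhodopsin over the time course; cell number (y-axis, 0 to 6) of rods and cones versus post-natal week (x-axis: 1, 3, 5, 7, 11, 15, 19).]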

Page 81: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Pre-analysis of the data

36700 probe sets

Removed : clones with maximal expression level < 100

Removed : clones with standard deviation < 10

23504 probe sets

Goals :

• Group genes with close expression profiles

• Select clusters with a significant difference between the two strains

• Analyse each clone individually and characterize its cluster
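The two filters of this pre-analysis step can be sketched as below. The thresholds (100 and 10) come from the slide; the rest is an assumption: a probe set is kept only if it passes both filters, and the sample standard deviation is used. Probe-set names and values are invented.

```python
import statistics

def keep_probe_set(values, min_max_expression=100.0, min_sd=10.0):
    """Keep a probe set only if it is both expressed and variable (assumed AND rule)."""
    return max(values) >= min_max_expression and statistics.stdev(values) >= min_sd

probe_sets = {
    "ps_low": [20.0, 35.0, 28.0, 40.0],       # never reaches 100: dropped
    "ps_flat": [500.0, 502.0, 499.0, 501.0],  # expressed but nearly constant: dropped
    "ps_ok": [80.0, 150.0, 400.0, 950.0],     # expressed and variable: kept
}
kept = [name for name, values in probe_sets.items() if keep_probe_set(values)]
```

Filtering before clustering removes profiles that carry little information, which reduces both noise and computing time.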

Page 82: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Quality Index (W. Raffelsberger RetNet RTN)

• Information about reliability of measurement
- Homogeneity within probe-set
- Replicate measurements
To be considered in combination with the degree of gene regulation

• Challenge : log-transformed data containing negative values

• CV (s/μ) : good for signal intensities > 1, distortion/bias at signal intensities -1 < x < +1

• New method based on tanh
- No bias for values < 1
- Same procedure used for probe-set homogeneity and replicates

• Each measure gets a corresponding Quality Index
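The bias of the coefficient of variation near zero can be seen on a small example. The sketch below only illustrates the problem stated above; it does not reproduce the tanh-based method, whose formula is not given on the slide. Values are invented log-scale signals.

```python
import statistics

def coefficient_of_variation(values):
    """CV = s / mean, with s the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values)

high = [9.8, 10.0, 10.2]  # log-scale signals well above 1
low = [-0.1, 0.1, 0.3]    # same spread, but mean close to zero

cv_high = coefficient_of_variation(high)
cv_low = coefficient_of_variation(low)
```

Both triplets have the same standard deviation (0.2), yet the CV of the low-intensity triplet is about one hundred times larger, simply because its mean is close to zero.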

Page 83: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Replicate Quality Indicator (rqi)

• Comparison of CV (s/μ) and tanh-based method relative to signal intensity
- No bias for values < 1
- Same procedure used for probe-set homogeneity and replicates

[Figure: rqi versus signal intensity; probe-set Quality Indicator versus probe-set SE.]

Page 84: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Avoiding Artifacts in Differential Regulation

• Low level gene regulation : in most cases artifacts

• High level gene regulation : true regulation; even if the measurement is only semiquantitative, the biological conclusion remains the same

• Integrated Model

Result : low stringency at high gene regulation, moving gradually to high stringency at low regulation

Page 85: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Clustering, DPC program (Density of Point Clustering)

23504 Probe Sets

22 conditions = 22 dimensions (e.g. 1d wt, 10d wt, 15d wt, 35d wt, 1d rd1, 12d rd1)

[Figure: clustering step assigning genes to Cluster 1, Cluster 2, Cluster 3, ...; normalised expression level plotted for the wt strain and the rd1 strain from day 1 to day 35.]

7 clusters grouping genes with close expression profiles

Cluster 0 : 2483 PS
Cluster 1 : 5535 PS
Cluster 2 : 1768 PS
Cluster 3 : 3904 PS
Cluster 4 : 2323 PS
Cluster 5 : 2459 PS
Cluster 6 : 5032 PS
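The slide plots a "normalised" expression level but does not say how it is computed. One common choice, assumed here, is to standardize each probe-set profile to mean 0 and standard deviation 1 so that clustering compares the shape of profiles rather than their absolute level. This sketch is not the DPC program itself.

```python
import statistics

def standardize(profile):
    """Centre a profile on 0 and scale it to unit standard deviation."""
    mean = statistics.mean(profile)
    sd = statistics.stdev(profile)
    return [(v - mean) / sd for v in profile]

raw = [100.0, 200.0, 300.0, 400.0, 500.0]  # invented signal values
scaled = standardize(raw)
```

After this transformation, a weakly and a strongly expressed gene with the same time course end up close to each other in the 22-dimensional space.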

Page 86: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Selection of clusters with a differential wt/rd1 expression profile (Hotelling T² test)

Cluster 0 : 2483 PS, D = 143
Cluster 1 : 5535 PS, D = 21
Cluster 2 : 1768 PS, D = 240
Cluster 3 : 3904 PS, D = 20
Cluster 4 : 2323 PS, D = 432
Cluster 5 : 2459 PS, D = 467
Cluster 6 : 5032 PS, D = 21

4 selected clusters (0, 2, 4 and 5) : 9033 probe sets out of 23504 PS
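The slide does not state how the value D is derived from the Hotelling test. As a hedged illustration, the textbook two-sample Hotelling T² statistic with a pooled covariance matrix is sketched below; treating the wt and rd1 profiles of a cluster as the two groups is an assumption, and the toy data are two-dimensional for brevity.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter_matrix(rows, mean):
    """Sum of outer products of the centred rows."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [v - m for v, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def hotelling_t2(group1, group2):
    """Two-sample Hotelling T^2 with a pooled covariance matrix."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean_vector(group1), mean_vector(group2)
    s1, s2 = scatter_matrix(group1, m1), scatter_matrix(group2, m2)
    p = len(m1)
    pooled = [[(s1[i][j] + s2[i][j]) / (n1 + n2 - 2) for j in range(p)]
              for i in range(p)]
    diff = [a - b for a, b in zip(m1, m2)]
    w = solve(pooled, diff)
    return (n1 * n2) / (n1 + n2) * sum(d * wi for d, wi in zip(diff, w))

# Toy data: the second group is the first one shifted by 5 in each dimension.
wt = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
rd1 = [[6.0, 7.0], [7.0, 6.0], [8.0, 9.0], [9.0, 8.0]]
t2 = hotelling_t2(wt, rd1)
```

The statistic grows with the distance between the two mean profiles relative to the within-group scatter, and it is zero when the two groups have the same mean. The pooled covariance matrix must be invertible, which requires more observations than dimensions.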

Page 87: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Selection of the most differentially expressed probe sets

[Figure: normalised expression profiles of wt mice and rd1 mice for the four selected groups (y-axis from -2.0 to 2.0, up to 3.0 for Group 4; x-axis: days 1, 4, 5, 8, 9, 10, 11, 12, 14, 15, 35).]

Group 0 : 2483 clones
Group 2 : 1768 clones
Group 4 : 2323 clones
Group 5 : 2459 clones

Selection criteria :
PN35, wt/rd1 ≥ 3 : 334 clones
PN35, rd1/wt ≥ 3 : 494 clones
PN9, rd1/wt ≥ 3 : 251 clones
PN8, wt/rd1 ≥ 3 : 245 clones

4 groups and 1324 target probe sets
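The ratio criterion used for this selection can be sketched as a simple filter. The threshold of 3 comes from the slide; the function name, the probe-set names and the signal values are invented, and the guard against zero or negative denominators is an added assumption.

```python
def select_by_ratio(expr_a, expr_b, min_ratio=3.0):
    """Return probe sets where expr_a / expr_b >= min_ratio."""
    selected = []
    for probe_set, a in expr_a.items():
        b = expr_b[probe_set]
        if b > 0 and a / b >= min_ratio:
            selected.append(probe_set)
    return selected

# Invented signal values at one time point (e.g. PN35).
wt = {"ps1": 900.0, "ps2": 120.0, "ps3": 50.0}
rd1 = {"ps1": 100.0, "ps2": 110.0, "ps3": 400.0}

down_in_rd1 = select_by_ratio(wt, rd1)  # wt / rd1 >= 3
up_in_rd1 = select_by_ratio(rd1, wt)    # rd1 / wt >= 3
```

Running the filter in both directions separates probe sets that are lower in rd1 from those that are higher in rd1.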

Page 88: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Overview of analysis protocol

Mouse Affymetrix data
→ Clustering step
→ Cluster selection (Hotelling T² test)
→ 1324 Probe sets
→ Cluster analysis by statistical validation (z score)

Annotations analysed :
- Gene Ontology (GO)
- Protein domains
- Tissue type & Dev. stage
- Human localisation(s)

Page 89: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Functional Genomics : Gscope platform

Integration, validation and analysis of high throughput information

Large amount of “linked” biological data (genome, 3D complex, organelle, I.P., differential display, “omics” data…)

Common, universal analyses and cross-validation:
* database searches
* complete alignment & differential conservation
* data over-representation (z-score)
* …

Specialized analyses, cross-validation, outputs according to the biological link:
* co-expression (data or database)
* synteny, comparative genomics (genome)
* spatial confinement (IP, interactomics…)
* target characterisation (structural project…)
* …

Page 90: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Gscope: genomic platform

DNA and/or Proteome

Common steps

Initial GScope database creation

DNA processing
• ORF location
• External programs (Glimmer, tRNAscan, CodonW...)

Database Searches
• BlastP on Swissprot, TrEmbl, PDB
• BlastX for missing ORFs
• TBlastN on complete genomes

Pipe-Align

Clustering schemes

General Information Values
- ORF (Overlap, length,…)
- GC content
- Codon Usage
- Shine-Dalgarno presence
- Start codon (M, V or L)

Homologue Counts
- overall
- structures
- paralogues
- in complete genome

Specialised steps

Sequence validation :
- Homolog detection agreement
- Validated start codon

Phylogenetic relationships :
- Gene cluster maintenance
- Gene losses
- Distance tree analysis

Structural information :
- Domain organisation
- Production in E. coli, Yeast
- Hydrophobicity index
- Hydrophobic helices…

Target identification :
- X-HDA analysis
- Validated GO (z score)
- Integrated synteny

Alvinella annotation :
- cDNAs & target characterisation

Data cross-correlation

Predictions

Page 91: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Cluster analysis using the Gene Ontology

Question : Is this cluster enriched in a particular Gene Ontology term ?

N : number of proteins in all clusters or in the whole human proteome

R : number of proteins in a cluster

n : number of proteins having this GO term

r : number of proteins in a cluster having this GO term

Statistical estimation with a z score

$$ z = \frac{\text{observed} - \text{expected}}{\text{std. deviation(expected)}} = \frac{r - n\frac{R}{N}}{\sqrt{n\,\frac{R}{N}\left(1-\frac{R}{N}\right)\left(1-\frac{n-1}{N-1}\right)}} $$

z score > 0 : the group is enriched in this GO term

Our threshold for a significant enrichment is z score > 1.5
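A minimal sketch of this enrichment score, assuming the z score is (observed - expected) divided by the standard deviation of the hypergeometric distribution, with N, R, n and r as defined above. The function name and the numbers in the example are invented.

```python
import math

def go_z_score(N, R, n, r):
    """z score of observing r annotated proteins in a cluster of size R."""
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

# 1000 proteins in total, a cluster of 100, 50 proteins carry the GO term,
# 15 of them fall in the cluster (5 would be expected by chance).
z = go_z_score(N=1000, R=100, n=50, r=15)
enriched = z > 1.5
```

With these numbers the z score is about 4.8, well above the 1.5 threshold. When r equals the expected count, the z score is 0.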

Page 92: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

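Cluster 2 analysis

[Figure: bar chart of GO term z scores (axis ticks 1 to 5), with the human proteome as reference; expression profile of the cluster in wt and rd1 from day 1 to day 35.]

Enriched GO terms : ligand-gated ion channel; nucleic acid binding activity; cell differentiation; development; spermatogenesis; RNA processing; energy pathways

Summary :
- NA binding
- Transcription factor
- Cell differentiation genes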

Page 93: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Cluster 0 analysis

[Figure: bar chart of GO term z scores (axis ticks 1 to 9), with the human proteome as reference; expression profile of the cluster in wt and rd1 from day 1 to day 35.]

Enriched GO terms : clathrin-coated vesicle; integral to plasma membrane; plasma membrane; integral to membrane; transporter; ion channel; channel/pore class transporter; cation channel; voltage-gated ion channel; lipid binding; sodium ion transport; metal ion transport; cation transport; monovalent inorganic cation transport; ion transport; transport; synaptic transmission; signal transduction; cell communication; cell-cell signaling; intracellular signaling cascade

Summary :
- Ion Transporter (ex Ca2+)
- Synaptic function
- Membrane

Page 94: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Cluster 5 analysis

[Figure: bar chart of GO term z scores (axis ticks 1 to 12), with the human proteome as reference; expression profile of the cluster in wt and rd1 from day 1 to day 35.]

Enriched GO terms : mitochondrial membrane; protein transporter; purine nucleotide binding; anion transporter; phosphoric ester hydrolase; carrier; adenyl nucleotide binding; ion transporter; ATPase; hydrolase; GTP binding; transferase; guanyl nucleotide binding; ATP binding; ATPase activity, coupled; intracellular transport; intracellular protein transport; protein transport; response to abiotic stimulus; vision; sensory perception; perception of abiotic stimulus; regulation of cell proliferation; fatty acid metabolism; anion transport; protein modification; P-P-bond-hydrolysis-driven transporter

Summary :
- Vision (Phototransduction)
- Vision establishment
- GTP, ATP binding, Hydrolase
- Ion, protein transport

Page 95: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Cluster 4 analysis

[Figure: bar chart of GO term z scores (axis ticks 1 to 31), with the human proteome as reference; expression profile of the cluster in wt and rd1 from day 1 to day 35.]

Enriched GO terms : actin cytoskeleton; microfibril; ribonucleoprotein complex; structural constituent of eye lens; structural molecule; endopeptidase; chymotrypsin; trypsin; RNA binding; transmembrane receptor; cAMP-dependent protein kinase; actin binding; response to stress; chromosome organization and biogenesis; establishment and/or maintenance of chromatin architecture; cell adhesion; response to biotic stimulus; response to pest/pathogen/parasite

Summary :
- Response to stress
- Peptidase (ex Kallikrein)
- Structural protein (ex Crystallin)
- Cell Adhesion

Page 96: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Integrative kinetic model

[Figure: wt and rd1 expression profiles of the four clusters over PN days 1, 4, 5, 8, 9, 10, 11, 12, 14, 15 and 35, placed on a scheme of the retina (inner retina, synapses, outer retina).]

Cluster 4 : Response to stress
Cluster 5 : Phototransduction
Cluster 2 : Transcription
Cluster 0 : Synaptic function

Chalmel et al. PNAS biology submitted

Page 97: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

First validation : Real-time RT-PCR analysis

Control : known genes

Data : relative expression level versus rd1 retina at PN35. Number after RA entry : ratio wt/rd1 at PN35 as scored by Affy. probes

[Figure: bar charts of relative expression versus 35d rd1 for wt and rd1, one panel per gene; Jagged-1 and Rhodopsin are shown at 8 days, 15 days and 35 days. Some panels carry the labels Br and PR.]

Genes tested :
- Jagged-1 (RA1906: 3.0; RA1989: 11.3)
- Arrestin (RA2654: 14.7)
- Aquaporin 1 (RA2635: 6.3)
- Adenylate cyclase 6 (RA1742: 4.0)
- Alkaline phosphatase (RA2627: 13.3)
- α-Transducin 1 (RA2715: 272)
- cGMP-gated channel (RA2669: 68.9)
- HRG4 (RA2673: 5.0)
- Pde6a (RA1718: 11.7)
- Pde6b (RA2664: 311; RA2321: 141)
- Phospholipase A2R (RA2707: 8.3)
- HSP1a (RA2646: 8.6)
- RCK (RA2642: 5.0; RA2057: 10.2)
- CDR2 (RA1795: 3.3; RA2630: 17.7)
- RPGRIP (RA1953: 4.5; RA2057: 21.6)
- Shab (RA1730: 7.7; RA2442: 21.3)
- Rhodopsin (RA2680: 373; RA2316: 37.1)

Page 98: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Second validation : In situ hybridization

In situ hybridization was performed on frozen retinal sections using riboprobes. The specificity of the signal obtained with antisense probes was checked using the corresponding sense probe.

Page 99: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Developments : Automated statistical and biological inter-relational analysis

Step 1 : Automated and standardized classification

- New statistically founded clustering algorithms
- Scoring methods for informational enrichment evaluation
- Automated phylogenetic and transcriptomic promoter analysis
- Data Driven Clustering: ClustPack and Gscope links

Cross-Validation

Page 100: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Developments: Promoter Analysis

1) Validated TSS in various complete or ongoing genomes

2) TRANSFAC® Professional 8.1 (2004-03-31)

High quality multiple alignment of the TSS regions

Matrix Search for Transcription Factor Binding Sites

Phylogenetic footprinting and shadowing

Prediction validation, statistical cross-validation…

16CC3 (Q96DA0, Hypothetical protein - 16p13.3)

Four Procure targets : hypothetical proteins (EX0062, EX1061, EX0468, EX0127)

Androgen Receptor binding site
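A matrix search for transcription factor binding sites slides a position weight matrix along the promoter sequence and scores every window. The sketch below is a generic illustration with an invented 4-bp count matrix, a uniform background and a pseudocount of 1; it does not use TRANSFAC data or reproduce the scoring of any particular tool.

```python
import math

# Toy position frequency matrix for a 4-bp motif (invented counts, not TRANSFAC data).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds scores against a uniform background."""
    length = len(counts["A"])
    matrix = []
    for pos in range(length):
        total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2((counts[b][pos] + pseudocount) / total / background)
            for b in "ACGT"
        })
    return matrix

def scan(sequence, matrix):
    """Score every window of the sequence; return (position, score) pairs."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

pwm = log_odds_matrix(COUNTS)
hits = scan("TTACGTCC", pwm)
best_position, best_score = max(hits, key=lambda h: h[1])
```

The best-scoring window starts at position 2 (0-based), where the sequence matches the consensus ACGT of the toy matrix. In practice, a score threshold and conservation across species (phylogenetic footprinting) help to limit false positives.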

Page 101: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

STRING - Search Tool for the Retrieval of Interacting Genes/Proteins

Page 102: BioInformatique : Transcriptomique IIlecompte/cours/CoursTranscrII_Sept2010.pdf · Hierarchical Cluster Analysis. Metrics. for measuring distances : Euclidean distance Correlation

Laboratory of Integrative Genomics and Bioinformatics, IGBMC, Strasbourg