Bioinformatics : Transcriptomics II, ULP 2010
Wolfgang Raffelsberger, Olivier Poch
Laboratoire de Bio-Informatique et Génomique Intégrative, IGBMC, Strasbourg
Summary : Principal Steps in Transcriptomics (DNA microarrays)
© Wolfgang Raffelsberger, IGBMC, 2008
Biological Question
• mRNA, QC
• hybridize labelled cDNA on microarray
• fluorescence imaging
• (Summarize,) Normalize
• Filter informative / non-informative genes
• Clustering
• Statistical inference testing
Quality Control
Steps and Options in Data Analysis
Biological Question
Generation of Gene Raw Expression Data
[Table : raw expression values for samples A and B, e.g. 50 / 100 and 70 / 40, …]
Data Preprocessing : Eliminate Bias
Normalization (Scaling) : linear, non-linear (LOWESS)
Inferential Statistics : assign confidence to discovery of regulated genes
- Parametric Tests : t-Test, ANOVA
- Non-Parametric Tests : Wilcoxon, Mann-Whitney
Descriptive Statistics : exploratory techniques
- Define Similarity Measure : Pearson coefficient of correlation, Euclidean distance
- Unsupervised (Discovery) : Clustering (hierarchical, k-means, SOM), PCA, Multidimensional Scaling (MDS)
- Supervised : Classification (support vector machines (SVM), linear discriminant analysis (LDA), decision trees)
Statistical Tests for Differential Gene Expression: Nicolas Wicker
• Gaussian distribution ? → Kolmogorov test
▫ t-Test with correction
  - Bonferroni (for n times H0 hypothesis) : extremely stringent !
  - FDR : step-up procedure (Benjamini & Hochberg 1995), less stringent
▫ Multiple groups : ANOVA, F-statistics
• non-Gaussian :
▫ Wilcoxon (-Mann-Whitney)
▫ Permutation tests (Significance Analysis of Microarrays) : 2 versions
  - Efron (2000) : weak control of PFER
  - Tusher (2001) : strong control of PFER
• Adjusted p-values (Westfall & Young 1993) : test level (size) need not be determined in advance; step-down procedure, less conservative than Bonferroni, Holm or Hochberg
• Neighborhood analysis (Golub et al) : weak control of FWER
• SVM, Bayesian approaches
Error rates : PFER Per-Family Error Rate; PCER Per-Comparison Error Rate; FWER Family-Wise Error Rate; FDR False Discovery Rate
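The difference in stringency between the Bonferroni correction and the Benjamini & Hochberg step-up procedure can be illustrated with a minimal sketch. The p-values below are invented for illustration, and the function names are hypothetical helpers, not part of any package named in these slides.

```python
import numpy as np

def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg (1995) step-up adjusted p-values (FDR control)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # ranks, smallest p first
    scaled = p[order] * m / np.arange(1, m + 1)    # p * m / rank
    # step-up: running minimum taken from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

# invented p-values for 8 hypothetical genes
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
n_bonf = int((bonf <= 0.05).sum())   # genes retained under Bonferroni
n_bh = int((bh <= 0.05).sum())       # genes retained under FDR control
print(n_bonf, n_bh)
```

With these invented values the FDR procedure retains more genes than Bonferroni at the same 0.05 level, which is the sense in which the slide calls it less stringent.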
Myth : “Microarray investigations are unstructured data-mining adventures without clear objectives”
Solution
• Good microarray studies have clear objectives, while avoiding gene-specific mechanistic hypotheses
• Design and analysis methods tailored to study objectives
Data Filtering
• Numerous algorithms require data reduction : CPU & RAM usage increases ~exponentially with the number of genes processed
• However : the mass of constant genes contains information about the noisiness of the system, background ...
• Filtering to remove poor and irrelevant data
Filters : low-intensity cutoff, genes with low inter-sample variability, replicate variance trimming, intensity-dependent z-score cutoffs, Flags or Calls, dedicated tools & packages
• non-DE (not differentially expressed) genes decrease the statistical sensitivity of detection
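Two of the filters listed above (a low-intensity cutoff and a low inter-sample variability cutoff) can be sketched as follows. The expression matrix, the threshold values and the function name `filter_genes` are assumptions made for illustration only.

```python
import numpy as np

def filter_genes(expr, min_max_intensity, min_sd):
    """Return a boolean mask of genes (rows) to keep.

    A gene is kept when its highest signal reaches min_max_intensity
    (low-intensity cutoff) and its sample standard deviation across
    samples reaches min_sd (inter-sample variability cutoff).
    """
    expr = np.asarray(expr, dtype=float)
    bright_enough = expr.max(axis=1) >= min_max_intensity
    variable_enough = expr.std(axis=1, ddof=1) >= min_sd
    return bright_enough & variable_enough

# invented data: 4 genes (rows) x 4 samples (columns)
expr = np.array([
    [50, 60, 55, 58],       # close to background
    [500, 505, 498, 502],   # expressed but constant
    [120, 480, 900, 300],   # expressed and variable
    [30, 200, 35, 40],      # variable, one strong sample
])
keep = filter_genes(expr, min_max_intensity=100, min_sd=10)
print(keep)
```

The rows removed here are not discarded information in every analysis: as noted above, constant genes still describe the noise of the system, so the mask can be kept and applied only where data reduction is needed.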
Steps and Options in Data Analysis
Generation of Gene Raw Expression Data
Data Preprocessing : Eliminate Bias, Filtering
Normalization (Scaling) : linear, non-linear (LOWESS)
Inferential Statistics Descriptive Statistics
Define Similarity Measure
Unsupervised Clustering Supervised : Classification
Parametric Tests Non-Parametric Tests
Biological Question
Statistical Tests and Clustering Pipeline
platforms & tools :
• raw data processing
• filtering
• Statistical Tests : differential expression, outlier values
• Clustering : differential expression
• Functional Enrichment of Groups / Clusters
What is Clustering ?
Basic Definition of “Cluster” : “1) close group of similar things growing together 2) close group of people, animals, stars, ... etc.” (Oxford Encyclopedic Dictionary)
• Arrange a population of elements based on their properties
Bioinformatics Context
• Arrange genes in a way where similarly behaving genes get placed close/next to each other
- gene expression patterns
- gene location on a genome
- based on functional domains or protein-interactions
Hierarchical Cluster Analysis
Data : Measure all distances between individual points
Descending : initially 1 class; create a new class for the most distant (i+1)
Ascending : initially n classes; combine the closest into one class (i-1)
Build Dendrogram by branching off classes
[Figure : dendrogram; leaf order 5 1 2 3 4 is equivalent to 2 1 5 3 4]
Hierarchical Cluster Analysis
Metrics for measuring distances : Euclidean distance, Correlation (Pearson), Spearman rank, …
• Single Linkage : minimum cluster-to-cluster distance between members of one cluster and members of the other cluster
• Complete Linkage : maximum cluster-to-cluster distance
• Average Linkage : average cluster-to-cluster distance (average of all individual point-to-point distances)
• Centroid Linkage : distance between the cluster centroids (mean profiles)
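The ascending (agglomerative) procedure and the linkage rules can be sketched in a few lines. This is a naive illustration using Euclidean distance on invented two-dimensional profiles; the function names are hypothetical and the quadratic search is not how dedicated tools such as Cluster implement it.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def cluster_distance(c1, c2, linkage):
    """Distance between two clusters under the chosen linkage rule."""
    pairwise = [euclidean(a, b) for a in c1 for b in c2]
    if linkage == "single":
        return min(pairwise)
    if linkage == "complete":
        return max(pairwise)
    if linkage == "average":
        return sum(pairwise) / len(pairwise)
    if linkage == "centroid":
        return euclidean(centroid(c1), centroid(c2))
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, n_clusters, linkage="average"):
    """Ascending clustering: start with n classes, merge the two closest each step."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        candidates = ((i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        i, j = min(candidates,
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage))
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

# invented profiles (2 conditions per gene)
profiles = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.2)]
result = agglomerate(profiles, n_clusters=3)
print(result)
```

Recording the order and distance of each merge, instead of stopping at a fixed number of clusters, would give the information needed to draw the dendrogram.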
Hierarchical Clustering : Average Linkage
Different metrics available for measuring distance : Euclidean distance, Correlation (Pearson), Spearman rank, …
[Figure : clustered expression profiles; time points 1 h, 6 h, 72 h]
Self-Organizing Maps (SOMs) or Topological Maps (Kohonen)
Partitioning Method : divide into k classes and distribute (optimized) following a predefined criterion (grid)
Example for a 3 x 2 geometry :
[Figure : data points, initial grid (rectangular), node migration (hypothetical), final nodes 1 to 6; wr, modified from Tamayo et al, 1999]
n observations, K classes : ≈ K^n / K! partitions
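The partition count quoted for n observations and K classes can be checked numerically. The sketch below compares the expression K^n / K! with the exact number of ways to split n items into K non-empty classes (the Stirling number of the second kind); reading the slide's figure as an approximation of that exact count is an assumption, and the function names are hypothetical.

```python
from math import comb, factorial

def count_partitions(n, k):
    """Exact number of ways to split n items into k non-empty, unlabelled classes."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

def approx_partitions(n, k):
    """Approximation K^n / K!, close to the exact count when n is much larger than k."""
    return k ** n / factorial(k)

for n, k in [(4, 2), (10, 3), (25, 4)]:
    print(n, k, count_partitions(n, k), approx_partitions(n, k))
```

Even for a few dozen observations the count is far too large for exhaustive search, which is why partitioning methods optimize iteratively instead of enumerating all partitions.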
Graphical Cluster Profile Representation
• Cluster Profile (option Box-Plot)
• all individual genes
Cho, 1998 : Yeast cell cycle, n = 3000, k-Means
More Options in Clustering
• Determination of number of clusters
- not all data must be forced into clusters !
- apply a stop-rule during clustering
- Principal Component Analysis to determine number of clusters
• Clustering methods capable of analyzing :
- non-Gaussian distributions
- qualitative ‘data’ (e.g. domains, literature citations …)
Principal Component Analysis : Example
[Figure : scatter plot of x2 versus x1 with new axes y1 and y2; corr (x1, x2) = 0.90]
- Search the axis along which the data are most dispersed (here, along the x1-x2 correlation)
- Define this as (First) Principal Component (y1 captures the major fraction (~90%) of the dispersion)
- Take 2nd axis orthogonal (i.e. 0 correlation)
Results of PCA : small set of orthogonal (independent) variables containing most of the variance
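The two-variable example can be reproduced as an eigen-decomposition of the correlation matrix. The sketch assumes both variables are standardized (unit variance), which the slide does not state; under that assumption a correlation of 0.90 gives eigenvalues 1.9 and 0.1, so the first component carries 95 % of the variance.

```python
import numpy as np

# correlation matrix of two standardized variables with corr(x1, x2) = 0.90
corr = np.array([[1.0, 0.9],
                 [0.9, 1.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]      # largest variance first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # columns are the principal axes y1, y2

explained = eigenvalues / eigenvalues.sum()
print(explained)             # fraction of variance carried by y1 and y2
print(eigenvectors[:, 0])    # direction of y1 (sign is arbitrary)
```

The first axis points along the diagonal of the x1-x2 plane and the second is orthogonal to it, matching the construction described above.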
Comparison : Array-data and real-time PCR
[Figure : x-fold gene induction measured by array and by real-time PCR for Myc, Bfl1, Lsp1, TGM2, Elan and G1P3; samples NB4 untreated (A, B) and NB4 18h atRA (C)]
12 additional genes showed no changes by both methods
Raffelsberger et al, Genomics 2002
Accuracy and Precision
Precision : reproducibility; check by replicates; poor precision results from poor technique
Accuracy : correctness; check by a different method; poor accuracy results from procedural or equipment flaws
[Figure : good precision with low accuracy versus low precision with good accuracy]
Web Resources : General Information
Brown Lab Guide (http://cmgm.stanford.edu/pbrown/) : Microarrays protocols and arrayer construction.
DeRisi Lab (http://www.microarrays.org/) : Software : ArrayMaker, ScanAlyze, Cluster, Treeview
SMD guide (http://genome-www5.stanford.edu/) : Stanford Microarray Database, links
NIH microarray project (http://research.nhgri.nih.gov/microarray/main.html) : Protocols, software (ArraySuite)
EBI (http://www.ebi.ac.uk/microarray/) : Protocols, software
TIGR (http://www.tigr.org/tdb/microarray/) : Protocols, software
IGBMC (http://www-microarrays.u-strasbg.fr/) : General information, IGBMC related details
gene-chips (http://www.gene-chips.com/) : Software ArrayTack (integrated software system for managing, mining, and visualizing microarray gene expression data)
Davidson College Genomics (http://www.bio.davidson.edu/) : General information, flash animation
Web Resources : Data Analysis Tools
Eisen Lab : Michael Eisen's suite for image quantitation and data analysis (Scanalyze, Cluster, TreeView).
Expression Profiler : Online clustering and analysis tools (EBI)
GenEx : Database, repository and analysis tools (NCGR)
MAExplorer : MicroArray Explorer for data mining Gene Expression
TM4 (http://www.tm4.org/) : TIGR Microarray Data Analysis System (MIDAS) : data quality filtering and normalization tool for data processing (various normalizations, filters, and transformations), analysis pipeline
MAXD : Downloadable data warehouse and visualisation for expression data
BioInformatics TU Graz (http://genome.tugraz.at/) : Analysis tools, Genesis
Jexpress : Java tools for gene expression data analysis
Biological Question, Analysis, Interpretation, Verification …
Transcriptomics in Biology: Jan-Oct 2009
Transcriptomics in Biology : 01 Sept - 06 Oct. 2008 !!!
Molecular pathways : induced/non induced by mutation, drug, disease, infection…
Cell cycle, apoptosis, antigen presentation, immune response, tolerance to hypoxia, TF pathways (NR, Homeobox, MAP kinase pathways…), bacterial growth, immunology & malaria, hypoxia & cancer, prefrontal & alcoholism, virus & macrophage survival, biological clocks & drugs, superoxide dismutase & brain, neuroprotection, neo-vascularisation, plant resistance to virus or fungi
Molecular Diagnosis : disease/control cells, population
Cancer : prostate, gastric, lung, breast, colorectal, lymphoma, ovarian, brain
Pathways : MAP kinase, cell channel
Susceptibility & Risk evaluation : obesity, cancer, autism, psoriasis
Progression/Prognostic : affected/control, time series, population
Particular genotype implication in progression/survival : cancer (metastasis), disease (autism, schizophrenia, Alzheimer, Hodgkin)
Susceptibility & Risk evaluation : cancer, alcoholism, graft rejection, allograft rejection, blood cell transplantation, retroviral integration, neuroplasticity (cocaine)
Development : time series
Organs : brain, liver, hepatocyte, spermatogenesis, kidney
Biological clocks, molecular repertoire of cell types
Evolution : various species
Human/Chimp, human/mouse, response to exogenous DNA!, microarray & species origin!
Behaviour, Ecology, Technology : population, cells
Aggressiveness, mating, senescence, ecosystem (bacteria + archaea + plants), tissue disruption & RNA level, benchmark comparison (yeast, tissue, single cell)
Transcriptomics in Biology
Microarrays coupled to :
° Biochemistry : enzymatic activities
° Proteomics
° Interactomics
° FISH
° Immunohistochemistry : prostate
° Genotyping : autism
° Transcription Cell Arrest
° Comparative Genomic Hybridisation (CGH)
Ioannidis et al: Repeatability of published microarray gene expression analyses
18 articles (using microarrays, Nature Genetics 2005-2006)
Evaluate by 2 independent teams
Number Outcome Comments
2 repeatable
6 partially repeatable Data Annotation
10 not repeatable Data NOT Available
16 declared that data are publicly available
13 had GEO number
8 Unprocessed raw data available
4 Sufficient detail on data processing documented
(others : limited or ambiguous)
2 Not possible to know array platform !
2 Not possible to associate which array to which sample
2 Minor differences in results from independent analysts
5 Discrepancies in results from independent analysts
1 Major discrepancies in results from independent analysts
Reproducibility : good within a laboratory, generally poor between platforms and across laboratories.
Major lab-effect (in particular with two-color platforms)
Overall agreement : approx. 30-40 % [33% - 81%]
Standardisation (MIAME, Toxicogenomics Research Consortium)
RNA expression data comparison : 7 laboratories, 2 standard RNA samples, 12 microarray platforms. 2 standard microarray types (one spotted, one commercial) were used by all labs.
Reproducibility increased markedly with standardized protocols for all steps (RNA labeling, hybridization, microarray processing, data acquisition and data normalization).
Reproducibility was highest when analysis was based on biological themes defined by enriched Gene Ontology (GO) categories.
Need : guidelines for publishing microarray data through the Minimum Information About Microarray Experiments (MIAME) standards
Minimum Information About Microarray Experiments - MIAME
developed by the MGED working group
• Specify the minimum information that must be reported with gene expression monitoring experiments
• Ensure interpretability of results
• Facilitate potential verification
• Facilitate establishing repositories
• Enable unique data exchange format
• Many scientific journals require data submission following MIAME standards
Web Resources : Public Databases
GEO (http://www.ncbi.nlm.nih.gov/geo/) : (NCBI) Gene expression data repository & online resources
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) : EBI database & analysis tools
Chip DB (http://young39.wi.mit.edu/chipdbpublic) : Whitehead Institute array database
ExpressDB : Harvard, specialized on yeast (and E. coli) data
RAD : RNA Abundance Database
Stanford Microarray Database (http://dnachip.org) : Stanford Database & online tools
MAdb (http://www.cbil.upenn.edu/rad2/servlet) : National Cancer Institute
GXD (http://informatics/jax.org/) : Jackson Laboratory (specialized on mouse)
GEO : http://www.ncbi.nlm.nih.gov/geo (http://www.ncbi.nlm.nih.gov/projects/geo/)
Query / Data Sets : 1574 → GDS1574 record
Endotoxin effect on leukocytes: time course (HG-U133A) [Homo sapiens]
Try Clustering (Hierarchical or k-Means)
Statistical testing : Compare 0h to 2h (or 4h)
Can you find inflammation related genes ?Stat1-6, MAP kinases, Interleukins, NFκB, ILR1&2, TLR1-5, …
ArrayExpress Conceptual Overview (http://www.ebi.ac.uk/arrayexpress)
Experiment : name, type, …
Array : type, elements, …
Sample : origin, prep, …
Hybridization : protocol, …
Expression Value : pos., fluoresc. intens.
Reference : e.g. publication, www
Sample-Ontology : e.g. organism, taxonomy
Nucl & Prot DB : SwissProt, gb, …
Stanford Microarray Database
Analyze printed arrays on Stanford Microarray Database
Open web-browser and go to : http://genome-www5.stanford.edu/
Stanford data base 1
Try to test some parameters that may be useful for filtering out non-informative data (signals too close to background, …)
Are ratios dependent on signal intensity ? (this would indicate a possible need for additional LOWESS normalization…)
Show the effect of normalization on the histogram of the data
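Whether ratios depend on signal intensity can be checked by looking at the log ratio against the mean log intensity of the two channels. The sketch below is a minimal illustration with invented channel values; the names `ch1`, `ch2`, `ma_values` and `intensity_trend` are hypothetical, and a straight-line fit is used only as a simple indicator, not as a replacement for LOWESS.

```python
import numpy as np

def ma_values(ch1, ch2):
    """Log ratio (M) and mean log intensity (A) for two-channel signals."""
    ch1 = np.asarray(ch1, dtype=float)
    ch2 = np.asarray(ch2, dtype=float)
    m = np.log2(ch2 / ch1)            # log2 ratio
    a = 0.5 * np.log2(ch1 * ch2)      # average log2 intensity
    return m, a

def intensity_trend(m, a):
    """Slope of a straight-line fit of M on A; near 0 means no intensity dependence."""
    slope, _intercept = np.polyfit(a, m, 1)
    return slope

ch1 = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])

# constant two-fold ratio: no dependence on intensity
m_flat, a_flat = ma_values(ch1, 2.0 * ch1)
slope_flat = intensity_trend(m_flat, a_flat)

# invented intensity-dependent bias: ratio grows with signal
m_bias, a_bias = ma_values(ch1, ch1 ** 1.1)
slope_bias = intensity_trend(m_bias, a_bias)

print(slope_flat, slope_bias)
```

A slope clearly different from zero, or a curved trend in the M versus A scatter plot, is the situation in which a non-linear normalization would be considered.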
Try to plot : all data, effect of Filter 1, effect of Filter 2, combine Filter 1 & 2
no filtering : n = 37013
Filtering cor > 0.6 : n = 10537
Filtering ch1 S/N : n = 13868
Filtering cor + ch1&2 S/N : n = 6886
use as default …
this may take a few minutes … next : Proceed to Clustering
this may take a few minutes … next : Proceed to Gene Filtering … filters to 412 seq_IDs and 20 arrays
you may try as default …
[Figure : clustered expression profiles; time points 1 h, 6 h, 72 h]
Practical session : Stanford DB
SMD Sub-Cluster List1
SMD_ID | UniGene | GeneID | Genbank
H3009A06 | Mm.29778 | Armet | BG063580
1600000K21 | Mm.29778 | Armet | AV033958
H3130B10 | Mm.227583 | Atp2a2 | BG074044
2010000C19 | Mm.238973 | Atp5b | AV061729
H3056G04 | Mm.260325 | Bst2 | BG067655
UI-M-AO0-acg-a-04-0-UI | Mm.260325 | Bst2 | AI841611
H3098E07 | Mm.277665 | Calb1 | BG071405
H3097C12 | Mm.257765 | Clic4 | BG071310
UI-M-AO0-acf-h-12-0-UI | Mm.298875 | Clta | AI838788
1500040H21 | Mm.41614 | Cmtm5 | AV033409
1810022N06 | Mm.292567 | Creld2 | AV052673
H3023F06 | Mm.296789 | Entpd7 | BG064768
H3060C06 | Mm.88793 | Fga | BG067969
H3154E07 | Mm.3982 | Gas6 | BG076011
L0079H05 | Mm.371563 | H3f3b | AW551507
H3013D08 | Mm.371563 | H3f3b | BG063923
2610017D13 | Mm.371563 | H3f3b | AV112938
1500015A02 | Mm.371563 | H3f3b | AV030583
C0124D04 | Mm.371563 | H3f3b | AW539780
UI-M-AM0-adv-c-03-0-UI | Mm.371563 | H3f3b | AI841334
IMAGE:699319 | Mm.1351 | Hoxc4 | AA245472
H3032A08 | Mm.330160 | Hspa5 | BG065517
H3015H11 | Mm.259672 | Ibrdc3 | BG064140
2310031I05 | Mm.34871 | Id2 | AV087554
H3118H06 | Mm.34871 | Id2 | BG073126
H3098G05 | Mm.34871 | Id2 | BG071421
3200002K22 | Mm.110 | Id3 | AV171079
2310022H23 | Mm.24769 | Ifi47 | AV086656
H3129G05 | Mm.253335 | Ifrg15 | BG074005
H3038A06 | Mm.253335 | Ifrg15 | BG066021
H3135C12 | Mm.211654 | Ifrg15 | BG074451
H3025A01 | Mm.253335 | Ifrg15 | BG064887
H3157D02 | Mm.33902 | Igtp | BG076245
1110065O20 | Mm.33902 | Igtp | AV016524
H3050H06 | Mm.105218 | Irf1 | BG067127
IMAGE:555853 | Mm.285 | Kdr | AI325028
1810059C02 | Mm.3152 | Lgals3bp | AV059520
H3108B10 | Mm.294753 | Litaf | BG072227
1700010C22 | Mm.244890 | Lrrc6 | AV039925
H3027D05 | Mm.788 | Ly6e | BG065103
H3143C12 | Mm.1597 | Mov10 | BG075103
J0806F10 | Mm.270331 | Msi2 | AU040439
1810074C23 | Mm.192991 | Mt1 | AV061461
H3020C02 | Mm.192991 | Mt1 | BG064480
2900001J17 | Mm.192991 | Mt1 | AV149953
H3013D11 | Mm.147226 | Mt2 | BG063925
2310006N23 | Mm.147226 | Mt2 | AV084363
H3107C06 | Mm.213003 | Myd88 | BG072146
H3008E07 | Mm.250418 | Ogfr | BG063542
1020010I05 | Mm.49074 | Parp9 | AV006271
J1021F01 | Mm.17932 | Pnp | AU042511
H3116A06 | Mm.27769 | Pnrc1 | BG072875
H3114E02 | Mm.18347 | Psmd7 | BG072746
H3059F12 | Mm.15793 | Psme2 | BG067921
IMAGE:466539 | Mm.262 | Rhoc | AI324259
Igtp : interferon gamma induced GTPase
Armet : arginine-rich, mutated in early stage tumors
H3f3b : H3 histone, family 3B (H3.3B)
Hoxc4 : homeo box C4
Id2 : inhibitor of DNA binding 2, dominant negative helix-loop-helix protein
Ifrg15 : interferon alpha responsive gene
Irf1 : interferon regulatory factor 1
Mt1 : metallothionein 1
Mt2 : metallothionein 2
Myd88 : myeloid differentiation primary response gene 88
Pnp : purine-nucleoside phosphorylase
Bst2 : bone marrow stromal cell antigen 2
Creld2 : cysteine-rich with EGF-like domains 2
Gas6 : growth arrest specific 6
Psme2 : proteasome (prosome, macropain) 28 subunit, beta
Data - Mining ?
Finding Gene Interactions
• “dimensionality reduction” : find groups of related genes
Coexpression clustering
Mining Promoter elements
Gene Ontology or other knowledge-bases
Connections to public databases (Genbank, Omim, PDB, patents,...)
Metabolic pathways
Interactome networks
Literature & co-citations
Relations to linkage data
Gene Ontology (GO) : http://www.geneontology.org/
Aim : Establishment of a standardized vocabulary with hierarchical organization to describe the role of genes and proteins.
Advantages :
Standardization :
→ clear and distinctive definition of functions
→ comparative genomics at functional level
Brief and unique descriptions, exploitable manually or by programs
3 GO-Ontologies :
• Molecular Function : transcription factor, DNA polymerase, ...
• Biological Process : mitosis, replication, ...
• Cellular Component : localization, complexes, ...
DAVID (http://david.abcc.ncifcrf.gov/) : Clustering of Enriched Gene Ontology (GO) terms
Gene Networks : based on KEGG (http://www.genome.jp/kegg/)
PathwayStudio (http://www.ariadnegenomics.com/products)
Gene Networks : STRING (http://string.embl.de/)
developed at EMBL, SIB and UniZH
Snel B, et al (2000) NAR : STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene.
v Mering C, et al (2007) NAR : STRING 7 - recent developments in the integration and prediction of protein interactions.
Gene Networks : based on Ingenuity Pathway Maps (IPA), a manually mined & curated database of interactions (http://www.ingenuity.com)
Thiersch et al, 2008
More Pathway Programs
Grid (www.dbgrid.org/ )
Mint (mint.bio.uniroma2.it/mint )
Cytoscape (http://www.cytoscape.org)
HiMap (http://www.himap.org/)
GenMAPP (www.genmapp.org)
Bind (www.bind.ca)
BioCarta (www.biocarta.com)
BPS: Biochemical Pathway Simulator(http://www.brc.dcs.gla.ac.uk/projects/bps/links.html)
BioPax (http://cbio.mskcc.org/tools/all.html)
BIOINFORMATIC ANALYSIS OF GENE EXPRESSION IN THE DEGENERATING RETINA OF THE rd1 MOUSE, A MODEL OF RETINITIS PIGMENTOSA
Thierry Léveillard, José Sahel
Laboratoire de Physiopathologie Cellulaire et Moléculaire de la Rétine, Inserm U592, Université Pierre et Marie Curie
Frédéric Chalmel, Olivier Poch
Département de Biologie et Génomique Structurales, Groupe de Bioinformatique et Génomique Intégratives
Christian Lavedan, George Lambrou
Novartis Institutes for Biomedical Research
The retina
[Figure : section of the retina with the direction of incoming light; layers labelled Photoreceptors, Bipolar cells, Ganglion cells, Pigment epithelium, Inner membranes. Source : http://webvision.med.utah.edu/]
2 types of photoreceptors :
Rods (human : 95 %) : peripheral and night vision
Cones (human : 5 %) : sharp, colour and daytime vision
Cone densities in human retina
Normal human retina Retinitis pigmentosa
Eye fundus examination
Prevalence of retinal degenerations
• Retinitis Pigmentosa (RP) : monogenic; 40,000 patients in France
• Age-related Macular Degeneration (AMD) : multifactorial and polygenic; 2 million patients in France; incidence : 10% at 65, 25% at 75, 60% at 90
Retinitis pigmentosa
Monogenic disease (158 known loci); 30,000 to 40,000 patients in France
[Figure : normal and affected human retina]
First symptoms : dark adaptation, peripheral visual field
Loss of rods
15 to 30 years
Complete blindness
Secondary loss of cones
Blindness
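The phototransduction cascade
The rd1 mouse model (and human recessive RP)
[Figure : phototransduction cascade in light and dark; labels as extracted : GC, α, γ, βm, GMP, cGMP; link to apoptosis marked "?"]
Apoptosis (Chang et al, Neuron, 1993; Portera-Caillaux et al, PNAS, 1993)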
The rd1 mouse : a good model for studying human retinitis pigmentosa
1 mutation in βPDE → increased [cGMP]
Abrupt loss of rods from post-natal day (PN) 10 to 35
Progressive loss of cones (3 months)
[Figure : number of photoreceptors (rods, cones) in a retinal section versus post-natal weeks; retinal sections of wild-type and rd1 mice at PN5, PN8, PN9, PN15 and PN35]
The rd1 mouse : a very good model that mimics the human disease
βPDE
Cones need rods to survive
Rod therapies
Cone therapies
Development of therapeutic strategies for retinal degenerations
A model : the rd1 mouse
Mutation → loss of rods → loss of cones → BLINDNESS
Differential analyses :
• Transcriptome
• Proteome
Functional screens :
• Cloning of cone viability factors
• Identification of regeneration signals
• Identification of the ligand of a nuclear receptor (PNR)
Transgenic models :
• Mouse
• Drosophila
Identification of the molecular mechanisms of degeneration and of the endogenous mechanisms of neuroprotection
Therapeutic solutions
Method to study rd1 : Affymetrix DNA chips
Interrogation of 30,000 mRNA transcripts
Each probe cell contains millions of copies of a specific oligonucleotide probe (25 mer)
Each transcript is represented by multiple (20) probe pairs, which together form a probe set
Probe pair : Perfect Match (PM) and MisMatch (MM)
Perfect Match : TGAACTAGCTCGTACCGCTACGGAA
MisMatch : TGAACTAGCTCGXACCGCTACGGAA
[Figure labels : cRNA, oligos]
Mouse Affymetrix DNA chips
wt strain and rd1 strain, sampled at 1d, 4d, 5d, 8d, 9d, 10d, 11d, 12d, 14d, 15d, 35d
36700 clones with 2 x 11 Affymetrix probe sets of expression
[Figure : time line from birth showing rod loss and Rhodopsin expression level; cell number of rods and cones versus post-natal week]
Pre-analysis of the data
36700 probe sets
Removed : clones with maximal expression level < 100
Removed : clones with standard deviation < 10
Remaining : 23504 probe sets
Goals :
• Group genes with close expression profiles
• Select clusters with a significant difference between both strains
• Analyse each clone individually and characterize its respective cluster
Quality Index (W. Raffelsberger RetNet RTN)
• Information about reliability of measurement
- Homogeneity within probe-set
- Replicate measurements
to be considered in combination with degree of gene regulation
• Challenge : data on logarithmic scale containing negative values
• CV (s/μ) : good for signal intensities > 1, distortion/bias at signal intensities -1 < x < +1
• New method based on tanh
- No bias for values < 1
- Same procedure used for probe-set homogeneity and replicates
• Each measure gets a corresponding Quality Index
Replicate Quality Indicator (rqi)
• Comparison of CV (s/μ) and tanh-based method relative to signal intensity
[Figure : rqi versus Signal Intensity; probe-set Quality Indicator versus probe-set SE]
Avoiding Artifacts in Differential Regulation
• Low level gene regulation : in most cases artifacts
• High level gene regulation : true regulation; even if semiquantitative, the biological conclusion remains the same
• Integrated Model
Result : low stringency at high gene regulation, moving gradually to high stringency at low regulation
[Figure : samples 1d wt, 10d wt, 15d wt, 35d wt, 1d rd1, 12d rd1]
Clustering, DPC program (Density of Point Clustering)
22 conditions = 22 dimensions
Clustering step : 23504 Probe Sets → Cluster 1, Cluster 2, Cluster 3, …
[Figure : normalised expression level from day 1 to day 35 for wt strain and rd1 strain]
7 clusters grouping genes with close expression profiles
Cluster 0 : 2483 PS
Cluster 1 : 5535 PS
Cluster 2 : 1768 PS
Cluster 3 : 3904 PS
Cluster 4 : 2323 PS
Cluster 5 : 2459 PS
Cluster 6 : 5032 PS
Clusters selection with differential profile between wt/rd1 expression (Hotelling t2 test)
Cluster | Probe sets (PS) | D
0 | 2483 | 143
1 | 5535 | 21
2 | 1768 | 240
3 | 3904 | 20
4 | 2323 | 432
5 | 2459 | 467
6 | 5032 | 21
Total : 23504 PS
4 selected clusters : 9033 Probe sets
[Figure : normalised expression profiles (scale -2.0 to 2.0; up to 3.0 for Group 4) at days 1, 4, 5, 8, 9, 10, 11, 12, 14, 15, 35 for wt mice and rd1 mice. Panels : Group 0 (2483 clones), Group 2 (1768 clones), Group 4 (2323 clones), Group 5 (2459 clones)]
Most differentially expressed probe set selection
PN35 : wt/rd1 ≥ 3 : 334 clones
PN35 : rd1/wt ≥ 3 : 494 clones
PN9 : rd1/wt ≥ 3 : 251 clones
PN8 : wt/rd1 ≥ 3 : 245 clones
4 groups and 1324 target probe sets
Overview of analysis protocol
Mouse Affymetrix data
→ Clustering step
→ Clusters selection (Hotelling t2 test)
→ 1324 Probe sets
→ Clusters analysis by statistical validation (z score)
Gene Ontology (GO)
Proteins domains
Tissue type & Dev. stage
Human localisation(s)
Large amount of “linked” biological data (genome, 3D complex, organelle, I.P., differential display, “omics” data…)
Common, universal analyses and cross-validation :
* database searches
* complete alignment & differential conservation
* data over-representation (z-score)
……
Specialized analyses, cross-validation, outputs according to the biological link :
* co-expression (data or database)
* synteny, comparative genomics (genome)
* spatial confinement (IP, interactomics…)
* target characterisation (structural project…)
…
Functional Genomics : Gscope platform
Integration, validation and analysis of high throughput information
Gscope : genomic platform
Input : DNA and/or Proteome
Common steps
DNA processing
• ORF location
• External programs (Glimmer, tRNAscan, CodonW...)
Database Searches
• BlastP on Swissprot, TrEmbl, PDB
• BlastX for missing ORFs
• TBlastN on complete genomes
Pipe-Align
Initial GScope database creation
Clustering schemes
General Information Values : ORF (overlap, length, …), GC content, codon usage, Shine-Dalgarno presence, start codon (M, V or L)
Homologue Counts : overall, structures, paralogues, in complete genome
Specialised steps
Sequence validation : homolog detection agreement, validated start codon
Phylogenetic relationships : gene cluster maintenance, gene losses, distance tree analysis
Structural information : domain organisation, production in E. coli / yeast, hydrophobicity index, hydrophobic helices …
Target identification : X-HDA analysis, validated GO (z score), integrated synteny
Alvinella annotation : cDNAs & target characterisation
Data cross-correlation, predictions
Cluster analysis using the Gene Ontology
Question : Is this cluster enriched in a particular Gene Ontology ?
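Statistical estimation with a z score
[Figure : Venn diagram of the sets N, R and n, with overlap r]
N : Nb proteins in all clusters or in all human proteome
R : Nb proteins in a cluster
n : Nb proteins having this GO term
r : Nb proteins in a cluster having this GO term
\[
z = \frac{\text{observed} - \text{expected}}{\text{std. deviation (expected)}}
  = \frac{r - n\,\frac{R}{N}}
         {\sqrt{\,n\,\frac{R}{N}\left(1 - \frac{R}{N}\right)\left(1 - \frac{n-1}{N-1}\right)}}
\]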
z score > 0 : the group is enriched in this GO term
Our threshold for a significant enrichment is z score > 1.5
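The enrichment z score can be computed directly from the four counts N, R, n and r defined on this slide, using the hypergeometric expectation n·R/N and its standard deviation. The sketch below is a minimal illustration; the counts in the example are invented, and the function name `go_z_score` is hypothetical.

```python
import math

def go_z_score(N, R, n, r):
    """z score for over-representation of a GO term in a cluster.

    N: proteins in the reference set, R: proteins in the cluster,
    n: proteins carrying the GO term, r: proteins in the cluster carrying it.
    """
    expected = n * R / N
    variance = n * (R / N) * (1 - R / N) * (1 - (n - 1) / (N - 1))
    return (r - expected) / math.sqrt(variance)

# invented counts: 1000 proteins, cluster of 100, 50 annotated, 15 of them in the cluster
z = go_z_score(N=1000, R=100, n=50, r=15)
enriched = z > 1.5    # threshold quoted on the slide
print(round(z, 2), enriched)
```

Here 5 annotated proteins would be expected in the cluster by chance, so observing 15 gives a z score well above the 1.5 threshold; observing exactly the expected number gives a z score of 0.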
Cluster 2 analysis
[Figure : z scores (cluster and human proteome) of GO terms : ligand-gated ion channel; nucleic acid binding activity; cell differentiation; development; spermatogenesis; RNA processing; energy pathways. Expression profile from day 1 to day 35 in wt and rd1]
Summary : NA binding; Transcription factor; Cell differentiation genes
Cluster 0 analysis
[Figure : z scores (cluster and human proteome) of GO terms : clathrin-coated vesicle; integral to plasma membrane; plasma membrane; integral to membrane; transporter; ion channel; channel/pore class transporter; cation channel; voltage-gated ion channel; lipid binding; sodium ion transport; metal ion transport; cation transport; monovalent inorganic cation transport; ion transport; transport; synaptic transmission; signal transduction; cell communication; cell-cell signaling; intracellular signaling cascade. Expression profile from day 1 to day 35 in wt and rd1]
Summary : Ion Transporter (ex Ca2+); Synaptic function; Membrane
Cluster 5 analysis
[Figure : z scores (cluster and human proteome) of GO terms : mitochondrial membrane; protein transporter; purine nucleotide binding; anion transporter; phosphoric ester hydrolase; carrier; adenyl nucleotide binding; ion transporter; ATPase; hydrolase; GTP binding; transferase; guanyl nucleotide binding; ATP binding; ATPase activity, coupled; intracellular transport; intracellular protein transport; protein transport; response to abiotic stimulus; vision; sensory perception; perception of abiotic stimulus; regulation of cell proliferation; fatty acid metabolism; anion transport; protein modification; P-P-bond-hydrolysis-driven transporter. Expression profile from day 1 to day 35 in wt and rd1]
Summary : Vision (Phototransduction); Vision establishment; GTP, ATP binding, Hydrolase; Ion, protein transport
Cluster 4 analysis
[Figure : z scores (cluster and human proteome) of GO terms : actin cytoskeleton; microfibril; ribonucleoprotein complex; structural constituent of eye lens; structural molecule; endopeptidase; chymotrypsin; trypsin; RNA binding; transmembrane receptor; cAMP-dependent protein kinase; actin binding; response to stress; chromosome organization and biogenesis; establishment and/or maintenance of chromatin architecture; cell adhesion; response to biotic stimulus; response to pest/pathogen/parasite. Expression profile from day 1 to day 35 in wt and rd1]
Summary : Response to stress; Peptidase (ex Kallikrein); Structural protein (ex Crystallin); Cell Adhesion
Integrative kinetic model
Cluster 4 : Response to stress
Cluster 5 : Phototransduction
Cluster 2 : Transcription
Cluster 0 : Synaptic function
[Figure : wt and rd1 cluster profiles over PN days 1, 4, 5, 8, 9, 10, 11, 12, 14, 15, 35, placed relative to Inner Retina, Outer Retina and Synapses]
Chalmel et al. PNAS biology submitted
[Figure : Jagged-1 (RA1906: 3.0; RA1989: 11.3); bar chart (scale 0 to 20) for wt and rd1 at 8 days, 15 days and 35 days]
[Figure : real-time RT-PCR bar charts, relative expression versus 35d rd1, for wt and rd1 (additional lanes labelled Br and PR; ratio 35d wt / 35d rd1). Panels : Arrestin (RA2654: 14.7); Aquaporin 1 (RA2635: 6.3); Adenylate cyclase 6 (RA1742: 4.0); Alkaline phosphatase (RA2627: 13.3); α-Transducin 1 (RA2715: 272); cGMP-gated channel (RA2669: 68.9); HRG4 (RA2673: 5.0); Pde6a (RA1718: 11.7); Pde6b (RA2664: 311; RA2321: 141); Phospholipase A2R (RA2707: 8.3); HSP1a (RA2646: 8.6); RCK (RA2642: 5.0; RA2057: 10.2); CDR2 (RA1795: 3.3; RA2630: 17.7); RPGRIP (RA1953: 4.5; RA2057: 21.6); Shab (RA1730: 7.7; RA2442: 21.3)]
[Figure : Rhodopsin (RA2680: 373; RA2316: 37.1); relative expression versus 35d rd1 (scale 0 to 1000) for wt and rd1 at 8 days, 15 days and 35 days]
First validation : Real-time RT-PCR analysis
Control : known genes
Data : relative expression level versus rd1 retina at PN35. Number after RA entry : ratio wt/rd1 at PN35 as scored by Affymetrix probes
Second validation : In situ hybridization
In situ hybridization performed on frozen retinal sections using riboprobes. The specificity of the signal using antisense probes was controlled using the corresponding sense probe.
Developments : Automated statistical and biological inter-relational analysis
1st step : automatic and standardized classification
- New statistically founded clustering algorithms
- Scoring methods for informational enrichment evaluation
- Automated phylogenetic and transcriptomic promoter analysis
- Data Driven Clustering : ClustPack and Gscope links
Cross-Validation
16CC3 (Q96DA0, Hypothetical protein - 16p13.3)Four Procure targets : hypothetical proteins (EX0062, EX1061, EX0468, EX0127)
Developments : Promoter Analysis
1) validated TSS in various complete or ongoing genomes
2) TRANSFAC® Professional 8.1 (2004-03-31)
Matrix Search for Transcription Factor Binding Sites
High quality multiple alignment of the TSS regions
Phylogenetic footprinting and shadowing
prediction validation, statistical cross-validation …
Androgen Receptor binding site
STRING - Search Tool for the Retrieval of Interacting Genes/Proteins
Laboratory of Integrative Genomics and Bioinformatics, IGBMC, Strasbourg