Statistical Data Fusion to Prioritize Lists of Genes
Bert Coessens, Stein Aerts
Departement ESAT - SCD, Katholieke Universiteit Leuven
Promotor: Bart De Moor
Assessor: Yves Moreau
Database Issues in Biological Databases (DBiBD), January 8-9, 2005
Context
Linkage analysis
Positional cloning
NEFL, RAB7, GARS, GJB1, LMNA
High-throughput technologies
Concept
Pathology / Biological process / …
- Gene expression
- Literature
- Anatomical expression
- Gene regulation
- Protein domains
- Functional annotation
- Evolutionary conservation
- …
Model with multiple submodels
- Training genes (training set): choose submodels, then TRAIN
- Candidate genes (test set): SCORE each gene i
- One ranking for each submodel
- Order statistics
- Combined ranking
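The step from submodel scores to rankings can be sketched as follows. This is a minimal illustration, assuming that a rank ratio is a gene's rank divided by the number of ranked candidates and that a higher score means greater similarity to the training set. The function name `rank_ratios` and the nested-dictionary layout are hypothetical, not part of the Endeavour application.

```python
def rank_ratios(scores_by_submodel):
    """Convert per-submodel scores into per-gene rank ratios.

    scores_by_submodel: {submodel_name: {gene: score}} (hypothetical layout).
    Assumption: higher score = more similar to the training set.
    Returns {gene: {submodel_name: rank / number_of_candidates}}.
    """
    ratios = {}
    for submodel, scores in scores_by_submodel.items():
        # Best-scoring gene first, so it receives rank 1.
        ordered = sorted(scores, key=scores.get, reverse=True)
        total = len(ordered)
        for position, gene in enumerate(ordered, start=1):
            ratios.setdefault(gene, {})[submodel] = position / total
    return ratios


if __name__ == "__main__":
    example = {
        "expression": {"A": 0.9, "B": 0.1},
        "annotation": {"A": 1.0, "B": 5.0},
    }
    print(rank_ratios(example))
```

Each gene ends up with one rank ratio per submodel, which is the input the order statistics need.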
Order Statistics
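Given a set of n rank ratios for gene i
- what is the probability of getting these ratios by chance alone?
Joint probability density function of all n order statistics, with the rank ratios sorted so that $r_1 \le r_2 \le \dots \le r_n$:
$$Q(r_1, r_2, \ldots, r_n) = n! \int_0^{r_1} \int_{s_1}^{r_2} \cdots \int_{s_{n-1}}^{r_n} ds_n \, ds_{n-1} \cdots ds_1$$
Recursive formula:
$$V_k = \sum_{i=1}^{k} (-1)^{i-1} \frac{V_{k-i}}{i!} \, r_{n-k+1}^{i}, \qquad V_0 = 1, \qquad Q(r_1, \ldots, r_n) = n! \, V_n$$
Complexity O(n²)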
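The recursion on this slide can be evaluated directly. The sketch below is illustrative only and assumes the rank ratios are sorted in ascending order, that $V_0 = 1$, and that $Q = n!\,V_n$. The function name `q_statistic` is hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """Order-statistics Q for one gene, via the O(n^2) recursion.

    V_k = sum_{i=1..k} (-1)^(i-1) * V_{k-i} / i! * r_{n-k+1}^i, with V_0 = 1,
    and Q = n! * V_n. Rank ratios are sorted ascending before use.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0] + [0.0] * n              # v[0] = V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]                   # r_{n-k+1} in 1-based notation
        v[k] = sum(
            (-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
            for i in range(1, k + 1)
        )
    return factorial(n) * v[n]


if __name__ == "__main__":
    # A gene ranked in the top half by two submodels.
    print(q_statistic([0.5, 0.5]))
```

For two rank ratios the recursion gives $Q = 2 r_1 r_2 - r_1^2$, which matches evaluating the double integral by hand.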
Statistical Validation: Setup
29 lists of disease genes from OMIM
5 lists of random genes from the human genome
For each disease or random gene set do:
  For each gene in the set do:
    a. Leave one gene out
    b. TRAIN all submodels on the set minus the left-out gene
    c. Create a test set by adding the left-out gene to [9, 49, 99] random genes
    d. SCORE the test set with all trained submodels
    e. RANK the genes in the test set according to their order statistics p-value
  end
end
Calculate for a certain cut-off x the number of
- TP: number of left-out genes ranked above x
- FP: number of genes other than the left-out gene ranked above x
- TN: number of genes other than the left-out gene ranked below x
- FN: number of left-out genes ranked below x
Calculate sensitivity and specificity from these values, plot (1 - specificity) versus sensitivity to obtain a Rank ROC plot, and calculate the area under the curve.
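The counting scheme above can be sketched as follows, given the rank that the left-out gene obtained in each leave-one-out run. This is an illustration under stated assumptions: "ranked above x" is read as rank ≤ x with rank 1 the best, every run uses the same test set size, and the area is computed with the trapezoid rule. The names `rank_roc` and `auc` are hypothetical.

```python
def rank_roc(ranks, test_set_size):
    """Rank ROC points from leave-one-out runs.

    ranks: 1-based rank of the left-out gene in each run.
    test_set_size: genes per test set (left-out gene plus random genes).
    Returns a list of (1 - specificity, sensitivity) points.
    """
    runs = len(ranks)
    points = [(0.0, 0.0)]
    for x in range(1, test_set_size + 1):
        tp = sum(1 for r in ranks if r <= x)   # left-out genes in the top x
        fn = runs - tp                         # left-out genes below x
        fp = runs * x - tp                     # other genes in the top x
        tn = runs * (test_set_size - 1) - fp   # other genes below x
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1 - specificity, sensitivity))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Three runs with 10-gene test sets; the left-out gene ranked 1st, 2nd, 5th.
    print(auc(rank_roc([1, 2, 5], 10)))
```

An area of 1 means the left-out gene always ranked first, and 0.5 corresponds to random ranking.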
Statistical Validation: Disease genes
- 29 human diseases (OMIM) = 29 gene sets
- 627 disease genes with Ensembl identifier in total
- average gene set contains 19 genes
- smallest gene set = ALS with 4 genes
- largest gene set = leukemia with 113 genes
Statistical Validation: Submodels
Textual data: TXTGate
Sequence similarity: BLAST
- Rank genes according to e-value
- Example: Presenilin 1 vs. Presenilin 2, e-value = $10^{-133}$
Functional annotation: GO
Functional annotation: KEGG
- Set of genes: GO IDs, observed frequencies
- Full genome: GO IDs, expected frequencies
Protein information: InterPro
Protein information: BIND
- Training genes + interaction partners
- Test gene + interaction partners
- Overlap?
Gene expression: Microarray data
Gene expression: ESTs
- Model is the average expression profile of the training genes
- Score a test gene by calculating its Pearson correlation with the model
Human gene expression atlas: Su et al., 47 normal human tissues
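The microarray submodel described here can be sketched in a few lines. This is a minimal illustration, assuming each gene is represented by a list of expression values across the same tissues in the same order. The function names `mean_profile` and `pearson` are hypothetical, and the numbers are made up for the example.

```python
from math import sqrt


def mean_profile(profiles):
    """Average expression profile of the training genes (one value per tissue)."""
    count = len(profiles)
    return [sum(column) / count for column in zip(*profiles)]


def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / sqrt(var_a * var_b)


if __name__ == "__main__":
    training = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]   # toy profiles, 3 tissues
    model = mean_profile(training)                   # TRAIN
    print(pearson(model, [10.0, 20.0, 30.0]))        # SCORE a test gene
```

Test genes are then ranked by this correlation, with the highest correlation ranked first. A profile with zero variance would need separate handling, because the denominator becomes zero.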
Cis-regulatory elements: TFBSs
Cis-regulatory elements: TFBS modules
- Check human-mouse CNS blocks in upstream sequence of a test gene
- Compare found motifs with motifs in training set
ModuleSearcher: searches for the best combination of 3 TFs in the 300 bp upstream sequence (US) of the genes in the training set
ModuleScanner: scores a test gene with the model
Statistical Validation: Similarity
Statistical meta-analysis
Fisher's method: assume there are m independent tests of H0.
1. For the i-th test, calculate the corresponding p-value, $p_i$.
2. If each $p_i$ has a uniform distribution on [0,1], then $-2 \sum_{i=1}^{m} \log p_i$ has a $\chi^2_{2m}$ distribution.
Vector-based similarity (vectors T1, T2, T3)
- Euclidean distance
- Pearson correlation
- Cosine similarity
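Fisher's method as stated on this slide can be sketched without external libraries. A chi-square distribution with an even number of degrees of freedom 2m has the closed-form upper tail $e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$, and the code below uses it. The function name `fisher_combined_pvalue` is hypothetical, and the sketch assumes every p-value lies in (0, 1] and that the tests are independent.

```python
from math import exp, factorial, log


def fisher_combined_pvalue(pvalues):
    """Combine m independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2m degrees of freedom under H0. For even degrees of freedom the
    upper tail is exp(-X/2) * sum_{k<m} (X/2)^k / k!.
    """
    m = len(pvalues)
    half_statistic = -sum(log(p) for p in pvalues)   # X / 2
    return exp(-half_statistic) * sum(
        half_statistic ** k / factorial(k) for k in range(m)
    )


if __name__ == "__main__":
    print(fisher_combined_pvalue([0.05, 0.05]))
```

With a single test the combined p-value equals the input p-value, which is a quick sanity check on the tail formula.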
Statistical Validation: Correlation
Statistical Validation: Rank ROC
Statistical Validation: Submodel Rank ROC
Statistical Validation: Bias towards known genes
Endeavour Application: Screenshot
Endeavour Application: Architecture
ESAT Web server
Linux cluster
Java RMI
SOAP messages
Conclusions and Future
Conclusions
- Allows integration of heterogeneous data
- Solves problem of uncertainty
- Solves multiple testing problem (Bonferroni correction)
- Allows for cut-offs with statistical significance
Future
- Different weighting for different submodels
- Explore mathematical modeling techniques (neural nets, SVM)
- Add more information models
- Define best combination of submodels
Acknowledgements
Bart De Moor, Stein Aerts, Yves Moreau
Patrick Glenisson, Steven Van Vooren, Joke Allemeersch
Endeavour Application: Demo
Load training set
Add submodels
Train submodels
Load candidate genes
Score candidate genes with all submodels
Results of scoring
Ranking visualized in sprintplot