Combining heterogeneous data to reverse engineer regulatory networks Allan Tucker School of Information Systems Computing and Mathematics, Brunel University, London. UB8 3PH

Post on 13-Jan-2016


Page 1

Combining heterogeneous data to reverse engineer regulatory networks

Allan Tucker
School of Information Systems Computing and Mathematics, Brunel University, London. UB8 3PH

Page 2

Intelligent Data Analysis

• IDA attempts to deal with data explosion to discover patterns and knowledge from data
• Typical analysis tasks:
  • Clustering
  • Classification
  • Feature Selection
  • Prediction and Forecasting
  • Structure identification

Page 3

Bayesian Networks

• An IDA method to model a domain using probabilities
• Easily interpreted by non-statisticians
• Can be used to combine existing knowledge with data
• Essentially use independence assumptions to model the joint distribution of a domain
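As an illustration of the last point (standard Bayesian network notation, not taken from the slides): for variables X_1, ..., X_n, with Pa(X_i) denoting the parents of X_i in the network graph, the independence assumptions let the joint distribution factorise as follows.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

For a hypothetical three-gene chain A → B → C, this gives P(A, B, C) = P(A) P(B | A) P(C | B).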

Page 4

Informative Priors

• To build BNs we can also use prior structures and probabilities
• These are then updated with data
• Priors are usually uniform (equal probability)
• Informative priors are used to incorporate existing knowledge into BNs
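A minimal sketch of how a prior is updated with data, assuming a single binary variable with a Beta prior (a textbook conjugate model chosen for illustration, not the talk's own method). The function name and the counts are hypothetical. It contrasts a uniform prior with an informative one on the same observations.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Posterior mean of a Bernoulli parameter under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


# Hypothetical data: the event was observed in 3 of 4 samples.
# Beta(1, 1) is the uniform prior (every probability value equally likely).
uniform = posterior_mean(1, 1, successes=3, failures=1)
# Beta(2, 8) is an informative prior encoding a belief that the event is rare.
informative = posterior_mean(2, 8, successes=3, failures=1)

print(uniform, informative)
```

With so few observations the informative prior still dominates the estimate; as more data arrive, the two posteriors converge.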

Page 5

Microarray Data

• Major source of data for gene expression activity

• Technology takes measurements over 1000s of genes simultaneously

• Gene Regulatory Networks (GRNs) model how genes interact

• Eliciting reliable GRNs from data is key to understanding biological mechanisms

Page 6

But...

• Reliability issues that surround microarray gene expression data

• Mechanisms in different systems & species

• Can we build GRN models that have enhanced performance, based on a richer and/or broader collection of data than a single microarray dataset?

Page 7

The talk

• Incorporating literature priors
• Consensus networks
• Models of Increasing Complexity
• Interspecies analysis

Page 8

Literature-based priors

• Information about biomedical concepts such as genes summarized using concept profiling (Jelier et al., 2007; Schuemie et al., 2007a)

• Combine information from several databases, including Entrez Gene, Uniprot, and the Saccharomyces Genome Database

• Concept profile is a vector of concepts with weights

• Weight represents uncertainty between occurrence of one concept and another

Steele, E., Tucker, A., 't Hoen, P.A.C. and Schuemie, M.J. (2009), Literature-Based Priors for Gene Regulatory Networks, Bioinformatics 25(14): 1768-1774

Page 9

Literature-based priors

• Perform Pearson correlation on the concept profiles of genes to create a literature matrix

• Translate correlations into probabilities using confidence scores. Each score represents the probability that a particular correlation was not drawn from the distribution of random gene-pair correlations

• Not equal to the probability that an edge exists; see Segal et al. (2002) and Efron (2007)

• Incorporate as a prior, with weight w, into the BIC score:

BIC = w log P(S) + log P(S|D) - 0.5 k log(n)
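The steps above can be sketched in a few lines, under stated assumptions: concept profiles are plain lists of weights, and the log prior and data-fit term are supplied by the caller (the second term is taken exactly as written on the slide). All function names and the example data are hypothetical. The step that turns correlations into confidence scores is not shown, because the slides do not give enough detail to reproduce it.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def literature_matrix(profiles):
    """Correlate every pair of concept profiles (dict: gene -> weight vector)."""
    genes = sorted(profiles)
    return {(g, h): pearson(profiles[g], profiles[h])
            for i, g in enumerate(genes) for h in genes[i + 1:]}


def prior_weighted_bic(log_prior, log_fit, k, n, w):
    """w * log P(S) + data-fit term - 0.5 * k * log(n), following the slide."""
    return w * log_prior + log_fit - 0.5 * k * math.log(n)


# Hypothetical concept profiles for three genes.
profiles = {"g1": [1, 2, 3], "g2": [2, 4, 6], "g3": [3, 2, 1]}
matrix = literature_matrix(profiles)
score = prior_weighted_bic(log_prior=-2.0, log_fit=-10.0, k=3, n=100, w=0.5)
```

Setting w to 0 recovers the score without any literature prior, which makes it easy to compare structures learnt with and without the prior.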

Page 10

The Experiments

• Test our approach on synthetic networks generated using differential equations, and on yeast and E. coli studies with known regulatory structures

• Report on ROC analysis:
  • True Positives: links that are correctly identified
  • False Positives: links that are incorrectly identified
  • False Negatives: links that are missed
  • True Negatives: links that are correctly omitted

• Also report predictive power using cross-validation (CV)
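The four counts in the ROC analysis can be computed by comparing a learnt edge set with the known structure. The sketch below assumes directed edges stored as (parent, child) tuples, no self-loops, and that every ordered pair of distinct nodes is a candidate link. The function name and the example network are hypothetical.

```python
from itertools import permutations


def edge_confusion(nodes, true_edges, predicted_edges):
    """Count TP, FP, FN and TN over all ordered pairs of distinct nodes."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    candidates = set(permutations(nodes, 2))
    tp = len(true_edges & predicted_edges)       # links correctly identified
    fp = len(predicted_edges - true_edges)       # links incorrectly identified
    fn = len(true_edges - predicted_edges)       # links that are missed
    tn = len(candidates - true_edges - predicted_edges)  # links correctly omitted
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical three-gene example.
counts = edge_confusion(
    nodes=["A", "B", "C"],
    true_edges=[("A", "B"), ("B", "C")],
    predicted_edges=[("A", "B"), ("A", "C")],
)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
false_positive_rate = counts["FP"] / (counts["FP"] + counts["TN"])
```

Sweeping a confidence threshold on the learnt edges and plotting sensitivity against false positive rate at each threshold yields the ROC curve.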

Page 11

Yeast and E. coli Network Analysis

• Issues with circularity when validating

Page 12

Predictive accuracy

Page 13

Literature Priors Conclusions

• A literature prior weight of between 0.4 and 0.6 appears to be the best choice for identifying relevant regulatory edges on human data for mechanisms involving Muscular Dystrophy

• Higher prior weights lead to the inclusion of too many edges (literature associations that are not of a regulatory nature)

• This is a lower weight than the optimum prior weights found for yeast and E. coli

• Perhaps because there is less literature on the human organism, whereas yeast and E. coli are both well-studied.

Page 14

Consensus Bayesian Networks

• Different platforms involve different biases, e.g. oligonucleotide arrays estimate the absolute value of expression whereas cDNA arrays measure relative differences between genes.

• Previous research established that comparing datasets using standard normalisation is not straightforward

• An attempt to combine multiple microarray data sources through post-learning aggregation

Steele, E. and Tucker, A. (2008), Consensus and Meta-analysis regulatory networks for combining multiple microarray gene expression datasets, Journal of Biomedical Informatics 41(6): 914-926

Page 15

Consensus Bayes Networks

Page 16

Consensus Bayesian Networks

• Bootstrapping on each dataset to generate robust networks with confidence
• Threshold the confidence and generate a PDAG (due to equivalence classes)
• Consensus looks for edges with enough support in the input networks
• Edge direction is based upon voting of inputs, or left undirected if there is no consensus or if cycles cannot be resolved
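A simplified sketch of the last two steps, assuming each input network has already been thresholded and is given as a set of directed (parent, child) edges with no self-loops. Undirected PDAG edges in the inputs and cycle resolution are deliberately left out, and the function name, `min_support` parameter and example networks are hypothetical rather than the published algorithm.

```python
from collections import Counter


def consensus_edges(networks, min_support):
    """Keep links found in at least min_support networks; orient them by majority vote."""
    support = Counter()  # number of networks containing a link in either direction
    votes = Counter()    # number of networks containing a link in a given direction
    for network in networks:
        links_seen = set()
        for parent, child in network:
            links_seen.add(frozenset((parent, child)))
            votes[(parent, child)] += 1
        for link in links_seen:
            support[link] += 1

    directed, undirected = set(), set()
    for link, count in support.items():
        if count < min_support:
            continue
        a, b = sorted(link)
        if votes[(a, b)] > votes[(b, a)]:
            directed.add((a, b))
        elif votes[(b, a)] > votes[(a, b)]:
            directed.add((b, a))
        else:
            undirected.add((a, b))  # no consensus on direction
    return directed, undirected


# Hypothetical networks learnt from three datasets.
networks = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("C", "B")},
    {("A", "B")},
]
directed, undirected = consensus_edges(networks, min_support=2)
```

Raising `min_support` trades recall for precision: fewer links survive, but each is backed by more datasets.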

Page 17

Consensus Bayes Networks

Page 18

E. coli

Page 19

Yeast

Page 20

Weighting networks

Steele, E. and Tucker, A., Selecting and Weighting Data for Building Consensus Gene Regulatory Networks, Advances in Intelligent Data Analysis VIII: 8th International Symposium on Intelligent Data Analysis (IDA 2009). Lecture Notes in Computer Science, volume 5772: 190-201, 2009

Page 21

c) Models of Increasing Complexity

Specification of three muscle differentiation datasets

Anvar, S.Y., 't Hoen, P.A.C. and Tucker, A. (2010), The Identification of Informative Genes from Multiple Datasets with Increasing Complexity, BMC Bioinformatics 11: 32

Page 22

MIC

• Select one dataset for training
• Others become test sets
• Score mean and variance of the sum of squared errors (SSE) using CV and independent test sets
• Use these to rank genes
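A minimal sketch of the ranking step, assuming each gene already has a list of SSE values collected across CV folds and independent test sets. The slides do not state how mean and variance are combined, so ordering by mean first and variance second is an assumption, and the function name and values are hypothetical.

```python
from statistics import mean, pvariance


def rank_genes(sse_by_gene):
    """Order genes by mean SSE, breaking ties by variance (lower is better for both)."""
    stats = {gene: (mean(values), pvariance(values))
             for gene, values in sse_by_gene.items()}
    return sorted(stats, key=lambda gene: stats[gene])


# Hypothetical SSE values per gene across CV folds and test sets.
sse_by_gene = {
    "g1": [1.0, 1.0],
    "g2": [0.5, 0.7],
    "g3": [1.0, 3.0],
    "g4": [0.0, 2.0],
}
ranking = rank_genes(sse_by_gene)
```

Genes near the top of the ranking are both well predicted on average and consistently predicted across datasets.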

Page 23

MIC - Datasets

• All concerned with the differentiation of cells into the muscle (myogenic) lineage
• In-vitro system mimics the formation of new muscle fibres in-vivo
• Cao uses embryonic fibroblasts; the others use a tumor cell line that has the potential for differentiation into different lineages (mainly muscle and bone)
• Cao uses MyoD and MyoG to force cell differentiation (the others use serum starvation)
• Sartorelli includes different treatments that affect timing and efficiency

Page 24

MIC

Select genes using one dataset (black) at a time and compare the average CV error rate of a BN classifier learnt on the same dataset and validated on the other two datasets independently (grey).

• Cao does well on CV but overfits
• Tomczak does well on both

Page 25

MIC

• Select 100 informative (KS test) and 50 uninformative genes.
• Train BN classifier on Tomczak and test on Sartorelli.
• Rank genes according to average error rate.
• Score average improvement or deterioration of Myogenesis-Related, Top 100 and 50 randomly selected genes in Sartorelli
• Compare our method with rankings generated by concordance model.

Page 26

MIC Conclusions

• Genes from the pool of differentially expressed genes that are highly predictive and consistent across independent datasets are more likely to be fundamentally involved in the biological process under study
• Results imply that gene regulatory networks identified in simpler systems can be used to model more complex biological systems

Page 27

MIC Conclusions

• e.g. muscle differentiation: myogenesis-related network is difficult to derive from in vivo experiments due to presence of multiple cell types and higher biological variation

• But may become evident after initial training of the network on the cleaner in vitro experiments

Page 28

Inter-species Mechanisms

Page 29

Summary

• Explored a number of novel techniques for building more reliable GRNs:
  • Incorporating exogenous knowledge in the form of BN priors constructed from biological abstracts
  • Consensus algorithms for post-learning aggregation of data / networks
  • Models of increasing complexity for identifying genes that are more confidently associated with a biological process

• Future work – extending MIC to inter-organism mechanisms

Page 30

Thanks

Dr Emma Steele, previously Brunel

Mr Yahya Anvar & Dr Peter-Bram 't Hoen, Leiden University Medical School, Netherlands