Jonathan Keith - Exploring the structure of whole-genome conservation profiles using Bayesian segmentation

Upload: australian-bioinformatics-network

Post on 10-May-2015

Category:

Science


DESCRIPTION

Conservation is a key indicator of function in genomes, and can potentially be used to discover novel functional non‐protein‐coding RNAs and regulatory sequences. However, recent investigations have demonstrated that a simple dichotomy between conserved and non‐conserved sequence is too naïve a distinction to reflect the full complexity of the numerous types of structural and functional constraints acting on genomes. This presentation will discuss recent investigations into the detailed structure of whole‐genome conservation profiles, using Bayesian segmentation techniques to identify multiple classes of conservation level. By integrating information about conservation with profiles of other properties indicative of function, including GC content and transition/transversion ratios, a much finer level of structure can be detected. The method has been applied to a range of species including Drosophila, zebrafish, malaria and bacterial genomes, and results from each of these will be presented. One key implication of these results is that the proportion of functionally constrained sequence in eukaryotic genomes may be very much larger than previously supposed. Another key implication is that genomic sequences may be subject to ephemeral functional constraints that act on too short a time scale to be detected in most comparative genomic studies. The functional content of various classes of conserved sequence will also be discussed. First presented at the 2014 Winter School in Mathematical and Computational Biology: http://bioinformatics.org.au/ws14/program/

TRANSCRIPT

1. Exploring the Structure of Whole-Genome Conservation Profiles using Bayesian Segmentation. Jonathan M Keith. Mathematical & Computational Biology Winter School, Brisbane, July 8, 2014

2. Outline
    Introduction: DNA/RNA/protein/ncRNAs
    Genome segmentation
    Incorporating multiple data types into segmentation
    Generalised Gibbs Sampler
    Applications: The proportion of functional sequence in genomes; Investigating alternative splicing; Complexity of Drosophila 3′ UTRs; Non-coding RNAs in zebrafish; Non-coding RNAs in Wolbachia; Regions contributing to malaria pathogenicity and host specificity; Transcription factor binding sites in zebrafish
3. DNA
4. The Central Dogma of Molecular Biology
5. Gene structure
6. Non-coding RNA
7. Non-protein-coding RNA
8. Conservation
9. Bayesian Genome Segmentation. Input: sequence of characters from a finite alphabet. Output: segmentations, classifications.
10. Segmenting what?
    GC content: binary, 1 = GC, 0 = AT
    Pairwise alignment: binary, 1 = match, 0 = mismatch
    Multiple sequence alignment: column-wise counts of most frequent character; column-wise parsimony score given phylogeny
    Pairwise conservation + GC content
    Pairwise content + GC content + indel frequency
11. Segmenting an alignment. Alignment encoded by a 32-character alphabet.
    Example:
    Human: GCCGA--
    Mouse: GTC-A--
    Zf   : ATTAATG
    S    : xZxIaJJ
    Encoding table:
    Species 1: A A A A A A A A A A A A A A A A C C C C C C C C C C C C C C C C
    Species 2: A A A A C C C C G G G G T T T T A A A A C C C C G G G G T T T T
    Species 3: A C G T A C G T A C G T A C G T A C G T A C G T A C G T A C G T
    Encoding : a b c d e f g h i j k l m n o p q r s t u v w x y z U V W X Y Z
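The encoding table on slide 11 can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. It assumes that a column whose first base is G or T is folded onto its complement, which is consistent with the table listing only A and C for species 1 and with the example row S. The function names `encode_column` and `encode_alignment` are hypothetical, and the gap symbols I and J from the example are not modelled: columns containing a gap simply return `None`.

```python
import string

# 26 lowercase letters followed by U..Z give the 32 symbols shown in the table.
ALPHABET = string.ascii_lowercase + "UVWXYZ"
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode_column(s1, s2, s3):
    """Map one gap-free alignment column to a symbol, or None if it has a gap."""
    column = (s1, s2, s3)
    if any(base not in BASES for base in column):
        return None  # gap handling (I, J on the slide) is deliberately left out
    if s1 in "GT":
        # Assumed rule: fold the column onto its complement so that
        # species 1 is always A or C, as in the encoding table.
        column = tuple(COMPLEMENT[base] for base in column)
    first, second, third = column
    index = 16 * "AC".index(first) + 4 * BASES.index(second) + BASES.index(third)
    return ALPHABET[index]


def encode_alignment(row1, row2, row3):
    """Encode a three-way alignment column by column."""
    return [encode_column(*column) for column in zip(row1, row2, row3)]


# The example alignment from the slide (human, mouse, zebrafish).
encoded = encode_alignment("GCCGA--", "GTC-A--", "ATTAATG")
print(encoded)
```

For the gap-free columns of the slide's example this yields x, Z, x and a, matching the corresponding positions of S.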
12. A Bayesian model. Binary sequences encoding matches and mismatches are assumed to have been generated by a process involving the following parameters:
    $S$: binary sequence
    $k$: number of change-points
    $c = (c_1, \ldots, c_k)$: vector of change-points
    $\theta = (\theta_0, \ldots, \theta_k)$: vector of conservation levels for each segment
    $\phi$: probability that any given sequence position is a change-point
    $g = (g_0, \ldots, g_k)$: vector containing, for each segment, the number of the group to which it belongs
    $\alpha$: parameters of the beta distributions for each group
    $\pi$: proportion of segments in each group
    The algorithm generates segmentations $(k, c)$ and group parameters $(\alpha, \pi)$ and uses them to generate profiles for each group.
13. Model parameters. Bayesian hierarchical model includes the parameters shown. Integrate over $\theta$ and $\phi$, sum over $g$, sample $k$, $c$, $\alpha$ and $\pi$.
14. [no text extracted]
15. [no text extracted]
16. Example: Subset-based sampling [figure]
17. Generalized Gibbs sampler [figure]
18. Generalized Gibbs sampler
    $\mathcal{X}$: target set
    $\mathcal{I}$: index set (move types)
    $\mathcal{U} \subseteq \mathcal{I} \times \mathcal{X}$
    $\mathcal{Q}(x)$: elements of the form $(i, x)$
    $Q_x$: transition matrix on $\mathcal{Q}(x)$
    $q_x$: stationary distribution for $Q_x$
    $\mathcal{R}(i, x)$: elements that can be reached from $x$ by move type $i$
    [Diagram of $\mathcal{I}$, $\mathcal{X}$, $\mathcal{U}$, $\mathcal{R}(i, x)$ and $\mathcal{Q}(x)$]
19. Generalized Gibbs Sampler (discrete space). Starting with an arbitrary $U_0$, perform the following steps iteratively:
    1. [Q-step] Given $U_n = (i, x)$, generate $(j, x) \in \mathcal{Q}(x)$ using $Q_x((i, x), \cdot)$.
    2. [R-step] Given $(j, x)$, generate $W \in \mathcal{R}(j, x)$ using $R((j, x), \cdot)$.
    3. Let $U_{n+1} = W$.
    $$R((i, x), (j, y)) = \frac{p(y)\, q_y(j)}{\sum_{(k, z) \in \mathcal{R}(i, x)} p(z)\, q_z(k)}$$
20. For further details
21. Order of updates
22. Model Selection
23. Model selection
    AIC approximation: $2k - 2 \ln L$
    BIC approximation: [formula garbled in extraction]
    DICV
24. Genome-wide conservation levels. Oldmeadow et al. (2010) Mol. Biol. Evol. 27(4): 942-953
25. Investigating alternative splicing. Boyd et al. (2012) PLoS ONE 7(3):e33565
26. Complexity of Drosophila 3′ UTRs. Algama et al. (2014) PLoS ONE 9(5):e97336
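The generative model from slide 12 can be made concrete with a short forward simulation. The sketch below is only an illustration under stated assumptions: each position after the first starts a new segment with a fixed probability, each segment is assigned to a group according to the group proportions, and the segment's conservation level is drawn from that group's beta distribution. The function name `simulate_profile`, its arguments and the toy parameter values are hypothetical. It simulates data from the model and does not perform the inference the talk describes.

```python
import random


def simulate_profile(length, phi, groups, weights, seed=0):
    """Draw a binary match/mismatch sequence from a change-point model.

    phi     -- probability that a position is a change-point
    groups  -- list of (a, b) beta parameters, one pair per group
    weights -- proportion of segments in each group
    """
    rng = random.Random(seed)
    change_points = []  # positions where a new segment starts
    labels = []         # group label of each segment
    thetas = []         # conservation level of each segment
    sequence = []       # 1 = match, 0 = mismatch

    for position in range(length):
        # Position 0 always opens the first segment; every later position
        # is a change-point with probability phi.
        if position == 0 or rng.random() < phi:
            if position > 0:
                change_points.append(position)
            group = rng.choices(range(len(groups)), weights=weights)[0]
            a, b = groups[group]
            labels.append(group)
            thetas.append(rng.betavariate(a, b))
        # Each character is a match with the current segment's conservation level.
        sequence.append(1 if rng.random() < thetas[-1] else 0)

    return sequence, change_points, labels, thetas


# Toy settings: a weakly conserved group and a strongly conserved group.
seq, cps, labels, thetas = simulate_profile(
    length=2000, phi=0.01, groups=[(2, 8), (8, 2)], weights=[0.7, 0.3]
)
print(len(seq), "positions,", len(cps), "change-points")
```

A segmentation method would take a sequence like `seq` as input and try to recover the change-points and group memberships that produced it.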
27. Muscle Development Genes Project. Objective: method development in finding putative functional regions in 13 muscle genes: EYA1, SIX1, EYA4, SIX4, MYF5, WNT1, MYF6, WNT7a, MYOD1, PAX3, MYOG, PAX7, SHH
28. Generating Input Sequence
    1. Pairwise alignment
    Human: A A A A C C C C G G G G T T T T -    N
    Mouse: A C G T A C G T A C G T A C G T N    -
    Code : a b c d e f g h i j k l m n o p skip I
    2. Multiple alignment
    In 3-way alignment, columns with complementary bases are encoded using same letters. Human indels are skipped. ZF/Mouse indels are encoded with I.
29. Results: EYA4. EYA4: 311,523 bp and 20 exons. [Figures: model selection; conservation levels (proportion of characters a, f, k, p in segments)]
30. Identification of conserved non-coding sequences. Identifying putative functional elements (PFEs) in 13 muscle genes. Application of changept on eya1, 3-way alignment (human, mouse, zebrafish). [Figure: conservation levels 50%, 65%, 45%]
31. Results 3way. Columns: gene; # PFEs; # PFEs matched with EvoFold; # PFEs matched with DNAse-footprints; # PFEs matched with fRNAdb nc transcripts. Empty cells were lost in extraction, so some matches cannot be assigned to a column.
    EYA1: 6 PFEs; EvoFold 5; DNAse-footprints 6; fRNAdb 1 (Transcript: 3679, PFE: 169)
    EYA4: 2 PFEs; 1 match (column not recoverable)
    SHH: 4 PFEs; EvoFold 1; DNAse-footprints 1
    PAX3: 9 PFEs; EvoFold 6; DNAse-footprints 4; fRNAdb 2 (Transcript: 1521, PFE: 126, 127)
    PAX7: 6 PFEs; EvoFold 4; DNAse-footprints 3
    MYF5: 1 PFE; 1 match (column not recoverable)
    SIX4: 1 PFE; 1 match (column not recoverable)
    TOTAL: 29 PFEs; EvoFold 17; DNAse-footprints 16; fRNAdb 3
32. PCR Results. Muscle Genes Project: Lab Results. Expression was determined from pooled 24 hpf zebrafish cDNA.
33. Bacterial Genome Project: wMel & wPip. Aligned using Mauve alignment program, which takes genomic re-arrangements into account. [Figures: model selection (AIC, DICV); conservation levels]
34. wMel & wPip: Results. Reference species: wMel: 1.27 mil bp & 1309 genes. 11 tRNAs and 19 non-coding regions were identified with thresholds:
    1. Conservation > 0.95
    2. Profile value > 0.75
    3. Segment length > 50 bp
    [Figure: WIG profile of the most conserved group]
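The three thresholds listed on slide 34 amount to a simple filter over segments. The sketch below shows one way to express it, assuming each segment carries a conservation level, a profile value for the most conserved group, and start and end coordinates. The `Segment` class, the `select_candidates` function and the example values are hypothetical and are not taken from changept or from the talk's data.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int           # 0-based start, inclusive
    end: int             # 0-based end, exclusive
    conservation: float  # proportion of matching columns in the segment
    profile: float       # profile value for the most conserved group

    @property
    def length(self):
        return self.end - self.start


def select_candidates(segments, min_conservation=0.95, min_profile=0.75, min_length=50):
    """Keep segments that exceed all three thresholds from the slide."""
    return [
        seg
        for seg in segments
        if seg.conservation > min_conservation
        and seg.profile > min_profile
        and seg.length > min_length
    ]


# Made-up segments for illustration only.
segments = [
    Segment(0, 120, 0.97, 0.90),    # passes all three thresholds
    Segment(120, 150, 0.99, 0.95),  # too short
    Segment(150, 400, 0.80, 0.85),  # conservation too low
    Segment(400, 520, 0.96, 0.60),  # profile value too low
]
for seg in select_candidates(segments):
    print(seg.start, seg.end, seg.conservation, seg.profile)
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the list of candidate regions is to each cut-off.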
35. Identification of conserved non-coding sequences
    Identifying putative ncRNAs in two bacteria genomes, wMel and wPip
    Identified 17 tRNAs, 2 rRNAs, 2 ncRNAs, 2 pseudogenes and 19 intergenic regions with no previous annotations
    Discovery of small non-coding RNAs from the obligate intracellular bacterium Wolbachia pipientis (target journal: Parasites and Vectors)
    [Figure labels: tRNA, ncRNA]
36. wMel & wPip: Results. changept mainly used to search for 6 intergenic regions in wMel genome, identified as being transcribed using RACE (Rapid amplification of cDNA ends) method. Currently conducting lab experiments.
37. Genomic regions contributing to malaria pathogenicity and host specificity. The shared evolutionary history that has honed the ability of malaria parasites to use haemoglobin as a fundamental resource, coupled with complex and divergent vertebrate immune systems that work to preclude access to the resource, suggests that malarial genomes are a mosaic of conserved and divergent regions that reflect this tug-of-war. Portions of the genome that are directly affected or recognized by the hosts' immune systems or are involved in infecting a host cell are likely to be divergent across parasites that specialise on different host species. On the other hand, portions that are fundamental for the parasite's ability to use haemoglobin as a primary resource are likely to be conserved. We apply a Bayesian segmentation model to a three-way whole-genome alignment of Plasmodium falciparum (human malaria), P. reichenowi (chimpanzee malaria), and P. gallinaceum (chicken malaria). Seeking novel regions relevant to drug targets.
38. Identifying TFBS. Regions upstream of the zebrafish muscle development genes mentioned earlier that were identified as conserved contain regulatory elements. Currently attempting to use the technique to identify TFBS genome-wide.
39. Acknowledgments
    Queensland University of Technology: Kerrie Mengersen, Chris Oldmeadow
    University of Queensland: Peter Adams, Dirk Kroese, Darryn Bryant, Benjamin Goursaud, Rachel Crehange
    Institute for Molecular Biosciences: John Mattick, Stuart Stephen
    Monash University: Manjula Algama, Edward Tasker, Robert Bryson-Richardson, Caitlin Johnston, Adam Parslow, Beth McGraw, Jean Popovici, Meg Woolfit, Anders Goncalves da Silva
    Australian Research Council Research Grants DP0879308, DP1095849
    Email: [email protected]