Initial Proposal for the RNA Alignment Ontology. Rob Knight, Dept Chem & Biochem, CU Boulder


Post on 20-Dec-2015






  • Slide 1
  • Initial Proposal for the RNA Alignment Ontology Rob Knight Dept Chem & Biochem CU Boulder
  • Slide 2
  • What do we want to do?
      • Represent detailed structural info and other metadata on the alignment
      • Avoid horizontal and vertical expansion
      • Explicitly annotate correspondences at the level where they occur
  • Slide 3
  • What do alignments look like now?
  • Slide 4
  • Why is this a problem?
  • Slide 5
  • so real alignments look like this, to shoehorn everything into columns that are assumed to be homologous
  • Slide 6
  • Homology is problematic
      • Fundamental problem: systems that are homologous at one level are not necessarily homologous at other levels
      • E.g. bat wings and bird wings: homologous as pentadactyl limbs, but not homologous as wings
      • Homology is hierarchical and can partially overlap at any level (e.g. Griffiths 2006)
      • [Figure, from Ridley, Evolution, 3rd ed.: bat, bird, frog and rodent forelimbs, grouped as mammal forelimbs and tetrapod forelimbs]
  • Slide 7
  • ...and correspondence need not be homology at all!
      • Example from SELEX: hammerhead ribozymes independently evolved at least three times: in nature, and in Jack Szostak's and Ron Breaker's labs
      • However, we still want to be able to align the functionally equivalent sequences, although there is no evolutionary relationship
  • Slide 8
  • So what are we going to use the alignment ontology for?
  • Slide 9
  • Use case 1: aligning rRNA
  • Slide 10
  • Problem: have millions of fragments, want to align (incl. noncanonical pairs) + assign named regions
  • Slide 11
  • Solution
      • Use existing alignment, try to fit new seqs in
      • Would be improved if we could explicitly annotate helices, noncanonical pairs, etc. on the sequence overall
      • For display, need to easily show/hide groups of sequences and/or regions of the sequence
  • Slide 12
  • Use case 2: SELEX
      • From a large number of unaligned sequences, want to identify motifs like this (Majerfeld & Yarus 2005)
  • Slide 13
  • How is this currently done?
      • Find regions that are similar in more sequences than chance
      • Group these sequences centered on the motif
      • See if the parts of the motif can be related by helices
      • See if anything else is reliably found by the motif
      • Repeat for other families and see if there are relationships between them
      • Group these families together, then iterate
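The first two steps above can be sketched in Python. This is only an illustration, not the method used in the talk: it substitutes exact k-mer matching for "similar regions", and it assumes a uniform, independent-base background when estimating chance. The function names, the `fold` and `min_sequences` thresholds, and the toy pool are all hypothetical.

```python
from collections import Counter


def sequences_containing(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in seqs:
        # A set ensures each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def expected_by_chance(seqs, k):
    """Approximate expected number of sequences containing one given k-mer.

    Assumes uniform, independent bases and treats overlapping windows as
    independent, so this is a rough background estimate only.
    """
    p_window = 0.25 ** k
    return sum(1 - (1 - p_window) ** max(len(s) - k + 1, 0) for s in seqs)


def enriched_kmers(seqs, k, fold=2.0, min_sequences=2):
    """k-mers present in more sequences than the chance estimate suggests."""
    counts = sequences_containing(seqs, k)
    threshold = max(min_sequences, fold * expected_by_chance(seqs, k))
    hits = [kmer for kmer, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda kmer: (-counts[kmer], kmer))


def group_on_motif(seqs, motif):
    """Pair each motif-containing sequence with the motif's offset,
    so the group can be stacked centered on the motif."""
    return [(s, s.find(motif)) for s in seqs if motif in s]


# Toy pool (made-up sequences, for illustration only).
pool = ["AAGGCUCGAA", "CCGGCUCGUU", "UUGGCUCGCC", "ACGUACGUAC"]
motifs = enriched_kmers(pool, 6)
family = group_on_motif(pool, motifs[0])
```

The later steps (relating motif parts by helices, merging families) need structural information that this sketch does not model.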
  • Slide 14
  • e.g. here we discovered that an unpaired G is important
  • Slide 15
  • So how do we handle all this? A proposal
      • Entities:
          • sequence_region: a thing that defines a set of bases relative to some sequence (i.e. with indices for each base)
          • paired_sequence_region: two regions linked by pairs
          • helical_sequence_region: two regions completely paired
          • base: region that consists of a single nucleotide
          • base_pair: region that consists of two paired bases
          • canonical_base_pair: base pair that is cis-WW
          • loop: contiguous sequence_region stretching from i to j such that i-1 and j+1 are a base pair
          • etc. (bulge, internal_loop, junction, etc.)
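A few of these entities could be modelled as follows. This is a minimal sketch, not part of the proposal: the class and field names, the 0-based indexing, and the `lw_class` string labels are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SequenceRegion:
    """A set of bases, given as indices into one sequence (0-based, assumed)."""
    sequence_id: str
    indices: Tuple[int, ...]

    def is_contiguous(self):
        return all(b - a == 1 for a, b in zip(self.indices, self.indices[1:]))


@dataclass(frozen=True)
class Base(SequenceRegion):
    """A region consisting of a single nucleotide."""

    def __post_init__(self):
        if len(self.indices) != 1:
            raise ValueError("a base has exactly one index")


@dataclass(frozen=True)
class BasePair(SequenceRegion):
    """A region of two paired bases; lw_class is a hypothetical label field."""
    lw_class: str = "cis-WW"

    def __post_init__(self):
        if len(self.indices) != 2:
            raise ValueError("a base pair has exactly two indices")

    def is_canonical(self):
        # Follows the slide's definition: canonical means cis-WW.
        return self.lw_class == "cis-WW"


def is_loop(region, base_pairs):
    """True if region is contiguous from i to j and (i-1, j+1) is a base pair."""
    if not region.indices or not region.is_contiguous():
        return False
    closing = (region.indices[0] - 1, region.indices[-1] + 1)
    return any(
        bp.sequence_id == region.sequence_id and bp.indices == closing
        for bp in base_pairs
    )


# Hairpin on a hypothetical sequence "s": bases 3..6 closed by the pair (2, 7).
closing_pair = BasePair("s", (2, 7))
hairpin = SequenceRegion("s", (3, 4, 5, 6))
```

Here `is_loop(hairpin, [closing_pair])` holds because the region is contiguous and the pair (2, 7) flanks it. The sketch assumes each pair stores its indices in ascending order.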
  • Slide 16
  • So how do we handle all this? A proposal
      • Relationships:
          • correspondence: relation among a set of sequence_regions implying all share a feature (with metadata about how determined)
          • homology: correspondence implying continuous chain of descent preserving the relation
          • sequence_similarity: correspondence implying regions are similar in primary sequence
          • two_d_structure_similarity: correspondence implying regions are similar in 2D structure, i.e. nested canonical base pairs
          • secondary_structure_similarity: correspondence implying regions are similar in secondary structure, i.e. incl. pseudoknots/noncanonicals
          • tertiary_structure_similarity: correspondence implying regions are similar in 3D structure
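One way to read this list is as a small type hierarchy in which every specific relation is a kind of correspondence. The sketch below assumes that reading. The `Region` tuple, the `determined_by` field, and the example identifiers and coordinates are hypothetical stand-ins, not part of the proposal.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical stand-in for sequence_region: (sequence id, start, end).
Region = Tuple[str, int, int]


@dataclass(frozen=True)
class Correspondence:
    """A set of regions that share a feature, plus how that was determined."""
    regions: FrozenSet[Region]
    determined_by: str = "unspecified"


class Homology(Correspondence):
    """Correspondence implying a continuous chain of descent."""


class SequenceSimilarity(Correspondence):
    """Correspondence in primary sequence."""


class TwoDStructureSimilarity(Correspondence):
    """Correspondence in nested canonical base pairs."""


class SecondaryStructureSimilarity(Correspondence):
    """Correspondence including pseudoknots and noncanonical pairs."""


class TertiaryStructureSimilarity(Correspondence):
    """Correspondence in 3D structure."""


def relations_between(a, b, correspondences):
    """Names of all correspondence types asserted for both regions."""
    return sorted(
        type(c).__name__
        for c in correspondences
        if a in c.regions and b in c.regions
    )


# Illustrative only: made-up coordinates for two hammerhead ribozymes.
natural = ("natural_hammerhead", 0, 40)
selected = ("selex_hammerhead", 0, 38)
annotations = [
    SecondaryStructureSimilarity(
        frozenset({natural, selected}), determined_by="manual comparison"
    )
]
```

With these annotations the two regions can be aligned on structure while no `Homology` is asserted between them, which matches the SELEX example where correspondence exists without descent.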
  • Slide 17
  • So how do we handle all this? A proposal
      • Relationships:
          • pairing: relation that asserts that two sequence_regions each have parts of at least one base_pair that connects them
          • helical_pairing: pairing that includes several base_pairs (not necessarily contiguous) between two sequence_regions
          • unbroken_helical_pairing: helical_pairing that includes no bases in the sequence_regions that are not paired with the other sequence_region, in order
          • base_pairing: pairing that connects exactly two bases, annotated with the Leontis-Westhof classification
      • More exotic uses for alignment:
          • microrna_target: pairing relation in which one member is a miRNA and the other is an mRNA according to SO
          • same_microrna_target: a relation among a set of sequences that have microrna_target relation to the same miRNA
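The three nested pairing definitions can be checked mechanically. The sketch below is one possible interpretation, under stated assumptions: both regions lie on the same sequence and are given as lists of indices, "several" is taken to mean at least two pairs (`min_pairs` is a hypothetical parameter), and "in order" is read as antiparallel, so that as the index in the first region increases, its partner's index decreases.

```python
def connecting_pairs(region_a, region_b, base_pairs):
    """Base pairs with one base in region_a and the other in region_b,
    returned as sorted (index_in_a, index_in_b) tuples."""
    a, b = set(region_a), set(region_b)
    found = set()
    for p, q in base_pairs:
        if p in a and q in b:
            found.add((p, q))
        elif q in a and p in b:
            found.add((q, p))
    return sorted(found)


def is_pairing(region_a, region_b, base_pairs):
    """At least one base pair connects the two regions."""
    return len(connecting_pairs(region_a, region_b, base_pairs)) >= 1


def is_helical_pairing(region_a, region_b, base_pairs, min_pairs=2):
    """Several base pairs connect the regions (not necessarily contiguous)."""
    return len(connecting_pairs(region_a, region_b, base_pairs)) >= min_pairs


def is_unbroken_helical_pairing(region_a, region_b, base_pairs, min_pairs=2):
    """Helical pairing in which every base of both regions pairs with the
    other region exactly once, in antiparallel order."""
    pairs = connecting_pairs(region_a, region_b, base_pairs)
    if len(pairs) < min_pairs:
        return False
    left = [i for i, _ in pairs]
    right = [j for _, j in pairs]
    every_base_paired = (
        left == sorted(set(region_a)) and sorted(right) == sorted(set(region_b))
    )
    antiparallel = all(j1 > j2 for j1, j2 in zip(right, right[1:]))
    return every_base_paired and antiparallel


# A three-pair stem, and the same stem with one unpaired (bulged) base at 2.
stem = [(0, 9), (1, 8), (2, 7)]
bulged = [(0, 9), (1, 8), (3, 7)]
```

For `stem`, regions `[0, 1, 2]` and `[7, 8, 9]` form an unbroken helical pairing. For `bulged`, regions `[0, 1, 2, 3]` and `[7, 8, 9]` are still a helical pairing, but index 2 is unpaired, so the unbroken check fails.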
  • Slide 18
  • Implementation notes
      • Must be able to name regions (e.g. P3 in RNaseP) and subclass them (e.g. P3 in firmicutes)
      • Must be able to subclass homologies, e.g. homologous as wing vs. homologous as limb
      • Correspondences are all symmetric and transitive, so can implement as a set of regions that share the correspondence
      • (probably) don't want to reify names of parts of well-known RNAs in the overall RNAO?
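Because a symmetric, transitive relation partitions regions into groups, a disjoint-set (union-find) structure is one natural way to store it. The sketch below is an assumed implementation choice, not something the slides specify. The class name, method names, and relation labels are hypothetical, and plain strings stand in for regions.

```python
from collections import defaultdict


class CorrespondenceSets:
    """Groups of regions that share one kind of correspondence."""

    def __init__(self):
        self._parent = {}

    def _find(self, region):
        self._parent.setdefault(region, region)
        root = region
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node straight at the root.
        while self._parent[region] != root:
            self._parent[region], region = root, self._parent[region]
        return root

    def add(self, a, b):
        """Assert that a and b correspond; merges their groups."""
        self._parent[self._find(a)] = self._find(b)

    def corresponds(self, a, b):
        return self._find(a) == self._find(b)

    def members(self, region):
        root = self._find(region)
        return {r for r in list(self._parent) if self._find(r) == root}


# One partition per (sub)type of correspondence, keyed by a hypothetical label.
relations = defaultdict(CorrespondenceSets)
relations["homologous_as_limb"].add("bat_forelimb", "bird_forelimb")
relations["homologous_as_limb"].add("bird_forelimb", "frog_forelimb")
```

Only two assertions were made, yet `relations["homologous_as_limb"].corresponds("bat_forelimb", "frog_forelimb")` holds by transitivity. Keeping a separate partition per label lets bat and bird forelimbs correspond as limbs while `relations["homologous_as_wing"]` asserts nothing about them.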
  • Slide 19
  • Acknowledgements
      • RNA Alignment Ontology working group: James W. Brown, Fabrice Jossinet, Rym Kachouri, B. Franz Lang, Neocles Leontis, Gerhard Steger, Jesse Stombaugh, Eric Westhof
      • Other coauthors: Amanda Birmingham, Paul Griffiths, Franz Lang
      • NSF RCN grant # 0443508
      • Knight Lab members: Cathy Lozupone, Micah Hamady, Chris Lauber, Jesse Zaneveld, Jeremy Widmann, Elizabeth Costello, Jens Reeder, Daniel McDonald, Anh Vu, Ryan Kennedy, Julia Goodrich, Meg Pirrung, Reece Gesumaria
      • Trp project: Irene Majerfeld, Jana Chochosolousova, Vikas Malaiya, Matthew Iyer, Mike Yarus