RNA multiple sequence alignment Craig L. Zirbel [email protected] October 14, 2010


Page 1: RNA multiple sequence alignment

RNA multiple sequence alignment

Craig L. Zirbel, October 14, 2010

Page 2: RNA multiple sequence alignment

RNA primary sequences

Laboratory techniques make it possible to extract specific RNA molecules and determine the sequence of nucleotides. Here are the (unaligned) sequences of the 5S ribosomal RNA molecule from different organisms:

UUAGGCGGCCACAGCGGUGGGGUUGCCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCACCAGCGUUCCGGGGAGUACUGGAGUGCGCGAGCCUCUGGGAAACCCGGUUCGCCGCCACC A H.m. (structure)
GCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGC B E.coli (structure)
UCCCCCGUGCCCAUAGCGGCGUGGAACCACCCGUUCCCAUUCCGAACACGGAAGUGAAACGCGCCAGCGCCGAUGGUACUGGGCGGGCGACCGCCUGGGAGAGUAGGUCGGUGCGGGG B T.th. (structure)
AGUGGUGGCCAUAUCGGCGGGGUUCCUCCCCGUACCCAUCCUGAACACGGAAGAUAAGCCCGCCAGCGUCCGGCAAGUACUGGAGUGCGCGAGCCUCUGGGAAAUCCGGUUCGCCGCCAC A L27170.1/1-120
GUAGCGGCCACAGCGGUGGGGUUCCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCACCAGCGUUCCGGGGAGUACUGGAGUGCGCGACCCUCUGGGAAACCGGGUUCGCCGCUAC A L27163.1/1-119
GCGGCCAGGGCGGAGGGGAAACACCCGUACCCAUUCCGAACACGGAAGUGAAGCCCUCCAGCGAACCAGCUAGUACUAGAGUGGGAGACCCUCUGGGAGCGCUGGUUCGCCGCC A L27343.1/3-116
UUUGGCGGUCAUGGCGUGGGGGUUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUUUUUUGCUGUGGGAAGCCCACUUCACUGCCAGAC A M36187.1/5-126
GUUGGCGGUCAUGGCGUGGGGUUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUUUUUUGCUGUGGGAAGCCCACUUCACUGCCAGAC A X62857.1/1-121
UUUGGCGGUCAUGGCGUGGGGGUUAUACCUGAUCUCGUUUCGAUCUCAGUAGUUAAGUCCUGCUGCGUUGUGGGUGUGUACUGCGGUGUUUUGCUGUGGGAAGCCCAUUUCACUGCCAGCC A X15364.1/6601-6721
GUCGGUGGUGUUAGCGGUGGGGUCACGCCCGGUCCCUUUCCGAACCCGGAAGCUAAGCCUGCCUGCGCCGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGACCCCGCCGGCA B M16176.1/4-120
GUCGGUGGUUAUAGCGGUGGGGUCACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCCACCUGCGCCGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGUCACCGCCGGCC B M16177.1/4-120
GUUGGUGGUUAUUGUGUCGGGGGUACGCCCGGUCCCUUUCCGAACCCGGAAGCUAAGCCCGAUUGCGCUGAUGGUACUGCACCUGGGAGGGUGUGGGAGAGUAGGUCGCUGCCAACC B X55255.1/4-120
UACGGCGGUCAAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCAAUGAUACUGCCCUCACCGGGUGGAAAAGUAGGACACCGCCGAAC B X55259.1/3-117
UACGGCGGUCCAUAGCGGCAGGGAAACGCCCGGUCCCAUCCCGAACCCGGAAGCUAAGCCUGCCAGCGCCGAUGAUACUACCCAUCCGGGUGGAAAAGUAGGACACCGCCGAAC B X55251.1/3-116
UACGGCGGCCACAGCGGCAGGGAAACGCCCGGUCCCAUUCCGAACCCGGAAGCUAAGCCUGCCAGCGCCGAUGAUACUGCCCCUCCGGGUGGAAAAGUAGGACACCGCCGAAC B X75601.1/91-203
UAAGGCGGCCAUAGCGGUGGGGUUACUCCCGUACCCAUCCCGAACACGGAAGAUAAGCCCGCCUGCGUUCCGGUCAGUACUGGAGUGCGCGAGCCUCUGGGAAAUCCGGUUCGCCGCCUACU A X03407.1/5927-6048
UUGGCGACCAUAGCGGCGAGUGACCUCCCGUACCCAUCCCGAACACGGAAGAUAAGCUCGCCUGCGUUUCGGUCAGUACUGGAUUGGGCGACCCUCUGGGAAAUCUGAUUCGCCGCCACC A L27168.1/1-120
GGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCACCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCC A X02128.1/24-139
GGGCGGCCAGAGCGGUGAGGUUCCACCCGUACCCAUCCCGAACACGGAAGUUAAGCUCGCCUGCGUUCUGGUCAGUACUGGAGUGAGCGAUCCUCUGGGAAAUCCAGUUCGCCGCCCCU A X14441.1/5-123

Page 3: RNA multiple sequence alignment

Watson-Crick basepairs

Watson-Crick basepairs can substitute for one another freely without changing the structure of the RNA molecule. They are said to be isosteric, and changes between these basepairs are an example of neutral variability. Each basepair is held together by hydrogen bonds (dotted lines).

Superposition

Page 4: RNA multiple sequence alignment

RNA sequence variability

To preserve RNA helices, compensating mutations must be made; to replace a GC basepair with an AU basepair, two letters must change in distant regions of the sequence. In the two sequences below, position 10 changes between A and G and position 110 changes between U and C. Statistically, this is called “long-range dependence.”

Compensating mutations such as this do not change the secondary or tertiary structure of the molecule.

UGCCUGGCGACCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAAUUGCCAGGCAU

UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU
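The compensating change in the two sequences above can be located by comparing them position by position. The following is a minimal sketch; the names `REFERENCE`, `VARIANT`, and `differences` and the 1-based position convention are choices made here for illustration, not part of the original slides.

```python
# The two 5S rRNA sequences from this slide, written in 40-letter chunks.
VARIANT = (
    "UGCCUGGCGACCGUAGCGCGGUGGUCCCACCUGACCCCAU"
    "GCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGU"
    "GUGGGGUCUCCCCAUGCGAGAGUAGGGAAUUGCCAGGCAU"
)
REFERENCE = (
    "UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAU"
    "GCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGU"
    "GUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU"
)

WATSON_CRICK = {"AU", "UA", "GC", "CG"}


def differences(a, b):
    """Return (1-based position, letter in a, letter in b) wherever a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(pos, x, y) for pos, (x, y) in enumerate(zip(a, b), start=1) if x != y]


diffs = differences(REFERENCE, VARIANT)
print(diffs)  # [(10, 'G', 'A'), (110, 'C', 'U')]

# The two changed positions form a Watson-Crick pair in both sequences.
(i, _, _), (j, _, _) = diffs
for name, seq in (("reference", REFERENCE), ("variant", VARIANT)):
    pair = seq[i - 1] + seq[j - 1]
    print(name, pair, pair in WATSON_CRICK)
```

Both changes are needed together: changing only one of the two positions would leave a GU or AC combination in place of the Watson-Crick pair.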

Page 5: RNA multiple sequence alignment

Comparative sequence analysis

By manually aligning similar RNA sequences and noting the pairs of columns where mainly AU, CG, GC, and UA pairs occur, one can infer the secondary structure of an RNA molecule.

• This is the inferred secondary structure of the 5S RNA, with bases labeled as found in E. coli. There are five helical regions, with three “internal loops” and two “hairpin loops” separating them. Note the colors!

Fox & Woese 1975; Peattie et al. 1981; Noller 1984; Cannone et al. 2002; http://www.rna.ccbb.utexas.edu

UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU
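The column-pair scan described on this slide can be sketched in a few lines. The alignment below is a toy example invented for illustration (it is not 5S data), and the function name `covarying_columns` is hypothetical. The slide asks for columns where "mainly" Watson-Crick pairs occur; this sketch uses the stricter rule that every sequence must show a Watson-Crick pair and that at least two different pairs must be seen.

```python
from itertools import combinations

WATSON_CRICK = {"AU", "UA", "GC", "CG"}

# Toy alignment invented for illustration: four short hairpins of equal length.
TOY_ALIGNMENT = [
    "GGGAAACCC",
    "AUCGCAGAU",
    "GCGAAACGC",
    "CAGGAACUG",
]


def covarying_columns(alignment, min_pair_types=2):
    """Return column pairs (i, j), 0-based, where every sequence has a
    Watson-Crick pair and at least `min_pair_types` distinct pairs occur."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for i, j in combinations(range(width), 2):
        pairs = [row[i] + row[j] for row in alignment]
        if all(p in WATSON_CRICK for p in pairs) and len(set(pairs)) >= min_pair_types:
            hits.append((i, j))
    return hits


print(covarying_columns(TOY_ALIGNMENT))  # [(0, 8), (1, 7), (2, 6)]
```

The three nested column pairs that come out correspond to a three-basepair helix closing a three-letter hairpin loop.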

Page 6: RNA multiple sequence alignment

RNA 3D structure

• Starting late in the year 2000, high-resolution atomic structures of entire ribosomes have been published. These show the bases, the backbone, the Watson-Crick basepairs, and several new types of basepairs.

E. coli 5S

The 2009 Nobel Prize in Chemistry went to Yonath, Ramakrishnan, and Steitz for their work on x-ray crystal structures of ribosomes.

Page 7: RNA multiple sequence alignment

Three 5S rRNA 3D structures

Haloarcula marismortui E. coli Thermus thermophilus

Page 8: RNA multiple sequence alignment

RNA multiple sequence alignment

The same RNA in different organisms can be presumed to have the same, or roughly the same, secondary and 3D structure.

Compensating changes far apart in the sequence make it hard to use multiple sequence alignment tools that were developed for proteins.

Page 9: RNA multiple sequence alignment

Two situations for RNA multiple sequence alignment

1. We have two or more sequences from the same RNA, but don’t know their common secondary structure or 3D structure

2. We have RNA sequences and a common secondary structure or even a single 3D structure which we can assume they all share to some degree

Page 10: RNA multiple sequence alignment

10-14-2010

RNA MSA

RNA Multiple Sequence Alignment

Slides by Anton Petrov, Ph.D. student, BGSU

Page 11: RNA multiple sequence alignment

Why DNA and protein alignment methods don’t work for RNA

RNA sequences may look dissimilar but still fold into the same structure.

Gorodkin et al., 2010. Trends in biotechnology
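A small illustration of the point on this slide: the two toy sequences below (invented here, not taken from Gorodkin et al.) share only one letter out of nine, yet both can form the same three-basepair hairpin. The helper names and the decision to accept GU wobble pairs alongside Watson-Crick pairs are assumptions of this sketch.

```python
# Pairs accepted as helix-forming in this sketch: Watson-Crick plus GU wobble.
ALLOWED_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}

# One shared hairpin, given as 0-based paired positions.
HAIRPIN_PAIRS = [(0, 8), (1, 7), (2, 6)]


def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def can_form(seq, pairs):
    """True if every listed pair of positions holds an allowed basepair."""
    return all(seq[i] + seq[j] in ALLOWED_PAIRS for i, j in pairs)


seq_a = "GGGAAACCC"
seq_b = "AUCGCAGAU"

print(round(percent_identity(seq_a, seq_b), 1))  # 11.1
print(can_form(seq_a, HAIRPIN_PAIRS), can_form(seq_b, HAIRPIN_PAIRS))  # True True
```

A method that scores only letter-by-letter similarity has almost nothing to work with here, while a method that scores basepairing sees two equally good hairpins.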

Page 12: RNA multiple sequence alignment

Gorodkin et al., 2010. Trends in biotechnology

Example

Page 13: RNA multiple sequence alignment

RNA-specific alignment methods

FOLDALIGN http://foldalign.ku.dk/index.html

MAFFT http://mafft.cbrc.jp/alignment/server/

LocARNA http://rna.informatik.uni-freiburg.de:8080/LocARNA.jsp

R-Coffee http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi?stage1=1&daction=RCOFFEE::Regular

and many others...

Page 14: RNA multiple sequence alignment

RNA MSA and ncRNA discovery

Conservation is a reliable indicator of biological importance.

If an RNA fragment is conserved across multiple species, it may function as ncRNA.

ncRNA discovery programs scan multiple genomic sequences in order to detect putative ncRNA candidates.

MSA is an essential part of the ncRNA discovery pipeline.

Page 15: RNA multiple sequence alignment

RNA MSA and ncRNA discovery

Multiple sequence alignment

ncRNA discovery

Secondary structure prediction

Align first

Fold first

Align and fold simultaneously

Page 16: RNA multiple sequence alignment

RNAz

Once you have a good MSA, you can use tools like RNAz to scan your alignment for conserved stable secondary structures, which may function as ncRNAs.

http://rna.tbi.univie.ac.at/cgi-bin/RNAz.cgi

Page 17: RNA multiple sequence alignment

Suggested reading

Page 18: RNA multiple sequence alignment

Alignment to a common secondary structure

One standard starting point is a “seed” alignment of 20-100 RNA sequences together with a “dot-bracket” secondary structure diagram.

Infernal is a program that makes a “covariance model” based on the seed alignment and allows one to align new sequences to this model, thus aligning new sequences to an existing alignment.
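A dot-bracket diagram encodes each basepair as a matching "(" and ")" and each unpaired position as ".". The sketch below shows how such a string can be turned into a list of paired positions with a stack. It assumes only those three symbols; the function name is hypothetical, and richer notations used by real tools (for example, other bracket types) are not handled.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of 0-based (open, close) paired positions."""
    stack = []   # positions of "(" that are still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..((...))..))"))
# [(0, 14), (1, 13), (4, 10), (5, 9)]
```

Applied to the consensus structure line of a seed alignment, the resulting pairs tell which alignment columns are expected to covary.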

Page 19: RNA multiple sequence alignment

Alignment to a model based on a 3D structure

One focus of the BGSU RNA group

Take an RNA 3D structure, with all of the detail it gives about Watson-Crick basepairs and other RNA basepairs

Make a model for sequence variability

Align RNA sequences to the model, and thus to one another.

Page 20: RNA multiple sequence alignment

H.m. 5S rRNA basepair diagram

Standard AU and GC Watson-Crick basepairs are denoted by = or –

In other pairs, a circle stands for the Watson-Crick edge, a square for the Hoogsteen edge, and a triangle for the Sugar edge.

The basepair diagrams for E. coli and T.th. are similar.

Working hypothesis: other organisms have largely the same basepair diagram, with neutral basepair substitutions that do not alter the 3D structure

Page 21: RNA multiple sequence alignment

Non-Watson-Crick basepairs

The 3D structures show a variety of planar basepair interactions other than Watson-Crick basepairs. These occur between helices and allow the RNA molecule to achieve tighter turns or other important 3D structural features.

trans Hoogsteen / Sugar Edge

A78-G98 in E.coli 5S

A45-U40 in E.coli 5S

cis Watson-Crick / Sugar Edge

A57 – C30 in E.coli 5S

A46 – A39 in E.coli 5S

trans Sugar Edge / Sugar Edge

G13 – G69 in E.coli 5S

Page 22: RNA multiple sequence alignment

Isostericity for non-Watson-Crick basepairs

Non-Watson-Crick basepairs have different basepair substitution (isostericity) rules than Watson-Crick pairs. Below are some examples of geometrically similar basepairs.

trans Hoogsteen / Sugar Edge

A78-G98 in E.coli 5S

A45-U40 in E.coli 5S

cis Watson-Crick / Sugar Edge

A57 – C30 in E.coli 5S

A46 – A39 in E.coli 5S

trans Sugar Edge / Sugar Edge

G13 – G69 in E.coli 5S

Page 23: RNA multiple sequence alignment

Stochastic grammars

Stochastic grammars are probabilistic models for sequences of characters or words.

They are capable of enforcing specified grammatical rules while allowing for variability in the specific sequence.

The classic example: “Colorless green ideas sleep furiously” obeys English grammatical rules, but is a very unlikely sentence to occur in normal English.

Context-free grammars have certain limitations on the grammatical rules that can be enforced.

Chomsky 1956, 1959; Durbin and Eddy 1994.

Page 24: RNA multiple sequence alignment

Simple SCFG model for RNA

From the basepair diagram, we construct a model which mimics the structure of the molecule but which allows for neutral basepair variability and other minor variations.

The 5S itself is too large, so we display a very small cartoon of the 5S molecule.

5’

3’

Page 25: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Initial node generates letters independently with a given length and letter distribution. This time we get an A on the left and CA on the right.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 26: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

A Basepair node generates a (dependent) pair of letters and independent insertions. The first Basepair node generates a CG pair and inserts an A on the right (before the G).

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 27: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Basepair node generates CG with no insertions.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 28: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Junction node generates nothing, but passes control to its two child nodes.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 29: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Initial node on the left branch generates U on the left and AC on the right; the Initial node on the right branch generates AU.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 30: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Basepair node on the left branch generates GC; the Basepair node on the right branch generates AG.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 31: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Basepair node on the left branch generates UA; the Basepair node on the right branch generates GA.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 32: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

The Hairpin node on the left branch generates UUCG (a variant of the UNCG hairpin) and the last Basepair node on the right branch generates GC.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 33: RNA multiple sequence alignment

Using the SCFG model to generate sequence variants

Finally, the last Hairpin generates GAAGA (a variant of the GNRA loop with one insertion) and generation stops.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
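The generation steps walked through on the preceding slides can be replayed in code. The sketch below is deterministic: it records the letters each node emitted in this one example rather than sampling from probability distributions, so it illustrates the tree structure only. The class layout and field names are assumptions made here, and placing the right-branch Initial node's AU entirely on the right side is inferred from the final sequence rather than stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class Hairpin:
    loop: str

    def emit(self):
        return self.loop


@dataclass
class Initial:
    left: str
    right: str
    child: object

    def emit(self):
        return self.left + self.child.emit() + self.right


@dataclass
class Basepair:
    left: str
    right: str
    child: object
    insert_left: str = ""    # letters inserted just after the left base
    insert_right: str = ""   # letters inserted just before the right base

    def emit(self):
        inner = self.insert_left + self.child.emit() + self.insert_right
        return self.left + inner + self.right


@dataclass
class Junction:
    children: list

    def emit(self):
        return "".join(child.emit() for child in self.children)


left_branch = Initial("U", "AC",
    Basepair("G", "C",
        Basepair("U", "A", Hairpin("UUCG"))))

right_branch = Initial("", "AU",   # assumption: both letters fall on the right
    Basepair("A", "G",
        Basepair("G", "A",
            Basepair("G", "C", Hairpin("GAAGA")))))

model = Initial("A", "CA",
    Basepair("C", "G",
        Basepair("C", "G", Junction([left_branch, right_branch])),
        insert_right="A"))

generated = model.emit()
print(generated)  # ACCUGUUUCGACACAGGGAAGACAGAUGAGCA
```

In a stochastic model each node would instead draw its letters from a distribution: independent letters for Initial and insertion positions, and a joint distribution over the two letters of a Basepair.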

Page 34: RNA multiple sequence alignment

Parsing sequences according to a model

Typically, a model can generate the same sequence in several different ways.

Given a model and a sequence that was generated by the model, we want to determine the single way of generating the sequence that is most likely.

The most likely generation history tells which node generated which part of the sequence, and so aligns the sequence to the model.
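In symbols, and using notation introduced here rather than taken from the slides: write $x$ for the observed sequence, $M$ for the model, and $\pi$ for one generation history (parse) of $x$. The total probability of the sequence sums over all histories, while alignment uses only the single most probable one.

```latex
P(x \mid M) = \sum_{\pi} P(x, \pi \mid M),
\qquad
\hat{\pi} = \arg\max_{\pi} P(x, \pi \mid M)
```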

Page 35: RNA multiple sequence alignment

Multiple ways to generate a sequence

Here is another way that the same simple model could have generated the sequence. This generation history would have very low probability, since the letters indicated to make certain basepairs are not isosteric with the originally observed basepair.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 36: RNA multiple sequence alignment

Determining the maximum probability generation history for a sequence

The CYK (Cocke, Younger, Kasami) dynamic programming algorithm has each node (from leaves to root) examine each subsequence (from shorter to longer) and record the maximum probability way that it and its children could generate that subsequence.

Here, the blue Basepair node considers how it and its children can generate the colored subsequence.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 37: RNA multiple sequence alignment

Determining the maximum probability generation history for a sequence

The blue Basepair node considers another way for it and its children to generate the colored subsequence.

The red nodes have already considered every subsequence of this length and shorter.

The algorithm runs in O(L^2 M) time, where L is the length of the input sequence and M is the number of nodes.

ACCUGUUUCGACACAGGGAAGACAGAUGAGCA

Page 38: RNA multiple sequence alignment

Uses for SCFG models for RNA

Multiple sequence alignment of RNA

Searching genomes for RNAs homologous to a given RNA

Infernal is a commonly used SCFG program

Note: Some RNAs are absolutely ancient (messenger RNA, ribosomal RNA, transfer RNA), but there are many RNAs that people are just learning about now, like microRNAs and other regulatory RNAs. They occur in UTRs, introns, and intergenic regions, and we need to be able to recognize them!