rna structure prediction including pseudoknots based on stochastic multiple context-free grammar

Post on 19-Mar-2016

34 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

RNA Structure Prediction Including Pseudoknots Based on Stochastic Multiple Context-Free Grammar. PMSB2006, June 18, Tuusula, Finland Yuki Kato, Hiroyuki Seki and Tadao Kasami Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). NAIST. Table of Contents. - PowerPoint PPT Presentation

TRANSCRIPT

RNA Structure Prediction Including Pseudoknots

Based on Stochastic Multiple Context-Free Grammar

PMSB2006, June 18, Tuusula, Finland

Yuki Kato, Hiroyuki Seki and Tadao KasamiGraduate School of Information Science,

Nara Institute of Science and Technology (NAIST)

2

NAIST

3

Table of Contents

Background Grammatical approach to RNA structure model

ing Model

Stochastic multiple context-free grammar Algorithms

Parsing and parameter estimation Experimental results

RNA pseudoknot prediction Summary

4

RNA Secondary Structure:Stem-Loop

5’—C A A U G A C—3’

C•GU•AU•A

U C A A

C U U C A U C A G A A A A U G A C

nested

Connect base pairs with arcs.

Loop

Stem

Complementarybase pairs

A•UG•C

5

Modeling RNA Secondary Structure by Context-Free Grammar (CFG)

RNA secondary structure can be modeled by parse structure of CFG.Structure prediction Parsing

Example of CFG rules:

U U C A U C A G A ASecondary structure

S

S

S

S

u u c a u c a g a aDerivation tree

6

RNA Secondary Structure:Pseudoknot

CFGs cannot represent pseudoknots.

5’—C U U C A A G A C U U G A C—3’

• • •• • •

A

AA

C U U C A U C A G A A A A U G A C

crossed

Connect base pairs with arcs.

7

Early Studies

Brown & Wilson /

Cai et al.

Rivas & Eddy / Uemura e

t al.

Matsuiet al.

Generative power

Based on CFG > CFG > CFG

Metric for optimization

Stochastic grammar Free energy Stochastic

grammar

Time complexity

O(n3) / O(n6) O(n6) / O(n5) O(n5)

n: sequence length

8

Early Studies (cont.)

Grammars for fully describing RNA pseudoknots: SL-TAG and ESL-TAG [Uemura et al., 1999] RPG [Rivas and Eddy, 2000]

These grammars have been identified as subclasses of multiple context-free grammars. [Kato et al., 2005]

9

Motivation

Multiple context-free grammar (MCFG): Natural extension of CFG

Easy to compare generative power and design algorithms

Generative power to represent pseudoknots Polynomial time parsing algorithm

We have shown a candidate subclass of the minimum grammars of MCFGs for representing pseudoknots.[Kato et al., 2005]

10

What’s New in the Present Work

Extension of MCFGs to a probabilistic model (stochastic MCFG, SMCFG)

Design of polynomial time parsing and parameter estimation algorithms for the subclass of SMCFGs

Experiments on RNA pseudoknot prediction

11

Early Studies and Present Work

Brown & Wilson /

Cai et al.

Rivas & Eddy / Uemura et a

l.

Matsuiet al.

Our model

Generative power

Based on CFG > CFG > CFG > CFG

Metric for optimization

Stochastic grammar Free energy Stochastic

grammarStochastic grammar

Time complexity O(n3) / O(n6) O(n6) / O(n5) O(n5) O(n5)

12

Table of Contents

Background Grammatical approach to RNA structure model

ing Model

Stochastic multiple context-free grammar Algorithms

Parsing and parameter estimation Experimental results

RNA pseudoknot prediction Summary

13

Relation between SMCFG and Major Probabilistic Models

MCFG

CFG

FA

SMCFG

SCFG

HMM

Generativepower

Strong

Weak

Probabilisticextension

A G A C U UPseudoknot

A G A C UStem-loop

Gene finding

genes

14

From HMM to SCFG

HMM SCFG

RuleTerminal and nonte

rminal

“Emit a in Aand transit to B”

Sequence of nonterminals

and terminals

15

Stochastic Multiple Context-Free Grammar (SMCFG)

G = (N, T, F, P, S)N: finite set of nonterminals, T: finite set of terminals,F: finite set of functions,P: finite set of rules with probabilities, S N: start symbol

SCFG SMCFG

Nonterminalgenerates sequence generates tuple of sequences

Rule Sequence of nonterminals

and terminals

A0,…, Ak: nonterminalsf: function defined over seque

nces

16

Functions of SMCFG

Example:

17

Rules of SMCFG

Rule: : probability that the rule is applied The sum of the probabilities of the rules with

the same left hand side should be one. Example:

18

Derivation Trees in SMCFG

A1

Prob. p1…

Ak

Prob. pk

Prob.

A: f

…A1 Ak

19

Modeling Pseudoknot by SMCFG

UP2La[(x1, x2)] = (x1, ax2)

UP2Ru[(x1, x2)] = (x1, x2u)

(a g , a c u u)

(a g , a c u)

(a g , c u)

AProb.

0.7

B

A

Prob. 0.35

Prob. 0.28

20

SMCFG for RNA Pseudoknot Modeling W1,…,Wm: nonterminals

Note: W1 is the start symbol. For each rule, two real values called

transition probability p1 (0 < p1 1) and emission probability p2 (0 < p2 1) are specified.

Probability of each rule is defined as

21

SMCFG Gs

22

Table of Contents

Background Grammatical approach to RNA structure model

ing Model

Stochastic multiple context-free grammar Algorithms

Parsing and parameter estimation Experimental results

RNA pseudoknot prediction Summary

23

Algorithms for SMCFG

CYK algorithmcalculates the optimal alignment of a sequence to an SMCFG (the most likely derivation tree).

Inside algorithmcalculates the probability of a sequence given an SMCFG.

Inside-outside algorithmestimates optimal probability parameters for an SMCFG given a set of example sequences.

24

CYK Algorithm

Input: The following are calculated by dynamic progra

mming: : log maximum probability that Wv gener

ates : log maximum probability that Wy ge

nerates

25

CYK Algorithm (cont.)

Output: log maximum probability that W1 generates

i.e. : the most likely derivation tree : entire set of probability parameters

26

Algorithm [CYK]

Initialization:for i←1 to n+1, j←i to n+1, v←1 to m

do if // : empty sequencethen

else Iteration:

for i←n downto 1, j←i1 to n, k←n+1 downto j+1, l←k1 to n, v←1 to m// Some examples are shown.

27

Algorithm [CYK] (cont.)

if

Wv

Wy Wz

i h h+1 j k l1 n

x1 x21 x22

28

Algorithm [CYK] (cont.)

if

Wv

Wy

i l1i+1 j k l1 n

x1 x2ai al

29

Complexity of CYK Algorithm

m: # of nonterminals (m = a+b) n: sequence length

Time complexity: O(amn4+bn5) Space complexity: O(mn4)

30

Table of Contents

Background Grammatical approach to RNA structure model

ing Model

Stochastic multiple context-free grammar Algorithms

Parsing and parameter estimation Experimental results

RNA pseudoknot prediction Summary

31

Experimental Method

Construction of a model

RNA familydatabase

SMCFG

CUAGUCUUATest sequence

Secondarystructureprediction

CYK algorithm

Sample sequenceswith structure annotation

CUACUGUUC

parsing

32

Data Sets for Experiments

Three viral RNA families including pseudoknots from Rfam ver. 7.0

Family Rage of length # of annotated sequences

# of test sequences

Corona_pk_3 6264 14 10

HDV_ribozyme 8791 15 10

Tombus_3_IV 8992 18 12

33

Corona_pk_3 in Rfam ver. 7.0

Coronavirus 3' UTR pseudoknot

Sequence length:6264

Consensus structure

34

HDV_ribozyme in Rfam ver. 7.0

Hepatitis delta virus ribozyme

Sequence length:8791

Consensus structure

35

Tombus_3_IV in Rfam ver. 7.0

Tombusvirus 3' UTR region IV

Sequence length:8992

Consensus structure

36

Evaluation for Prediction Results

precision

=

recall

=

# of correct base pairs predicted by the algorithm# of predicted base pairs

# of correct base pairs predicted by the algorithm# of base pairs specified by the annotation

37

Experimental Results

Prediction accuracy

RNAfamily

Precision [%] Recall [%]

Average Min Max Average Min Max

Corona_pk_3 99.4 94.4 100.0 99.4 94.4 100.0

HDV_ribozyme 100.0 100.0 100.0 100.0 100.0 100.0

Tombus_3_IV 100.0 100.0 100.0 100.0 100.0 100.0

38

Experimental Results (cont.)

Running time

*: Implementation in ANSI C on a machine with Intel Pentium D CPU 2.80GHZ and 2.00GB RAM

RNAfamily

CPU time* [sec]

Average Min Max

Corona_pk_3 27.8 26.0 30.4

HDV_ribozyme 252.1 219.0 278.4

Tombus_3_IV 244.8 215.2 257.5

39

Pair Stochastic Tree Adjoining Grammar (PSTAG)[MSS05]

Derivation treerepresenting

known structure

Test sequence alignment

[MSS05] Matsui et al., “Pair stochastic tree adjoining grammars for aligningand predicting pseudoknot RNA structures,” Bioinformatics, 2005.

RNA familydatabase

PSTAG algorithm

CUAGUCUUA Secondarystructureprediction

Sample sequenceswith structure annotation

CUACUGUUC

40

Comparison with PSTAG

ModelAverage precision [%] Average recall [%]

Corona HDV Tombus Corona HDV Tombus

SMCFG 99.4 100.0 100.0 99.4 100.0 100.0

PSTAG 95.5 95.6 97.4 94.6 94.1 97.4

41

Summary

A new probabilistic model called SMCFG has been proposed for RNA pseudoknot modeling.

Polynomial time parsing and parameter estimation algorithms have been designed.

Experimental results on RNA pseudoknot prediction have shown good prediction accuracy.

top related