improving the prediction of rna secondary structure by detecting and assessing conserved stems

30
Improving the prediction of RNA secondary structure by detecting and assessing conserved stems Xiaoyong Fang, et al

Upload: takoda

Post on 15-Jan-2016

20 views

Category:

Documents


0 download

DESCRIPTION

Improving the prediction of RNA secondary structure by detecting and assessing conserved stems. Xiaoyong Fang, et al. Outline. Background Related work Methods Results Discussion. Background. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Improving the prediction of RNA secondary structure by detecting and assessing conserved stems

Xiaoyong Fang et al

Outline

Background Related work Methods Results Discussion

Background Non-coding RNAs gained increasing interest

since a huge variety of functions associated with them were found

The function of an RNA molecule is principally determined by its (secondary) structure

Unfortunately the current physical methods available for structure determination are time-consuming and expensive

cont The prediction of RNA secondary structure can

be facilitated by incorporating with comparative analysis of homologous sequences

However most of existing comparative methods are vulnerable to alignment errors and thus are of low accuracy in practical application

And the methods for simultaneous sequence alignment and structure prediction are too computationally taxing to be practical

Related work Zuker et al Mfold 1981 2003 Knudsen B and J Hein Pfold 1999 2003 Hofacker I M Fekete and P Stadler

RNAalifold 2002 Ruan J et al 2004 Knight R et al 2004 Weinberg Z and WL Ruzzo 2004

cont Sankoff D 1985 Hofacker I S Bernhart and P Stadler 2004 Mathews D and D Turner 2002 Mathews D 2005 helliphellipetc

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 2: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Outline

Background Related work Methods Results Discussion

Background Non-coding RNAs gained increasing interest

since a huge variety of functions associated with them were found

The function of an RNA molecule is principally determined by its (secondary) structure

Unfortunately the current physical methods available for structure determination are time-consuming and expensive

cont The prediction of RNA secondary structure can

be facilitated by incorporating with comparative analysis of homologous sequences

However most of existing comparative methods are vulnerable to alignment errors and thus are of low accuracy in practical application

And the methods for simultaneous sequence alignment and structure prediction are too computationally taxing to be practical

Related work Zuker et al Mfold 1981 2003 Knudsen B and J Hein Pfold 1999 2003 Hofacker I M Fekete and P Stadler

RNAalifold 2002 Ruan J et al 2004 Knight R et al 2004 Weinberg Z and WL Ruzzo 2004

cont Sankoff D 1985 Hofacker I S Bernhart and P Stadler 2004 Mathews D and D Turner 2002 Mathews D 2005 helliphellipetc

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 3: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Background Non-coding RNAs gained increasing interest

since a huge variety of functions associated with them were found

The function of an RNA molecule is principally determined by its (secondary) structure

Unfortunately the current physical methods available for structure determination are time-consuming and expensive

cont The prediction of RNA secondary structure can

be facilitated by incorporating with comparative analysis of homologous sequences

However most of existing comparative methods are vulnerable to alignment errors and thus are of low accuracy in practical application

And the methods for simultaneous sequence alignment and structure prediction are too computationally taxing to be practical

Related work Zuker et al Mfold 1981 2003 Knudsen B and J Hein Pfold 1999 2003 Hofacker I M Fekete and P Stadler

RNAalifold 2002 Ruan J et al 2004 Knight R et al 2004 Weinberg Z and WL Ruzzo 2004

cont Sankoff D 1985 Hofacker I S Bernhart and P Stadler 2004 Mathews D and D Turner 2002 Mathews D 2005 helliphellipetc

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 4: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

cont The prediction of RNA secondary structure can

be facilitated by incorporating with comparative analysis of homologous sequences

However most of existing comparative methods are vulnerable to alignment errors and thus are of low accuracy in practical application

And the methods for simultaneous sequence alignment and structure prediction are too computationally taxing to be practical

Related work Zuker et al Mfold 1981 2003 Knudsen B and J Hein Pfold 1999 2003 Hofacker I M Fekete and P Stadler

RNAalifold 2002 Ruan J et al 2004 Knight R et al 2004 Weinberg Z and WL Ruzzo 2004

cont Sankoff D 1985 Hofacker I S Bernhart and P Stadler 2004 Mathews D and D Turner 2002 Mathews D 2005 helliphellipetc

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 5: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Related work Zuker et al Mfold 1981 2003 Knudsen B and J Hein Pfold 1999 2003 Hofacker I M Fekete and P Stadler

RNAalifold 2002 Ruan J et al 2004 Knight R et al 2004 Weinberg Z and WL Ruzzo 2004

cont Sankoff D 1985 Hofacker I S Bernhart and P Stadler 2004 Mathews D and D Turner 2002 Mathews D 2005 helliphellipetc

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 6: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

cont Sankoff D 1985 Hofacker I S Bernhart and P Stadler 2004 Mathews D and D Turner 2002 Mathews D 2005 helliphellipetc

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 7: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Method 1) We detect possible stems in single RNA sequence

using the so-called position matrix with which some possibly paired positions can be uncovered

2) We detect conserved stems across multiple RNA sequences by multiplying the position matrices

3) We assess the conserved stems using the Signal-to-Noise

4) We compute the optimized secondary structure by incorporating the so-called reliable conserved stems with predictions by RNAalifold program

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 8: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Given an RNA sequence of length N Seq we build one NtimesN position matrix (denoted by MSeq) by

1) The reverse complement of Seq Seqrsquo is firstly computed from the original sequence

2) We build one NtimesN matrix containing Seqrsquo in the first row The i-th (0 le i le N-1) row contains the sequence generated from Seqrsquo by shifting i position to the left (circular left shift)

Detect possible stems in single RNA sequence

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 9: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

3) The position matrix MSeq is computed by comparing Seq with the matrix generated by 2) row by row 0 or 1 or -1 is assigned to the i j (0 le j le N-1) element of MSeq by comparing the j-th character of Seq with the i j (0 le j le N-1) element of the matrix of 2)

The following rules should be obeyed when two characters (b and brsquo) are compared with each other

1048698 If b equals brsquo and neither of them is lsquo_rsquo then 1 is assigned 1048698 If b or brsquo is lsquo_rsquo or both of them are lsquo_rsquo then -1 is assigned 1048698 If b does not equal to brsquo and neither of them is lsquo_rsquo then 0 is a

ssigned

Cont 1

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 10: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

We can detect all possible stems in an RNA sequence by scanning the position matrix row by row The key is to find all zones of continuous ldquo1rdquo in the matrix There is a one-to-one mapping between the stems in the sequence and the zones of continuous ldquo1rdquo in the position matrix

Cont 2

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 11: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Example

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 12: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

The multiplying of the position

matrices M [i j] = M1 [i j] times M2 [i j]

0times0 = 0 0times1 = 0 0times (-1) = 0 1times1 = 1 1times (-1) = 1 (-1)times (-1) = -1

M =

Detecting conserved stems across multiple sequences

n

iiM

1

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 13: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

We detect conserved stems shared by all sequences in an alignment by

1) We extract all sequences from the alignment and detect all possible stems in each sequence using the approach described above

2) We select n different stems from n sequences (one stem for one sequence) Here n is the number of sequences in the alignment

3) We build the position matrix for each selected stem using the approach described above and multiply these n matrices according to the rules mentioned above

Cont 1

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 14: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

4) We detect conserved stems by finding zones of continuous lsquo1rsquo in the resulting matrix (M) generated by 3) There is still a one-to-one mapping between the conserved stems and the zones of continuous lsquo1rsquo in M

5) We repeat the steps 2) to 4) until all the stems detected by 1) are selected

Cont 2

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 15: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Note Actually the multiplying of the position matrices can be

directly used to detect conserved stems in the RNA alignment But here we first extract all sequences from the RNA alignment and then detect conserved stems shared by all sequences The benefit for this is that some conserved stems possibly missed due to alignment errors also can be detected

Cont 3

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 16: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

ExampleFigure 2

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 17: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

The main steps are 1) We record the number of column pairs (see Figure 3) in

the conserved stem as the Signal

2) We generate a randomized alignment from the original alignment of the conserved stem

3) We detect possibly conserved stems in the new alignment using the approach mentioned above

Assessing the conserved stems using the Signal-to-Noise

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 18: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

4) We record the number of column pairs in the conserved stem newly detected by 3) as the Noise The Signal-to-Noise is thus computed by Signal Noise Note that we set Noise to be 1 if there are no conserved stems in the randomized alignment and we set Noise to be the maximum if there are more than one conserved stems

Cont

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 19: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

ExampleFigure 3

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 20: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

The main steps are 1) We compute a candidate secondary structure using

RNAalifold which is a popular and powerful program for predicting RNA secondary structure at present

2) We extract all sequences from the RNA alignment and detect all possibly conserved stems shared by them

3) We compute the Signal-to-Noise for each conserved stem

4) We remove incompatible conserved stems and determine the reliable conserved stems according to the Signal-to-Noise

Computing the optimized secondary structure

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 21: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Here reliable conserved stems refer to following two kind of conserved stems

The conserved stems with very high Signal-to-Noise They are thought to belong to a real RNA secondary structure in our method

The conserved stems with very low Signal-to-Noise They are not thought to belong to any real RNA secondary structure in our method

Cont 1

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 22: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

5) We add the base pairs which are included by the first kind of reliable conserved stems but not included by the prediction by RNAalifold into the candidate secondary structure

6) We remove the base pairs which are included by both the second kind of reliable conserved stems and the prediction by RNAalifold from the candidate secondary structure

7) We compute the final secondary structure by merging the results of 5) and 6)

Cont 2

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 23: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Results We construct one data set of three-sequence alignments one

data set of four-sequence alignments and one data set of five-sequence alignments on the basis of Rfam 80

We compute the sensitivity as the number of true positives divided by the sum of true positives + false negatives and the specificity as the number of true positives divided by the sum of true positives + false positives

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 24: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Cont 1

Table 1 - Sensitivity and specificity on data set of three-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7598 7886 7497 7784

50-60 5029 5806 4859 5130

60-70 7317 6989 7279 6552

70-80 5667 4988 5587 4434

80-90 6796 6616 6782 6141

90-100 6256 5461 5944 5180

Total 6444 6291 6325 5870

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 25: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Cont 2

Table 2 - Sensitivity and specificity on data set of four-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7516 8996 7460 8684

50-60 4865 5043 4348 4448

60-70 7488 5921 7456 5664

70-80 6501 5011 6190 4151

80-90 7318 6711 7151 6410

90-100 5028 5103 4870 4122

Total 6453 6312 6246 5580

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 26: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Cont 3

Table 3 - Sensitivity and specificity on data set of five-sequence alignmentsId percentage identity Se sensitivity Sp specificity

Id () Se () Sp () SeRNAalifold () SpRNAalifold ()

lt50 7256 8988 7253 8893

50-60 5101 5673 4920 4927

60-70 7691 7315 7634 7218

70-80 6499 4893 6209 3964

80-90 7176 6698 7171 6674

90-100 5135 5101 4732 4205

Total 6476 6445 6315 5980

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 27: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Discussion In this paper we present a stem-based

method to improve the prediction of RNA secondary structure The central idea of our method is to detect and assess conserved stems shared by all sequences in the input RNA alignment and then to compute the optimized secondary structure by incorporating reliable conserved stems with predictions by RNAalifold

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 28: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Cont 1 Our method improves RNAalifold by two major

means 1) to add some conserved stems which are possibly missed by RNAalifold for improving the sensitivity 2) to remove some conserved stems which are possibly mistaken by RNAalifold for improving the specificity

We tested our method on data sets of RNA

alignments taken from the Rfam database Experimental results suggest that our method can predict RNA secondary structure with much better performance than RNAalifold

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 29: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Cont 2 In future work our method can be improved by

1) To compute the Signal-to-Noise in more effective ways

2) To determine the reliable conserved stems by more intelligent approaches

3) To devise new parallel algorithms to speed up the method

4) Further benchmark for evaluating the performance of the method

Thank you very much

Please contact xyfangnudteducn for any

questions

Page 30: Improving the  prediction of RNA secondary structure by detecting and assessing conserved stems

Thank you very much

Please contact xyfangnudteducn for any

questions