
CLASSIFICATION OF CODING AND NON-CODING

RNA IN RNA-SEQ DATA

by

Hisanaga Mark Okada

B.Sc., Simon Fraser University, 2008

a Thesis submitted in partial fulfillment

of the requirements for the degree of

Master of Science

in the School

of

Computing Science

© Hisanaga Mark Okada 2011

SIMON FRASER UNIVERSITY

Spring 2011

All rights reserved. However, in accordance with the Copyright Act of

Canada, this work may be reproduced without authorization under the

conditions for Fair Dealing. Therefore, limited reproduction of this

work for the purposes of private study, research, criticism, review and

news reporting is likely to be in accordance with the law, particularly

if cited appropriately.

APPROVAL

Name: Hisanaga Mark Okada

Degree: Master of Science

Title of Thesis: Classification of coding and non-coding RNA in RNA-Seq

data

Examining Committee: Dr. Anoop Sarkar

Associate Professor, Computing Science

Simon Fraser University

Chair

Dr. Martin Ester

Professor, Computing Science


Senior Supervisor

Dr. Cenk Sahinalp



Supervisor

Dr. Kay Wiese



Examiner

Date Approved: February 28, 2011


Last revision: Spring 09

Declaration of Partial Copyright Licence

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.

It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

Simon Fraser University Library Burnaby, BC, Canada

Abstract

Recently, the coverage of non-protein-coding RNA in the scientific literature has expanded

dramatically. While the functions for many are unknown, strong interest in this aspect

of cellular biology is driving development of methods for detecting non-coding genes and

transcripts.

During the same period, RNA sequencing's high throughput and high spatial resolution

have established it as the preferred method for characterising transcriptomes. Many groups

are now sequencing transcriptomes. De novo transcriptome assembly methods are being

developed to address cases in which no reference genome is available.

We propose a methodology that is compatible with de novo transcriptome assembly,

that uses sequence, structural and genomic features to classify transcripts as non-coding vs.

protein-coding RNA, and to classify different non-coding RNA types. We have applied our

technique on a variety of known RNA sequences and have explored its use on contigs from

the Trans-ABySS assembly pipeline for RNA-Seq data from normal mouse tissues.


To family and friends


“As iron sharpens iron, so one man sharpens another”

— Proverbs 27:17


Acknowledgments

I wish to express my deepest gratitude to the many individuals whose support and assistance

made the work described in this thesis possible.

I thank my senior supervisor, Martin Ester, for giving me the academic and personal

guidance I needed. I am grateful for his patience and encouragement during the entire

length of my research. I wish to also thank the members of my committee for their invaluable

counsel. I wish to thank the members of the Data Mining Lab, and the Computing Science

Department at Simon Fraser University for providing the environment I needed to perform

this research. Thanks especially to Phuong Dao for his expertise in countless matters.

This work was possible because of our collaboration with the Michael Smith Genome

Sciences Centre (GSC). I thank Inanc Birol, Jacqueline Schein, Pamela Hoodless and especially

Gordon Robertson for providing me with so many opportunities, and for going above

and beyond their supervisory roles. I gratefully acknowledge the GSC's making available the

seven mouse transcriptome datasets generated in the Genome Canada MORGEN project. I

would like to acknowledge in particular: Sam Lee for generating the RNA reagents; Yongjun

Zhao, who manages the library construction teams; Nina Thiessen and An He, who apply the

GSC's production WTSS pipeline; and Shaun Jackman, Readman Chiu, Rong She, Jenny

Qian and Karen Mungall for de novo contig data from ABySS and Trans-ABySS.

This work was funded by the Canadian Institutes of Health Research / Michael Smith

Foundation for Health Research Bioinformatics Training Program. I am extremely grateful

that they have provided such a supportive community for bioinformatics research. I wish

to acknowledge in particular Marco Marra, Steve Jones and Sharon Ruschkowski.

Lastly, I wish to thank my family and friends for their unconditional love and support.

As chaotic as it seemed at times, they kept me grounded. Thanks to them, I will always

look back fondly at this time.


Contents

Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
    1.1 Significance of non-coding RNA classification
    1.2 Contributions
    1.3 How this thesis is organised

2 Biological background
    2.1 Second generation sequencing and transcriptomics
    2.2 Central dogma of molecular biology
    2.3 Non-coding RNA

3 Related work
    3.1 Discovery of non-coding RNAs
        3.1.1 Sequence based approaches
        3.1.2 Secondary structure based approaches
        3.1.3 Comparative Genomics based approaches
        3.1.4 Genome scanning / mapping approaches
    3.2 RNA databases

4 Classification
    4.1 Preprocessing reads
        4.1.1 Assembly
        4.1.2 Mapping to RNA database
    4.2 Feature extraction
        4.2.1 Sequence based features
        4.2.2 Secondary structure based features
        4.2.3 Genomic map based features
    4.3 Classification
        4.3.1 Support vector machines
        4.3.2 Performance evaluation
        4.3.3 Cross validation evaluation
        4.3.4 Full contig prediction
        4.3.5 Feature set ranking

5 Implementation
    5.1 Feature extraction
        5.1.1 Sequence based feature extraction
        5.1.2 Secondary structure feature extraction
        5.1.3 Genomic map based feature extraction
    5.2 Classification
        5.2.1 Support vector machine

6 Experimental results
    6.1 Coding and non-coding databases
        6.1.1 EMBL and Swissprot vs. non-coding
        6.1.2 Ensembl protein coding vs. non-coding
        6.1.3 Ensembl vs. fRNAdb
    6.2 The RNA-Seq dataset
        6.2.1 Contig preparation
        6.2.2 Transcriptome reads mapped to the genome
        6.2.3 Contig assembly and merging
        6.2.4 Contig to annotation mapping
        6.2.5 Contig cross validation
        6.2.6 Full contig set classification
    6.3 Feature ranking

7 Conclusion and future work
    7.1 Summary
    7.2 Future work

Bibliography

List of Tables

4.1 Features available from the prediction model. Sequence and secondary structure based features make up the de novo set of features. The concepts of the features are described in section 4.2, and the implementation in section 5.1.

4.2 Confusion matrix (or coincidence matrix) for a two-class classification problem. The correct predictions, true positive and true negative, are shaded while the erroneous predictions, false positives and false negatives, are not.

6.1 SSGC performance compared with PORTRAIT for the dataset composed of Swiss-prot and EMBL for the protein coding set, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.

6.2 SSGC performance compared with PORTRAIT for the dataset composed of Ensembl protein coding, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.

6.3 Binary classification performance between Ensembl protein coding and all fRNAdb non-coding sequences. The first row represents the experiment where all features are used. The second row represents the experiment where only the de novo features were used.

6.4 Pairwise classification performance between Ensembl protein coding elements vs. each RNA type found in fRNAdb. The first half represents the results where all features are used. The second half represents the results where only de novo features were used, thereby excluding genome mapped information such as the number of exons and cross-species conservation scores.

6.5 Pairwise classification performance using the complete feature set for fRNAdb non-coding RNA. Precision and recall are only shown for the second class.

6.6 Pairwise classification performance using de novo feature set for fRNAdb non-coding RNA, similar to Table 6.5. Precision and recall are only shown for the second class.

6.7 Confusion matrix for the multiclass classification using fRNAdb RNA types, using the entire feature set. The cells represent the number of predictions for each type, the shaded cells represent the number of true positives. Each RNA type is labelled from a to i, representing in order: fly-smallRNA, mat-miRNA, misc, piRNA, pre-miRNA, rRNA, snoRNA, snRNA and tRNA.

6.8 Six seven-lane RNA-Seq mouse libraries were examined.

6.9 Six seven-lane RNA-Seq libraries were assembled and merged to create the contig sets. These contigs were used as input for the classifier.

6.10 Classification performance using the contigs from the library MM0564, using the full feature set. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to.

6.11 Classification performance for the stratified contigs from library MM0564, using the full feature set. In comparison to Table 6.10, the number of elements in each class is equal. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to. Note that for thresholds at 1.0, there are not enough elements to perform classification.

6.12 Classification performance for the database sequences mapped by the unfiltered contig sets from MM0564; each classification is compared with PORTRAIT. The precision and recall are only shown for the non-coding class. We were not able to compare the classification accuracies for the actual contig sets themselves. Note the number of elements is lower for PORTRAIT due to the size restrictions for their input.

6.13 The top twenty ranked features based on classification effectiveness from the Ensembl and fRNAdb datasets. The first pair of columns lists the most effective features from binary class experiments, coding versus non-coding. The second pair of columns lists the features for the multiclass considering RNA types and proteins. The last pair of columns is from the multiclass using only RNA types. Both the complete feature set and the de novo feature sets are considered in each of the three experiment types.

6.14 Classification performance using, incrementally, the top twenty ranked features from the Ensembl and fRNAdb datasets, for the binary classifier. As more features are added, there is a steady rise in the accuracy, precision and recall. The full model containing all features has an accuracy of 96.3%, precision of 0.966, and recall of 0.976 as shown in Table 6.3.

List of Figures

1.1 A top level structure of our approach from the short read sequence down to the classification of RNA transcripts. We are interested both in using reads and contigs as part of the input and in the potential to classify different non-coding RNA families.

2.1 The Central Dogma of molecular biology. On the left are the typical transcription and translation steps for a given gene. The end product is a translated amino acid sequence that eventually forms a protein. On the right is the transcription of a non-coding RNA, the 3-D structure consisting of its secondary structure.¹

4.1 Overview of the contig assembly and labelling procedure. From short read transcriptome reads, contigs are assembled and merged. Contigs are mapped individually to protein coding and non-coding RNA datasets. Contigs inherit the labels of the database elements with the best matched mapping score, which must be above a set threshold. For each mapping score, there are two threshold values, one for the contig and one for the annotation. The labelled contigs are used as training and testing sequences for the classifier.

4.2 The classification approach starting from the sequence reads down to the testing of RNA transcripts. We propose a classifier that draws on three categories of features based on sequence, secondary structure, and genome mapped data, which we name the Sequence-Structure-Genome Classifier (SSGC). For de novo experiments, we only consider sequence and secondary structure based features.

4.3 Contig prediction procedure for the full contig set. A subset of contigs mapped to protein coding and non-coding sequences from Ensembl and fRNAdb, respectively, are used to train an SVM model. The SVM model is used to classify the entire contig set, predicting the class and p-value for each contig. The p-value allows the contigs to be ranked, from strongly protein coding (0) to non-coding (1).

6.1 Read coverage for Ensembl broken down to biotypes, for RNA-Seq reads from library MM0564. Each biotype is represented as an ECDF and as a distribution of log10 read coverage.

6.2 Empirical cumulative distribution function representing the read coverage for a select number of Ensembl biotypes mapped to the mm9 reference genome from Figure 6.1.

6.3 Number of unique contigs that map to the sequence annotation databases fRNAdb and Ensembl using a range of mapping thresholds for all six mouse libraries. (a) and (c) represent the filtered contig set mappings, (b) and (d) represent the unfiltered contig set mappings.

6.4 Ensembl transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual biotypes.

6.5 fRNAdb transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual RNA types.

6.6 The full MM0564 contig set is predicted by the SVM model, and the contigs are assigned probabilities. Contigs with p-values below 0.5 are classified as protein coding, while contigs with p-values above 0.5 are classified as non-coding. (a) is the class prediction for all contigs. (b) is the p-value distribution of all the contigs, (c) is the p-value of contigs with no alignments to any known non-coding transcripts. (d) is the p-value for all contigs 500bp and larger.

6.7 Mapping scores and sizes of contigs strongly predicted as protein coding (p-value ≤ 0.05) and non-coding (p-value ≥ 0.95). a,b) Distribution of mapping scores with the best-aligned a) protein-coding Ensembl sequence, b) non-coding fRNAdb sequence. c,d) Distribution of contig sizes (white). In (c), the red regions represent strongly protein coding (p-value ≤ 0.05) which do not map to any known sequences in Ensembl or fRNAdb. In (d), the orange regions represent strongly non-coding (p-value ≥ 0.95) which do not map to any known sequences.

6.8 Contig k50:177614 aligned in the mouse mm9 genome. The top track represents the multiple contigs that are mapped to this location. The second set of tracks are the pileups for the RNA-Seq read alignments for the six mouse transcriptome libraries. Below the contig track is the gene track and the conservation track. This contig has a p-value of 1.0 and does not map to any known non-coding or protein coding sequences.

6.9 Contig k29:3267973 aligned in the mouse mm9 genome. Similar to Figure 6.8, the tracks represent the assembled contigs, RNA-Seq read pileups, the contig, known gene annotations, and conservation. This contig has a p-value of 0 and does not map to any known non-coding or protein coding sequences.

6.10 Contig k29:3267973 (from Figure 6.9) represented in the human hg18 genome, using the LiftOver tool from the UCSC Genome Browser [62]. The tracks represent the contig coordinate (from the LiftOver), the contig BLAT alignment, known human gene models, histone modification tracks, and the conservation.

Chapter 1

Introduction

1.1 Significance of non-coding RNA classification

The Central Dogma of molecular biology states that the flow of genetic information is from

deoxyribonucleic acid (DNA) to ribonucleic acid (RNA), and from RNA to protein [1]. Although

exceptions to this rule were known, for example transfer RNA and ribosomal RNA, diverse

types of non-coding RNA, i.e. transcribed RNA elements that do not code for proteins [85],

are increasingly recognised as widespread and functionally important. Non-coding RNA may

help resolve what has puzzled many researchers since the initial discoveries in genomics:

the number of genes is not significantly different between species; both protein coding and

non-coding RNAs appear to be important in the variability between species [5].

The importance of the non-coding transcriptome has motivated work to develop methods

of identifying and classifying non-coding genes and transcripts. There are three approaches

in the literature. The first uses a set of computationally inexpensive features that can be

mined for patterns that distinguish sequences as coding or non-coding genes; many methods

use variations of clustering or a support vector machine (SVM). Widely used features include

GC content, contig length, open reading frame (ORF) length, stop codon quantity, nucleic

acid composition, amino acid composition, and protein complexity. These sequence-based

features are properties calculated from the sequence at hand. Second, because non-coding

RNA transcripts have functional secondary structures, a number of approaches assume that

the secondary structures predicted for a sequence can be used to calculate the likelihood

of the transcript being non-coding. Examples include RNApfold and RNAz, both of which

are a part of or rely on the Vienna RNA package [49, 50]. As structural features are


more computationally expensive, attempts have emerged that use sliding windows. Finally,

a number of approaches use genomically-mapped evidence like transcript and expressed

sequence tag (EST) alignments, chromatin profiles and evolutionarily conserved regions.

For this project, we used mapped RNA-Seq reads, mapped de novo contigs, chromatin

profiles, and conservation data.
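
As an illustration only, several of the sequence-based features named above (length, GC content, ORF length and stop codon quantity) can be computed directly from a contig sequence. The following Python sketch is not the SSGC implementation, which is described in Chapter 5; the function names, the forward-strand-only ORF scan and the ATG-to-stop definition of an ORF are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard DNA stop codons


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest ATG..stop ORF, stop codon included.

    Assumption for this sketch: only the three forward-strand frames are
    scanned; a fuller version would also scan the reverse complement.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def stop_codon_count(seq: str, frame: int = 0) -> int:
    """Number of stop codons read in the given frame."""
    seq = seq.upper()
    return sum(seq[i:i + 3] in STOP_CODONS
               for i in range(frame, len(seq) - 2, 3))


def sequence_features(seq: str) -> dict:
    """Collect the illustrative features into one record per contig."""
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "longest_orf": longest_orf_length(seq),
        "stop_codons_frame0": stop_codon_count(seq),
    }
```

Such a record could serve as one row of the feature matrix given to a classifier; features of this kind are inexpensive because each needs only a single pass over the sequence.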

RNA-Seq, based on second generation deep sequencing technologies, is an effective tool

for quantifying the expression levels of the transcriptome using short sequence reads orig-

inating from fragmented transcripts [132]. Although RNA-Seq has been primarily used to

detect transcripts of protein coding RNA, the technology has increasingly been applied

to detect non-coding RNAs [32, 67, 59].

For this thesis, we introduce the Sequence-Structure-Genome Classifier (SSGC). Using

SSGC, we investigate the transcript classification problem using short-read sequencing data.

Existing studies on non-coding RNAs using RNA-Seq have relied on mapping reads to a reference

genome; we investigate the classification problem using contigs from a non-reference

based approach, de novo transcriptome assembly. By introducing assembly to

non-coding RNA classification, we make it possible to work in de novo settings. In our investigation,

we also take into consideration the large sizes and noisy nature of these datasets.

We demonstrate the effectiveness of the various feature sets under an assortment of test

conditions.

These are the major steps in building and running SSGC:

1. Create contigs

- RNA-Seq assembly

2. Build classifier

- label contigs as protein coding or non-coding RNAs

- train SVM model using labelled contigs

3. Run contigs on classifier

- include genomically mapped evidence as attributes
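
The labelling part of step 2 can be sketched as follows. The caption of Figure 4.1 states that a contig inherits the label of its best-matching database element, provided the mapping score clears one threshold for the contig and one for the annotation. The record layout, the interpretation of the two scores as coverage fractions, the default threshold values and the identifiers in the usage note are all assumptions made for this illustration, not details taken from the SSGC software.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One contig-to-database alignment (hypothetical record layout)."""
    target_id: str
    label: str           # "coding" or "non-coding", taken from the database
    contig_score: float  # assumed: fraction of the contig that is aligned
    target_score: float  # assumed: fraction of the database sequence aligned


def label_contig(hits: List[Hit],
                 contig_threshold: float = 0.9,
                 target_threshold: float = 0.9) -> Optional[str]:
    """Return the label of the best hit that clears both thresholds.

    Contigs with no passing hit return None and would be left out of the
    training and testing sets.
    """
    passing = [h for h in hits
               if h.contig_score >= contig_threshold
               and h.target_score >= target_threshold]
    if not passing:
        return None
    # Assumption: "best" means highest contig score, ties broken by target score.
    best = max(passing, key=lambda h: (h.contig_score, h.target_score))
    return best.label
```

For example, a contig with hits `Hit("tx1", "coding", 0.95, 0.92)` and `Hit("nc1", "non-coding", 0.99, 0.50)` would be labelled "coding" under the default thresholds, because the second hit fails the annotation threshold.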


[Figure 1.1 diagram. Labels: Reads; Contigs; protein coding genes; non-coding RNAs: miRNA, pre-miRNA, tRNA, piRNA, lincRNA, lncRNA, rRNA, snoRNA.]

Figure 1.1: A top level structure of our approach from the short read sequence down to the classification of RNA transcripts. We are interested both in using reads and contigs as part of the input and in the potential to classify different non-coding RNA families.

1.2 Contributions

In this thesis, we extend existing work in which transcript sequences from public databases

were classified into two groups, i.e. protein coding vs non-coding. First, we extend the

classification to discriminate between non-coding RNA families (Figure 1.1). Then, we

apply the classifier to RNA-Seq data, and to de novo transcriptome assembly that uses such

short-read data to generate contigs [110]. De novo assembly can be used with non-model

species for which a reference genome is not available, and can detect chimeric transcripts

that are not represented by a reference genome's gene models but can be important in disease

(ref). We show that non-coding RNA family types can be identified in RNA-Seq data, and in

de novo transcriptome contigs. We outline potential constraints, related to expression level

and sequencing depth, in comprehensively characterising non-coding RNA in sequence data.

The software developed for this thesis is available for use with high-throughput RNA-Seq

and de novo transcriptome assembly pipelines.


1.3 How this thesis is organised

The first three chapters give background material: Chapter 2 briefly summarises the bio-

logical concepts, and Chapter 3 summarises published work related to the thesis. The next

two chapters describe the classifier: Chapter 4 explains concepts, and Chapter 5 provides

details on the tools and methods used. Chapter 6 explains the results of using the classi-

fier on database sequences and de novo transcriptome contigs from real biological samples.

Chapter 7 concludes with final remarks and possible future directions.

Chapter 2

Biological background

Bioinformatics is an interdisciplinary field, and a wide variety of topics are covered in this

thesis. This chapter acts as a primer on the biological terms and concepts that are used in

this thesis.

2.1 Second generation sequencing and transcriptomics

DNA sequencing has existed since the beginning of molecular biology. The Sanger method [113]

is the well known and revolutionary first generation technology based on dideoxy chain

termination; first generation technology has been used to obtain sequence lengths on the order

of several hundred base-pairs. Second generation sequencing technologies emerged decades

later, towards the end of the first human genome project. The dominant platforms, Illumina,

Roche 454, and ABI SOLiD, have high throughput but generate shorter sequence

reads [86].

Transcription is the synthesis of RNA from ribonucleotides by a polymerase, using a DNA

sequence as the template. Transcriptome studies have been an important part of molecular

biology and bioinformatics research as expressed RNA is often a precursor for protein

synthesis [1]. RNA-Seq, or whole transcriptome shotgun sequencing, is a recently developed

method that uses second generation sequencing on a transcriptome to survey the RNA

expression landscape [92, 91]. RNA-Seq is performed by capturing RNA transcripts by their

poly-A tail, converting the RNA sequence to double stranded DNA using reverse transcriptase,

and fragmenting and sequencing the product using second generation technology [132]. RNA-Seq has been

shown to be effective in profiling the expression level of transcripts [132, 81, 4, 92], as well


as identifying novel transcription events [110, 41, 126, 40].

2.2 Central dogma of molecular biology

Molecular biology is the study of the formation, organisation and activity of macromolecules

essential to life [56]. This is encapsulated by the Central Dogma, which states that the

flow of genetic information in cells is from DNA to RNA to protein [1]. For a given gene, this

can be broken down into two steps: transcription and translation (Figure 2.1). Transcription

is the process of synthesising a chain of ribonucleotides from the sequence of a DNA

template. The resulting chain, or transcript, is known as the messenger

RNA (mRNA).

Translation is the process of synthesising amino acid polymers by reading the open

reading frame (ORF) found within the transcript sequence. The ORF of a transcript is

the segment of the transcript that is used to encode the amino acid sequence. It is the

chemical properties of the amino acid, or peptide, sequence that give it its structure and

function. The regions outside the ORF of a transcript are called the untranslated regions

(UTRs). Transcripts, like all DNA and RNA molecules, have a direction of synthesis.

A transcript begins at the 5′ end and terminates at the 3′ end. After transcription from

the DNA template, transcripts are given a 5′ cap containing

a modified guanine nucleotide and a poly-adenylation (poly-A) tail on the 3′ end consisting

of a long run of adenosine residues [1].
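
The two steps described in this section can be illustrated with a short Python sketch. It is a deliberate simplification: the input is assumed to be the DNA coding strand, the ORF is assumed to begin at the first AUG, and splicing, the 5′ cap and the poly-A tail are ignored. The function names are chosen for this example; the codon table is the standard genetic code written in the conventional U, C, A, G order.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ("*" marks a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence: the coding strand with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate the ORF starting at the first AUG, stopping at a stop codon.

    Returns an empty string when no start codon is found. Codons containing
    characters other than A, C, G or U are reported as "X".
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break  # stop codon: end of the ORF
        peptide.append(amino_acid)
    return "".join(peptide)
```

In this sketch the bases before the first AUG and after the stop codon play the role of the 5′ and 3′ UTRs: they are transcribed but do not contribute to the peptide.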

2.3 Non-coding RNA

Despite the fundamental significance of the Central Dogma, we have come to realise important

exceptions to this principle. Of the dry weight of RNA extracted from a cell, only

3-5% consists of mRNA, similar to the proportion of genes that make up the genome [1]. In

contrast, as much as 62% of the mouse genome [125], 85% of the fruit fly genome [80], and

93% of the human genome [8] has been estimated to be transcribed.

Non-protein coding, or non-coding RNAs, are RNA products that are not translated to

proteins after transcription (Figure 2.1). Recently there has been an explosion of research on micro-RNA

(miRNA), their critical roles in gene regulation [85, 97], and their implications

for tumorigenesis [111, 13, 84, 131]. miRNA, along with other small RNAs, was once named

the breakthrough of the year by Science magazine [23]. Overall, there are a number of

non-coding RNA types such as those involved in the translation process, ribosomal RNA (rRNA)

and transfer RNA (tRNA); small non-coding RNAs such as micro RNA (miRNA), small

interfering RNA (siRNA), small temporal RNA (stRNA), small nuclear RNA (snRNA),

small nucleolar RNA (snoRNA), piwi-interacting RNA (piRNA); and the more elusive long non-coding RNA (lncRNA), which includes long intergenic non-coding RNA (lincRNA).

ORF features

Protein coding mRNAs have characteristics that are well defined, as explained earlier. ORFs are mostly thought to be unique to protein coding genes. There are exceptions to this concept, as bifunctional RNAs have been documented to have functioning ORFs [25, 2].

There are, however, controversies surrounding non-coding RNAs, as the function of many annotated non-coding RNAs is not known. Of the transcript products found in the FANTOM database [125], there are reports that many of the transcripts are the result of undegraded protein coding mRNA, undegraded introns, internal priming, or putative protein coding genes, and that some have low conservation across species [95]. There have also been reports where large deletions in gene deserts associated with non-coding DNA had no effect on mice [93]. Recently, comparisons of newer RNA-Seq methods with potentially noisier microarrays have shown that non-coding RNAs may not be transcribed as widely as once thought [91].


[Figure 2.1 diagram. Labels: genome, exons, introns, pre-mRNA, mRNA (5′ cap, 5′ UTR, ORF, 3′ UTR, poly-A tail), peptide sequence, protein, non-coding RNA, folded non-coding RNA. Arrows: transcription, translation.]

Figure 2.1: The Central Dogma of molecular biology. On the left are the typical transcription and translation steps for a given gene. The end product is a translated amino acid sequence that eventually forms a protein. On the right is the transcription of a non-coding RNA, the 3-D structure consisting of its secondary structure.¹

———————————

¹ 3-D images from PDB (http://www.pdb.org/) and EBI (http://www.ebi.ac.uk/)

Chapter 3

Related work

Many non-coding RNAs have been known for decades [27], though it is only recently that various computational methods to detect these entities have started to emerge. Using various

methodologies, many attempts have been made to classify, find, validate and store non-

coding RNAs. In this chapter, we summarise these methodologies.

3.1 Discovery of non-coding RNAs

In this section, we review strategies in the literature that find non-coding RNAs by cate-

gorising the methods into groups based on sequence, structure, comparative genomics, and

scanning methods.

3.1.1 Sequence based approaches

Sequence based methods classify entities as non-coding RNAs or protein coding RNA by

using the primary nucleotide sequence as input. The literature shows that many biologically

relevant features can be extracted from the sequence such as GC content, sequence motifs,

and nucleotide usage. The extracted features can be converted to numerical values that can

be fed into a machine learning model.

CRITICA [6] uses two types of features: comparative genomics features that use DNA

alignment from a DNA database (refer to section 3.1.3), and sequence based features that

compute distributions of hexanucleotides in coding frames and take into account dicodon

biases. DIANA-EST [45] uses artificial neural networks to find coding regions from ESTs.


ESTSCAN [76] also finds the coding regions of ESTs using a Hidden Markov Model. PORTRAIT [3] and SOM-PORTRAIT [119] both extract sequence and ORF-related features and perform classification using support vector machines and artificial neural networks. CONC [74] and CPC [64] use a large collection of simple features such as length, amino

acid composition, GC content, nucleotide identity, 3-periodicity, and simple thermodynam-

ics, to feed into a machine learning method to perform the classification; a large source of

their information does come from comparative methods using BLASTX. Creanza et al. [24]

and Re et al. [104] also use a large collection of features to perform classification, the most

effective feature reportedly being synonymous nucleotide substitutions. Clamp et al. [18],

Li et al. [72], Jia et al. [58], and Wu et al. [137] use methods to extract the open reading

frame of transcripts. Siederdissen et al. [117] use covariance models using only sequence

information to distinguish between many non-coding RNA families.

3.1.2 Secondary structure based approaches

Secondary structure based classifiers assume that functional non-coding RNAs have secondary structures that can be fully or partially predicted and used to extract properties that distinguish non-coding RNA from other elements. These properties can include stem loop related features such as prevalence, size and GC content [94, 122], while other strategies

estimate fold energies in both global and local contexts. Also, despite the fact that 3′ UTRs

of mRNAs also contain secondary structure [25], a number of secondary structure based

methods have been shown to have reliable rates of success. Another major consideration is

that secondary structure prediction is computationally expensive, forcing workarounds such

as local secondary structure input. These methods perform a scan of the input sequences

and for every window calculate the local secondary structure and consequent attributes.

Xue et al. [139] and Noel et al. [94] use a method of extracting local features within the largest stem loop to classify real and pseudo miRNA precursors. The miRanalyzer web tool [42] scans the genome using the local secondary structure prediction program RNAfold [51] and, for every window, extracts features strongly related to folding and loop energy such as length, stem length, MFE, and GC content. Classification is done using the random forest scheme found in the WEKA package [43]. Langenberger et al. [67] scan for RNA folds in a sliding window along mapped reads. Horesh et al. [52] also use a sliding window along a genome to find locally stable RNA structures and investigate dinucleotide biases that have an effect on the minimal free energies. Childs et al. [16] build a classifier to infer functionality based on a system where each molecule of an RNA structure is represented as a graph. miRTRAP [47] assesses features derived from loops of miRNA to identify miRNAs from high throughput sequencing data.

3.1.3 Comparative Genomics based approaches

Another common method of finding non-coding RNA is to use information from several

sources such as alignment data from related species. This method is known as comparative

genomics. These methods are especially useful when genomic and transcriptomic information from related species is known. Many approaches use a combination of existing tools such as ClustalW [68], consensus structure prediction, sequence alignment properties [28] and aligned

structure analysis [130, 33, 133, 24].

RNAz [133] was one of the first major methods to predict functional non-coding RNA by using a combination of sequence alignments, secondary structure and SVM classification. Dynalign [128] detects non-coding RNAs by predicting secondary structures and thermal energy for multiple aligned RNAs using a combination of methods including RNAz [133] and QRNA [107]. Mignone et al. [87] compare the genomes of human and mouse to find conserved sequences and evaluate protein coding potential using the notion of conserved sequence tags (CSTs) to produce blocks of BLAST-like high scoring pairs. Voß et al. [130] predict non-coding RNAs by using the alignment tool ClustalW [68] and the consensus structure prediction tool RNAlishapes [129]. Weinberg et al. [134] have uncovered non-coding RNA by using a number of structure and motif based methods such as CMfinder [140]. CentroidFold [114] is a web server for RNA secondary structure prediction that takes an RNA sequence along with its alignment as input. Mathelier et al. [83] find miRNA using five parameters that are heavily influenced by fold properties and energies. Tseng et al. [127] use genome scale BLAST searches that combine secondary structure and primary sequences by using folded-BLAST in intergenic regions.

3.1.4 Genome scanning / mapping approaches

The last category we investigate consists of methods that find non-coding RNA by incorporating genome scanning to identify new RNAs. These methods use the genomic sequence

as the primary input and use subtle clues to pinpoint locations of possible non-coding RNAs.

Although these are not directly part of this thesis, their goals and strategies are insightful


for our purposes. This category includes strategies that observe motifs and read alignments

from transcriptomes.

Hiller et al. [48] scan the genome for conserved introns to find novel transcripts, focusing especially on the set of mRNA-like non-coding RNAs. Salari et al. [112] employ a method of scanning k-mer motifs along a reference genome. Erhard et al. [30] and Chol et al. [59] both use mapped reads from transcriptome experiments and mainly use their position and size to find and classify non-coding RNA on the genome. Hofacker et al. [50] use local RNA folding on a genome wide scale to discover potential RNA structures.

3.2 RNA databases

In response to the expanding set of non-coding RNAs discovered, a number of databases

have emerged to accommodate their unique characteristics. Many cater to specific types

while others are more inclusive.

Although technically a transcriptome database, FANTOM [125] is known to house many

known and unknown EST sequences including non-coding RNAs. RNAdb [99], fRNAdb [63],

NONCODE [46], and RFam [36] are databases that have their own set of classifications or

family types, and all have a user interface available publicly on their servers. RFam [36] is a database of published non-coding RNAs that uses various tools, from covariance models to WU-BLAST, to categorise entries into its extensive set of families. RNAdb [99] is a database that specifically applies to mammalian non-coding RNAs, combining several sources. fRNAdb [63] is a database that aims to categorise functional RNA candidates and includes tools to analyse structure motifs and evaluate EST support. NONCODE [46] examines a number of non-coding RNA family types (excluding tRNAs and rRNAs) and categorises these non-coding RNAs into nine biologically related categories.

The following are databases that are specific to a special niche. miRbase [38] is a

database specifically for miRNAs and lists detailed information on both pre and mature

miRNA structures along with a target prediction pipeline. piRNABank [66] is a database

specifically for PIWI interacting RNAs. Sno/scaRNAbase [138] is a curated database for

nucleolar RNAs and cajal body-specific RNAs. NRED [25] is a database containing only long

non-coding RNAs 200 nucleotides or larger taken from microarray and in situ hybridisation

experiments for the mouse and human. ncRNAimprint [141] is a database of mammalian

non-coding RNAs that are imprinted. lncRNAdb [2] is a database for long non-coding RNAs that have biological functions in eukaryotic cells and viruses, which include functional mRNAs.

Chapter 4

Classification

The goal of this thesis is to create a practical, accurate and reliable classifier that can

distinguish different classes of transcript sequences from noisy data in real biological settings.

In particular we classify protein coding from non-protein coding RNA, in data derived from

RNA-Seq experiments, i.e. from short sequence reads. Using de novo assembly we generate transcript contigs that represent the transcriptional landscape.

This chapter describes the concepts of the various aspects of our classifier, SSGC, which

aims to fulfil these goals. Section 4.1 describes concepts of the RNA-Seq reads and their

pre-processing. Section 4.2 describes the features used to classify input sequences. Section

4.3 describes the concepts of the classification and how its performance can be assessed.

4.1 Preprocessing reads

The output of the RNA-Seq procedure consists of very short fragments of RNA sequences.

As we are interested in working with long sequences that depict transcripts, we utilise the

process of assembly to build contig sequences.

4.1.1 Assembly

Assembly is a process in which contiguous sequences, or contigs, are created by piecing

together smaller sequences. ABySS [120] is a popular assembler program as it has been

successfully demonstrated on transcriptome sequencing [9]. ABySS is based on the de Bruijn

graph model, first introduced by Pevzner et al. [100]. This method fits into the category of


de novo assemblers, i.e. one that uses only the short read sequence information, without

any external data source such as the reference sequence.

De Bruijn graphs using short read sequences rely on a given value k, such that sequencing

reads are chopped up into k-mers, or k length subsequences. Each k-mer is represented in

the graph as a node, directed edges represent k − 1 overlaps between adjacent k-mers, and

the paths traversed along edges represent contiguous sequences or contigs assembled from

sequenced reads. One of the challenges with de Bruijn based assemblers is that, depending on the coverage and the value of k, they can produce a high number of fragmented or non-contiguous contigs [9], though some fragmentation is unavoidable due to repeats and low

coverage. It is also unclear if assembly is the sole cause of fragmentation as it can also

be argued that cDNAs such as those found in the FANTOM database are also fragmented

versions of longer transcripts [35].
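The graph construction described above can be illustrated with a minimal sketch. This is not the ABySS implementation; the function names, the toy reads, and the simple walk (which follows a path only while each node has exactly one successor and ignores in-degree, reverse complements and coverage) are all assumptions made for illustration.

```python
from collections import defaultdict

def kmers(read, k):
    """Return all k-length subsequences of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that follow it with a k-1 overlap."""
    graph = defaultdict(set)
    for read in reads:
        km = kmers(read, k)
        for a, b in zip(km, km[1:]):
            graph[a].add(b)          # consecutive k-mers overlap by k-1 bases
        for node in km:
            graph.setdefault(node, set())   # keep nodes that have no successor
    return dict(graph)

def extend_unambiguous(graph, start):
    """Walk from start while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:              # stop on cycles (e.g. repeats)
            break
        contig += nxt[-1]            # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig
```

With the two toy reads `ACGTAC` and `GTACGG` and k = 4, the walk starting at `ACGT` joins the reads through their shared k-mer `GTAC`. A branch in the graph (a node with two successors) ends the walk, which is one way fragmentation arises.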

To reduce the amount of fragmented short contigs, a merging technique has been shown

to be successful [110]. This technique assembles a large set of contigs using multiple values of k, then removes every contig that is a perfect subsequence of another contig. This procedure is also accompanied by a filtering step to further

reduce the number of small contigs.
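A minimal sketch of this merging step follows, assuming each assembly is given as a list of contig strings. The function name and the `min_length` filter parameter are hypothetical, and the sketch ignores reverse-complement containment, which a real pipeline would likely need to handle.

```python
def merge_contig_sets(contig_sets, min_length=0):
    """Pool contigs from several assemblies and drop any contig contained in another."""
    # Longest first, so that any container is examined before its substrings.
    pooled = sorted({c for contigs in contig_sets for c in contigs},
                    key=lambda c: (-len(c), c))
    kept = []
    for contig in pooled:
        if len(contig) < min_length:
            continue                               # filtering step for small contigs
        if any(contig in longer for longer in kept):
            continue                               # perfect subsequence of a kept contig
        kept.append(contig)
    return kept
```

The pairwise containment check is quadratic in the number of contigs; for large assemblies an index over the kept contigs would be a more practical choice.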

4.1.2 Mapping to RNA database

Our approach is not only to run but also to train the classifier using contigs, so contigs must be assigned a label from the class definitions. After assembly, contig sequences are mapped to protein coding and non-coding RNA databases. Based on the mapping criteria and threshold set, subsets of contigs inherit the labels of the elements in the databases (Figure 4.1). In the case of multiple mappings, contigs are assigned labels in a greedy manner, based on mapping score. The resulting set of labelled contigs is used to train and test the classifier.

To assess the performance of the classifier on contig sequences, we first create class labels for each contig sequence. This is done by mapping each contig sequence to known protein coding and non-coding sequences based on mapping scores. This is performed by using the BLAT aligner [61] between the annotated database entries and the contig set. For each contig-annotation pair, we can choose to accept or reject the pairing by comparing the BLAT alignment parameters $bat_c$ and $bat_a$, for contig and annotation respectively, to threshold values. The parameters are calculated as $bat_c = numbasesmatch / length_{contig}$ and $bat_a = numbasesmatch / length_{annotation}$. To find the best annotation mapping for a given contig, we choose the annotation with the highest score, calculated as $score = bat_c + bat_a$. The procedure of assigning contigs to annotations consists of the following steps: set a threshold between 0 and 1; calculate the score for each contig and annotation pair with each $bat$ term above the threshold; from the highest to the lowest score, label the contig as the annotation and remove all future instances of the contig and annotation from consideration.

[Figure 4.1 diagram. Labels: RNA-Seq reads; assemble & merge; contigs; map; protein coding mRNA database; non-coding RNA database. Example mapping score pairs: 0.85 ; 0.83, 0.88 ; 0.87, 0.71 ; 0.70, 0.79 ; 0.77, 0.93 ; 0.95, 0.84 ; 0.91.]

Figure 4.1: Overview of the contig assembly and labelling procedure. From short transcriptome reads, contigs are assembled and merged. Contigs are mapped individually to protein coding and non-coding RNA datasets. Contigs inherit the labels of the database elements with the best matched mapping score, which must be above a set threshold. For each mapping score, there are two threshold values, one for the contig and one for the annotation. The labelled contigs are used as training and testing sequences for the classifier.
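The greedy assignment steps can be sketched as follows. The input tuple layout and the function name are assumptions for illustration; in practice the matched base counts and lengths would be read from the aligner output, and the strict "above the threshold" comparison follows the wording of the procedure.

```python
def assign_labels(hits, threshold):
    """Greedily label contigs from alignment hits.

    hits: iterable of (contig_id, annotation_id, label,
                       bases_matched, contig_length, annotation_length).
    Returns a dict mapping contig_id to the inherited label.
    """
    scored = []
    for contig, annot, label, matched, contig_len, annot_len in hits:
        bat_c = matched / contig_len
        bat_a = matched / annot_len
        if bat_c > threshold and bat_a > threshold:   # both terms must pass
            scored.append((bat_c + bat_a, contig, annot, label))
    scored.sort(key=lambda s: s[0], reverse=True)     # highest score first

    labels, used_annotations = {}, set()
    for _score, contig, annot, label in scored:
        if contig in labels or annot in used_annotations:
            continue          # contig or annotation already consumed by a better pair
        labels[contig] = label
        used_annotations.add(annot)
    return labels
```

Because each annotation is consumed once, a contig whose best hit was taken by a higher-scoring contig falls back to its next-best passing annotation, if one exists.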

4.2 Feature extraction

Given a set of sequences, the classifier attempts to separate the set into classes, whether these are protein coding and non-coding, or non-coding RNA family types. This is done by extracting features, or properties obtained from the sequence. This section describes the features used by the classifier. The features are categorised as sequence based features, structure based features, and genomic map based features, represented in Figure 4.2 and

further expanded in Table 4.1. The following sections describe the features at a conceptual

level, and section 5.1 provides further details on the implementation.

4.2.1 Sequence based features

Various methods found in the literature have explored features directly computed from the sequence itself. The functional units of proteins are peptides folded in a three dimensional manner, while the functional unit of many non-coding RNAs is their secondary structure. The selection pressures on the functional units are responsible for many features that are embedded in the sequence information of coding and non-coding RNA transcripts [117]. This section explains the methods involved in extracting sequence based features from a given sequence.

Nucleotide usage

From the four nucleotides that make up the alphabet used in RNA, there are reports of

certain biases in the nucleotide composition of certain transcript types. One way to measure

the composition is to compare the distribution of unigrams, bigrams, and trigrams for the

entire length of the transcript. This by itself yields 84 features, one for each possible word: 64 possible trigrams, 16 possible bigrams, and 4 possible unigrams. An


[Figure 4.2 diagram. Labels: Reads; Contigs (training); Contigs (testing); Reference genome; SVM; Model. Feature groups: Sequence (GC, Length, ORF, Nuc. comp., …); Secondary structure (Loop length, Bulges, Stem length, Loop GC, …); Genome mapped (Exon coverage, Conservation, Chromatin, …).]

Figure 4.2: The classification approach starting from the sequence reads down to the testing of RNA transcripts. We propose a classifier that draws on three categories of features based on sequence, secondary structure, and genome mapped data, which we name the Sequence-Structure-Genome Classifier (SSGC). For de novo experiments, we only consider sequence and secondary structure based features.


Category             Feature name                      Number of features

Sequence             GC content                        1
                     Length                            1
                     Nuc. composition (1,2,3-mers)     84

Sequence - ORF       ORF size                          1
                     framefinder                       6
                     Comp entropy                      1
                     Isoelectric point                 1
                     Mean hydropathy                   1
                     a.a. composition                  20

Secondary            Total MFE                         1
                     Best MFE window                   1
                     Min. stem energy                  1
                     Stem length                       1
                     Stem GC                           1
                     Stem loop GC                      1
                     Stem bulge asym                   1
                     Stem bulge sym                    1
                     Stem bulge total                  1
                     Stem max bulge                    1
                     Triplet-SVM feats                 32

Genomic              Num exons                         1

Genomic - Conserv    Exons conserved                   1
                     Total score                       1
                     Bases conserved                   1
                     Bases conserved with coverage     1
                     Mean coverage                     1

Genomic - Histone    Exons conserved                   1
                     Total score                       1
                     Bases conserved                   1
                     Bases conserved with coverage     1
                     Mean coverage                     1

Total                                                  169

Table 4.1: Features available from the prediction model. Sequence and secondary structure based features make up the de novo set of features. The concepts of the features are described in section 4.2, and the implementation in section 5.1.


alternative is to compute a single feature, GC content (essentially the two bins C and G merged and divided by the total number of nucleotides), which has been used in the past to distinguish coding from non-coding transcripts [67, 104]. These methods exploit the tendency of protein coding GC content to be approximately 50%, which is statistically distinct from that of intergenic sequences [79, 24].
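As a minimal sketch of these two feature types, the following computes the 84 word frequencies and the GC content of a sequence. Normalising each count by the number of windows of that word length is an assumption; the implementation in section 5.1 may scale the counts differently.

```python
from itertools import product

ALPHABET = "ACGU"
# 4 unigrams + 16 bigrams + 64 trigrams = 84 words
WORDS = ["".join(p) for n in (1, 2, 3) for p in product(ALPHABET, repeat=n)]

def nucleotide_composition(seq):
    """Relative frequency of every 1-, 2- and 3-mer over the whole sequence."""
    seq = seq.upper().replace("T", "U")       # accept DNA or RNA alphabets
    features = []
    for word in WORDS:
        n = len(word)
        windows = len(seq) - n + 1
        if windows <= 0:
            features.append(0.0)
            continue
        count = sum(1 for i in range(windows) if seq[i:i + n] == word)
        features.append(count / windows)
    return features

def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
```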

Length

Among the non-coding RNA families, two classes, tRNAs and miRNAs, stand out as they have a well defined structure and length [1]. As such, mining for these particular non-coding RNAs in a large dataset has been shown to be possible by restricting the length of the transcript and/or the secondary structure [67, 47]. Non-coding RNAs can vary greatly in length: transcripts smaller than 200 nucleotides are often associated with microRNAs, PIWI-associated RNAs, and endogenous small interfering RNAs [25], whereas RNAs in the long non-coding RNA class have transcripts of the same order of magnitude in length as protein coding genes, with some transcripts as large as a hundred kilobases [99].

ORF features

Protein coding mRNAs have characteristics that are well defined: they have a 5′ cap, 5′ and 3′ untranslated regions, an open reading frame and a polyadenylated tail [1] (see Figure 2.1). The portion of RNA that is translated to a peptide sequence is called the open reading frame (ORF), and this is mostly thought to be unique to protein coding genes; exceptions to this rule are bifunctional RNAs, which are documented to have functioning ORFs [25, 2].

A crude way to detect ORFs within a transcript sequence is to search the 6-frame translations for the longest ORF, that is, the longest segment that begins with a start codon and ends with a stop codon. There are better and more robust methods, such as those proposed by Slater et al. [121] and Shimizu et al. [116], which use machine learning methods that take into account erroneous input sequences and frameshifts.
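The crude approach can be sketched as a scan over the three forward frames and the three frames of the reverse complement. This sketch assumes the standard start codon ATG and the three standard stop codons, and it is not the robust method of the cited tools; the function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_orf(seq):
    """Return the longest ATG...stop segment (stop included) over six frames."""
    seq = seq.upper().replace("U", "T")
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                       # first start codon in this stretch
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start > len(best):
                        best = strand[start:i + 3]
                    start = None                    # look for the next ORF in this frame
    return best
```

An ORF that runs off the end of a contig without a stop codon is ignored here, which is one reason fragmented contigs call for the more tolerant methods mentioned above.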

Once an ORF is predicted, we can investigate the protein coding biases such as the

log-odds score, compositional entropy, the amino acid composition, isoelectric point, and

mean hydropathy. A drawback, however, is that if a protein coding gene’s ORF is mis-predicted, these features will likely yield poor results.


The amino acid composition is the makeup of amino acids used for the peptide sequence; it can be measured as a histogram of amino acid unigrams. This can be a crude measure to distinguish a real peptide from the assumed random peptide sequence expected from a non-coding RNA. The log-odds score is an effective and often used measure of the likelihood that a given sequence is not from a random source. It makes use of the fact that, of the 64 possible codon triplets, there are heavy biases in the usage found in nature. By measuring the in-frame nucleotide usage, the log-odds score gives a measure of the quality of the sequence [137].
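The exact scoring formula is not spelled out here. One common formulation, assumed for this sketch, sums over the in-frame codons the log ratio of each codon's frequency under a coding model to its frequency under a background model. The add-one smoothing, the function names and the toy training data are all illustrative assumptions.

```python
import math
from collections import Counter

CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codon_frequencies(orfs, pseudocount=1.0):
    """Estimate in-frame codon frequencies from training ORFs, with smoothing."""
    counts = Counter()
    for orf in orfs:
        counts.update(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def log_odds_score(orf, coding_freq, background_freq):
    """Sum of log(P_coding(codon) / P_background(codon)) over in-frame codons."""
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in coding_freq:          # skip codons with ambiguous bases
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score
```

A positive score indicates codon usage closer to the coding model than to the background; a uniform background of 1/64 per codon is the simplest choice when no non-coding training set is available.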

Compositional entropy describes the degree to which low-complexity regions occur in the peptide sequence of the ORF. Low complexity regions are repetitive or homopolymeric stretches, for example of Ser, Asn, Gln, Asp, Glu and Thr residues [37], found in peptide sequences that form nonglobular domains. Although their function is not known, this is a well documented trait found in many protein coding genes [101].
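The amino acid composition histogram and a simple entropy measure can be sketched together. Using the Shannon entropy of the whole-peptide composition is an assumption made here for illustration; the thesis implementation may compute compositional entropy over local windows or with a different formula.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def amino_acid_composition(peptide):
    """Relative frequency of each of the 20 amino acids (a 20-bin histogram)."""
    counts = Counter(peptide.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]

def compositional_entropy(peptide):
    """Shannon entropy (bits) of the composition; low values mean low complexity."""
    return -sum(p * math.log2(p)
                for p in amino_acid_composition(peptide) if p > 0)
```

A homopolymeric peptide has an entropy of zero, while a peptide using all 20 residues equally reaches the maximum of log2(20), roughly 4.32 bits.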

The isoelectric point of a protein is the pH at which it has no net charge. By examining the amino acid side chains of a peptide, the buffering characteristics can be determined at different pH levels. Since living systems have very narrow ranges of pH, it is expected that peptide sequences would also have a narrow range of isoelectric points to be useful in a living organism [1].
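One way to estimate the isoelectric point, sketched below, is to compute the net charge from the Henderson-Hasselbalch relation for each ionisable group and bisect on pH until the charge crosses zero. The pKa values are illustrative assumptions (published tables differ between sources), and the function names are hypothetical.

```python
# Illustrative pKa values (assumption; published tables vary).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6

def net_charge(peptide, ph):
    """Net charge of a peptide at a given pH."""
    charge = 1 / (1 + 10 ** (ph - PKA_N_TERM))        # protonated N-terminus
    charge -= 1 / (1 + 10 ** (PKA_C_TERM - ph))       # deprotonated C-terminus
    for aa in peptide.upper():
        if aa in PKA_POSITIVE:
            charge += 1 / (1 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1 / (1 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge

def isoelectric_point(peptide, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases monotonically with pH."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(peptide, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```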

Hydropathy is used here to measure how hydrophobic regions of a peptide are, i.e. whether they are polar or non-polar depending on the side chains of the amino acids used.

Kyte and Doolittle [65] proposed a method to calculate the hydropathy character of a

protein. Here we use the mean hydropathy across the entire length of the peptide sequence,

which may be problematic due to peptides hidden in globular pockets in a folded protein

structure.
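The mean hydropathy feature can be sketched using the Kyte and Doolittle scale values. Skipping residues outside the 20 standard amino acids is an assumption made for this illustration.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(peptide):
    """Average hydropathy over the peptide; non-standard residues are skipped."""
    values = [KYTE_DOOLITTLE[a] for a in peptide.upper() if a in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0
```

Positive means indicate a predominantly hydrophobic peptide and negative means a predominantly hydrophilic one.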

4.2.2 Secondary structure based features

RNA secondary structure

Some non-coding RNA types, such as ribosomal RNAs and tRNAs, are known to have secondary structures that are key to their function. Here we assume that there are no significant secondary structures associated with protein coding RNAs. In a long chain of ribonucleotides, secondary structures result from segments of intramolecular base pairing, producing distinguishable structures such as stems, loops and bulges. Given a ribonucleotide sequence, the most likely secondary structure is taken to be the one with the lowest free energy among all candidate structures. However, enumerating all possible candidates is infeasible due to the sheer number of possible structures [142]. Lyngsø and Pedersen [78] show that prediction of secondary structures taking into account pseudo-knots is NP-complete.

Zuker and Stiegler [143] describe an $O(n^3)$ dynamic programming algorithm that assumes a simplistic thermodynamic model and disregards pseudo-knots. The Vienna package [50] contains an implementation of this global secondary structure prediction, in addition to an $O(nl^2)$ local secondary structure prediction that only considers sub-structures within a sliding window of size $l$ of the input sequence. It has been shown that non-coding RNAs can be reliably detected solely by using local structures such as hairpins and stem loops [31].

We examine the RNA folding ability of each transcript by predicting the pseudo-knot free secondary structure. Given their success in distinguishing miRNA and pre-miRNA, we focus on the quality of stem loops as shown in Xue et al. [139] and Hackenberg et al. [42]. By extracting the longest stem loops, these methods are able to extract features based on the length, GC content, number of symmetric and asymmetric bulges and structure motifs, and feed them to a machine learning program to make their predictions. In addition to these features, we also extract the triplet-SVM features proposed by Xue et al. [139]. Given a secondary structure represented by an alphabet of brackets and dots, together with the ribonucleotide sequence, we can count the occurrences of each of the eight possible structure trigrams (combinations of dots and brackets), namely (((, ((., (.(, (.., .((, .(., ..( and ..., separately for each of the four RNA bases that can occupy the middle position of the trigram.
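A minimal sketch of the triplet counting is given below, taking a sequence and its dot-bracket structure as input. It assumes, as in the triplet-SVM scheme, that opening and closing brackets are both treated as "paired" and written as `(`; it returns raw counts, whereas a real feature vector would likely be normalised. The names are illustrative.

```python
from itertools import product

# Eight structure trigrams in the order (((, ((., (.(, (.., .((, .(., ..(, ...
STRUCT_TRIPLETS = ["".join(p) for p in product("(.", repeat=3)]
# One feature per (middle base, structure trigram) pair: 4 x 8 = 32
FEATURE_KEYS = [base + t for base in "ACGU" for t in STRUCT_TRIPLETS]

def triplet_features(seq, structure):
    """Count each (middle base, structure trigram) combination."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    struct = structure.replace(")", "(")      # paired is paired, either side
    counts = dict.fromkeys(FEATURE_KEYS, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + struct[i - 1:i + 2]
        if key in counts:                     # ignore ambiguous bases
            counts[key] += 1
    return [counts[k] for k in FEATURE_KEYS]
```

For a small hairpin such as `GGGAAACCC` with structure `(((...)))`, seven interior positions contribute one count each.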

There is clearly potential in investigating secondary structures, but at the same time there is a limitation in exclusively examining dynamic programming solutions. One of the major

drawbacks is that dynamic programming solutions work to get the minimum free energy

structure; however, the biologically functional RNA product is not always the candidate

structure with the minimum free energy [115].

Another practical issue is that computing structural motifs is very computationally expensive. It is expected that many large transcripts will significantly increase the running time. In that case, we have two options: either to fold only small contigs below a certain size cutoff, or to run only localised structure predictions in a sliding window. Both strategies can potentially limit the structures predicted, and can additionally be affected by the selection of size thresholds and step sizes. Our approach utilises the sliding-window approach in the experimentation.
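The windowing itself can be sketched independently of the folding program. In the sketch below, `fold_energy` is a hypothetical caller-supplied function standing in for a real structure predictor, and the choice to add a final window aligned to the 3′ end is an assumption.

```python
def sliding_windows(seq, window, step):
    """Yield (start, subsequence) pairs covering the whole sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(seq) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)        # make sure the 3' end is covered
    for start in starts:
        yield start, seq[start:start + window]

def best_window(seq, window, step, fold_energy):
    """Return (energy, start) of the window with the lowest energy.

    fold_energy is a placeholder for a real folding routine.
    """
    return min((fold_energy(sub), start)
               for start, sub in sliding_windows(seq, window, step))
```

The window size bounds the span of any base pair that can be predicted, and the step size trades running time against the risk of splitting a structure across two windows.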

4.2.3 Genomic map based features

Genomic mapped strategies use data that are mapped onto genome coordinates. With the ability to map transcripts back to the originating genome, several pieces of information become available. The two strategies used in this thesis are to observe the splicing patterns of a transcript and to mine data associated with the bases mapped to a transcript’s genomic coordinates. As such, we are limited to using data for a species with a known reference genome, which excludes these features from de novo type experiments.

For this thesis, we focus on extracting features relating to the number of exons predicted

and mapped as well as extracting data from the regions each transcript or contig maps to,

namely scores relating to evolutionary conservation and histone modifications explained in

the subsequent sections.

Evolutionary conservation

Genomic conservation is a tool to measure evolutionary distance between two or more species for a particular location. Incorporated in our classifier, it is useful for measuring specific sequences on the genome that are conserved in order to detect functional regions in the genomes [44, 75, 12, 60, 82, 136]. Analyses of sequenced genomes and data from comparative genomic studies have shown that large portions of the genome are functional elements that have not been identified [19, 15, 21, 20, 118, 89].

Two algorithms are often used to measure the conservation between species at a base-by-base level on a reference genome: VISTA [34] and Phastcons [118]. Phastcons is an HMM based program that uses phylogeny and genome alignments to calculate conservation between multiple species, whereas VISTA calculates conservation between pairs of species.

In the context of classification, it is widely accepted that protein coding RNAs are conserved [1]; however, there are inconsistent reports of conservation levels between protein coding RNAs and non-coding RNAs. Studies have shown that long non-coding RNAs are conserved across species to varying degrees [5, 17, 39, 57]. In contrast, it has also been reported that conservation is expected only in short non-coding RNAs and not in longer non-coding RNAs [98].

Histone modification data

The development of next-generation sequencing has not only provided more throughput and lower costs, it has also found its way into many different applications. Chromatin immunoprecipitation (ChIP) is one such technology that utilises this powerful sequencing technology. First described by Solomon et al. [123], ChIP uses cross linking between protein and DNA to find genome wide maps of where transcription factors bind. ChIP-Seq expands this method by introducing next-generation sequencing and mapping to rapidly determine a map of transcription factor binding sites [109].

Using ChIP-Seq technology, discovering sites of histone modifications associated with gene expression has been shown to be successful in studying their transcription factor binding [108]. In addition, chromatin state maps [88] have also been used to discover a large set of long intergenic non-coding RNAs [39]. In this thesis, we investigate the effect of using chromatin state maps in our classifier for the task of distinguishing protein coding and non-coding RNAs.

4.3 Classification

The primary goal of the classifier is to accurately detect whether an input RNA sequence

originated from a protein coding or a non-coding gene. The secondary objective is to further

classify a sequence that is predicted to be non-coding into its non-coding RNA family types.

To make the decision, the classifier makes use of features extracted from the three categories

of features described above. We investigate the classifier in two settings: one to assess the

performance by performing cross-validation of all contigs that map to known annotated

protein coding and non-coding sequences, and the other by running the classifier on the full

contig set to create a list of contigs ranked by prediction confidence. In both the training and

testing steps, features are processed and are ultimately fed into a support vector machine

that makes up the classifier model.


4.3.1 Support vector machines

The main engine used in determining the class and family types of RNA is a support vector

machine (SVM), a popular method used in classification, regression and novelty detection [10]. SVMs have become particularly useful in classification problems in computational biology due to their high accuracy, robustness with large, high-dimensional data and flexibility in modelling diverse data sources [7]. SVMs model classification problems by representing data

as points in high dimensional space. Within that space, SVM models learn a hyperplane

which maximally separates the two classes of a training dataset. SVM models are then used

to classify new instances [22, 135].

4.3.2 Performance evaluation

A standard procedure to assess the accuracy of a model consists of splitting a dataset into training and testing sets; a model is created with the training set and is evaluated with the test set. Cross validation is an alternative to this approach that uses multiple rounds of training and testing. This is especially useful when the size of the dataset is limited. One such type is K-fold cross validation. It is performed by splitting the dataset into K partitions; an SVM is trained using K − 1 partitions and evaluated with the remaining partition. This is repeated so that each partition is used once for evaluation [22, 135].
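The partitioning step of K-fold cross validation can be sketched as follows. This is a minimal illustration in Python, not the WEKA routine used in this thesis; the function name `k_fold_indices` is hypothetical, and the sketch assumes the samples have already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumes the samples are already in random order; a real experiment
    would shuffle (and usually stratify by class) before splitting.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Five folds over ten samples: each sample is tested exactly once.
folds = list(k_fold_indices(10, 5))
```

Each fold trains on K − 1 partitions and tests on the remaining one, so every sample contributes to the evaluation exactly once.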

In this thesis, we utilise cross validation to assess the performance of the classifier in both the binary and multiclass classification problems. As SVMs are binary classifiers that can only handle two classes, multiclass problems are addressed using strategies that combine multiple rounds of one-against-one or one-against-all classification with voting. For our classifier, we rely on the one-against-one implementation [54].
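The one-against-one voting idea can be illustrated with a short Python sketch. It is not the LIBSVM implementation: the pairwise "classifiers" here are hypothetical length thresholds standing in for trained binary SVMs, and the class names and cut-offs are chosen purely for illustration.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise_classifiers):
    """Predict a class by majority vote over all pairwise classifiers.

    pairwise_classifiers maps a class pair (a, b) to a function that
    returns either a or b for the input x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_classifiers[(a, b)](x)] += 1
    # Ties are broken by the order of `classes` (an arbitrary choice here).
    return max(classes, key=lambda c: votes[c])


# Hypothetical stand-ins for trained binary SVMs: sequence length cut-offs.
classes = ["miRNA", "snoRNA", "mRNA"]
pairwise = {
    ("miRNA", "snoRNA"): lambda length: "miRNA" if length < 50 else "snoRNA",
    ("miRNA", "mRNA"): lambda length: "miRNA" if length < 300 else "mRNA",
    ("snoRNA", "mRNA"): lambda length: "snoRNA" if length < 300 else "mRNA",
}

predictions = [one_vs_one_predict(n, classes, pairwise) for n in (22, 120, 1500)]
```

With k classes, k(k − 1)/2 binary classifiers are consulted, so the three classes above require three pairwise decisions per input.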

For each classification experiment, the accuracy, precision and recall are calculated.

These are evaluated based on the true counts (TP and TN) and the false counts (FP, FN)

from the confusion matrix (Table 4.2).

Accuracy is the proportion of correct predictions out of the total sample size [96].

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Precision is the proportion of true positives among all samples predicted as positive [96].

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

Recall is the proportion of true positives among all samples that are actually positive [96].

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

                             Predicted Class
                             Positive                     Negative
Actual Class   Positive      True Positive Count (TP)     False Negative Count (FN)
               Negative      False Positive Count (FP)    True Negative Count (TN)

Table 4.2: Confusion matrix (or coincidence matrix) for a two-class classification problem. The correct predictions, true positive and true negative, are shaded while the erroneous predictions, false positives and false negatives, are not.
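As a worked illustration of the three measures, the following Python sketch computes them from the four cells of a confusion matrix. The counts are hypothetical and are not results from this thesis; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against empty denominators (assumed convention: 0.0).
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical counts for a 200-sample test set.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

For these counts, accuracy is 170/200, precision is 90/110 and recall is 90/100.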

4.3.3 Cross validation evaluation

We evaluate the performance of the classifier on annotated sequences, that is, sequences with known class. This allows us to evaluate the performance of the classifier under different settings.

Binary coding vs. non-coding classification

SSGC is applied to binary classification, that is, differentiating coding from non-coding RNA sequences. Physically, both sets of sequences can be similar, as they are composed of the same alphabet and overlap in sequence size. Using the features of the SSGC, we demonstrate its ability to predict the class of input sequences. This is performed using SVMs with cross validation on sequences with known classes or on annotated contigs.


Multiclass RNA family classification

Many strategies found in the literature perform their classification based on the two crude classes of non-coding RNA and protein coding mRNA. This can be a naive approach, as non-coding RNAs have many family types that differ in size, structure and function. Our classifier attempts to distinguish not just non-coding RNA from protein coding RNA, but also among the multiple non-coding families. Some family types that we apply our classifier to include piRNA, miRNA, pre-miRNA, snoRNA, snRNA, tRNA and rRNA. To solve this multiclass problem, we look to a one-versus-one implementation of the support vector machine classifier. In addition, we investigate a multi-phase classifier that performs multiclass classification once protein coding sequences are removed.

4.3.4 Full contig prediction

Applying the classifier to labelled sequences allows us to evaluate it. However, this limits its use to sequences that are already known and classified. In particular, on assembled contigs it can only be evaluated for contigs that map to known annotated sequences. Although the performance cannot be directly determined, we also investigate the ability to predict the class of the entire contig set.

Classification on the entire contig set is achieved by first training an SVM model using

the subset of sequences mapped to known sequences. The model can then be applied to the

entire contig set to predict the class and the confidence of each contig (refer to Figure 4.3).
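The ranking step can be sketched in Python as below. The sketch assumes the trained model has already produced, for each contig, a confidence value between 0 (strongly protein coding) and 1 (non-coding), following the convention of Figure 4.3. The contig identifiers, the scores and the 0.5 decision threshold are hypothetical.

```python
def rank_contigs(noncoding_confidence, threshold=0.5):
    """Rank contigs from most protein-coding-like to most non-coding-like.

    noncoding_confidence maps contig id -> value in [0, 1]; the threshold
    used to assign a class label is an assumption of this sketch.
    """
    ranked = sorted(noncoding_confidence.items(), key=lambda item: item[1])
    return [
        (contig_id, p, "non-coding" if p >= threshold else "protein coding")
        for contig_id, p in ranked
    ]


# Hypothetical model output for four contigs.
scores = {"contig_1": 0.93, "contig_2": 0.08, "contig_3": 0.51, "contig_4": 0.30}
ranking = rank_contigs(scores)
```

Contigs at either end of the ranked list are the most confident predictions and are natural candidates for closer inspection.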

4.3.5 Feature set ranking

We also investigate the effectiveness of our feature set. It is possible that some features will not be available for some datasets. Also, many features do not apply to all possible transcript types. Notably, numerous features associated with the ORFs of proteins do not apply to non-coding RNA, and analogously, secondary structure features do not apply to protein coding genes. If a transcript can be identified as a protein coding gene, we would be uninterested in measuring the degree of secondary structure, just as we would be uninterested in computing ORF features for non-coding RNA. Computing unneeded features can be a strain on resources.

We investigate the features that are the most effective in our classification experiments. Once the feature set is assessed, we propose that subsets of features be called upon under certain conditions. Ultimately, we envision a multiple step classifier, one that will have multiple


[Figure 4.3: flow diagram. Labels: Normalised feature vectors; contigs mapped to protein coding sequences; contigs mapped to non-coding sequences; Train model; SVM model; full contig set; Predict; Ranked contig predictions by p-value (protein coding to non-coding).]

Figure 4.3: Contig prediction procedure for the full contig set. A subset of contigs mapped to protein coding and non-coding sequences from Ensembl and fRNAdb, respectively, are used to train an SVM model. The SVM model is used to classify the entire contig set, predicting the class and p-value for each contig. The p-value allows the contigs to be ranked, from strongly protein coding (0) to non-coding (1).


feature extraction and classification steps. For this thesis, we are interested in first separating the transcripts representing all genes, and then separating the transcripts into the multiple classes, as shown in Figure 1.1.

Chapter 5

Implementation

This chapter describes the steps taken to construct the classifier, and to run the experi-

ments. Section 5.1 describes the steps involved in computing the features from a set of

sequences. Section 5.2 describes the steps used to assess the classifier performance, predict

novel transcripts, and to rank the features used.

5.1 Feature extraction

The classifier is designed to distinguish one set of sequences from another using a number

of feature extraction strategies. Feature extraction was designed as a set of modular tools

that can be turned on or off depending on the data available, the effectiveness, the time

and space constraints of the system used. The central programs are accessible from the command line and are controlled using a set of arguments as well as configuration files.

In total, 169 features are configured for the classifier: 159 are de novo and an additional 10 are genome based. Table 4.1 lists the features used by the classifier. These features are fed to a support vector machine that makes up the core of the model building and decision making process. The following sections explain in detail each of the components used in the feature extraction procedure.

5.1.1 Sequence based feature extraction

Programming for sequence based feature extraction was done in Perl in a UNIX environment. Perl was used to manage the components of the system, to perform some of the feature extraction calculations, and as the scripting language that invoked the classification tools.

Perl was used for feature extraction for the following feature types: GC content, length, nucleotide composition, amino acid composition, ORF analysis, and, through the BioPerl libraries [124], isoelectric point and mean hydropathy. The pK values of the amino acid side chains used to calculate the isoelectric point were based on the values found in the EMBOSS toolkit [106]. Mean hydropathy was calculated using a BioPerl implementation of the method proposed by Kyte and Doolittle [65].
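Two of the simpler sequence features can be illustrated as follows. This is a Python sketch for exposition only; the thesis implementation was written in Perl with BioPerl, and the function names here are hypothetical. The hydropathy values are the standard Kyte and Doolittle scale, and residues outside the 20 standard amino acids are simply skipped, which is an assumption of this sketch.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gc_content(sequence):
    """Fraction of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return sum(sequence.count(base) for base in "GC") / len(sequence)


def mean_hydropathy(protein):
    """Mean Kyte-Doolittle hydropathy over the recognised residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


gc = gc_content("GGCCAATT")          # 4 of 8 bases are G or C
hydropathy = mean_hydropathy("AI")   # (1.8 + 4.5) / 2
```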

To extract the ORF from a transcript or contig sequence, the ESTate package [121] was used, as it is specially tailored to handle potential sequencing and frameshift errors in the input data, making it ideal for assembled contigs. The training data was used to extract the word usage and probabilities, and framefinder was used both to extract the ORF and to calculate the log-odds score.

Low-complexity regions were detected using the Compositional Bias Detection Algo-

rithm [102] using the default values. The compositional entropy feature was calculated by

taking the number of masked residues divided by the total length of the ORF.

5.1.2 Secondary structure feature extraction

We examine the RNA folding potential of each transcript using tools from the Vienna package [49, 50]. We have the option of running either full secondary structure prediction using RNAfold or local secondary structure prediction using RNALfold. In the interest of running time, we perform all our tests using local secondary structure prediction, with the span size set to 150 bp.

From the output of these structure prediction programs, we extract the longest stem loop using a modified version of the code available from Xue et al. [139]. This also gives us the 32 triplet-SVM features, which are three-character motifs from the structure sequence, made up of dots (unpaired bases) and brackets (paired bases), for each of the four possible bases A, C, G, and U. Once we extract the longest stem loop, we extract features for the stem length, minimum free energy in hairpin, loop length, loop GC, asymmetric bulges, symmetric bulges, and the longest bulge.
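The 32 triplet features can be pictured with the following Python sketch. It is a simplified reading of the triplet idea (8 local structure patterns times 4 middle bases) and is not the modified Xue et al. code used here: it assumes plain dot-bracket input, treats both bracket directions as "paired", and normalises the counts to frequencies, all of which are assumptions of this illustration.

```python
from itertools import product


def triplet_features(sequence, structure):
    """Frequencies of the 32 (middle base, 3-position structure) patterns."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    sequence = sequence.upper().replace("T", "U")
    # Collapse ')' onto '(' so a position is either paired or unpaired.
    paired = structure.replace(")", "(")
    keys = [base + "".join(pattern)
            for base in "ACGU" for pattern in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    for i in range(1, len(sequence) - 1):
        key = sequence[i] + paired[i - 1:i + 2]
        if key in counts:  # skips ambiguous bases such as N
            counts[key] += 1
    total = sum(counts.values())
    return {key: (n / total if total else 0.0) for key, n in counts.items()}


# A toy hairpin: three base pairs closing a three-base loop.
features = triplet_features("GGGAAACCC", "(((...)))")
```

For the toy hairpin, seven interior positions are counted, so each observed pattern such as `A...` has frequency 1/7.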


5.1.3 Genomic map based feature extraction

For non-de novo experiments, where we have the reference sequence available, we can observe the splicing patterns of the transcript, and take into account the number of exons as well as their placement. For assembled contig sequences, genome coordinates are predicted using BLAT [61] for each contig, mapped to the mouse mm9 (NCBI m37) reference genome. For multiple genomic candidates, a single coordinate is chosen based on the highest score:

\[ \mathrm{score} = n_{\mathrm{match}} - n_{\mathrm{mismatch}} - n_{\mathrm{query\,inserts}} - n_{\mathrm{target\,inserts}} \]

Using the information from BLAT, the best alignment for each contig sequence can then be used to predict the number of exons present as well as its coordinates on the reference genome.
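Selecting the best genomic candidate with this score can be sketched in Python as follows. The dictionary keys follow the column names of BLAT's PSL output (`matches`, `misMatches`, `qNumInsert`, `tNumInsert`, `blockCount`); mapping the score terms onto these columns, and reading the number of alignment blocks as an exon count, are assumptions of this sketch. The two hits are hypothetical.

```python
def alignment_score(hit):
    """Score = matches - mismatches - query inserts - target inserts."""
    return (hit["matches"] - hit["misMatches"]
            - hit["qNumInsert"] - hit["tNumInsert"])


def best_alignment(hits):
    """Return the candidate alignment with the highest score."""
    return max(hits, key=alignment_score)


# Two hypothetical candidate alignments for one contig.
hits = [
    {"tName": "chr1", "matches": 950, "misMatches": 10,
     "qNumInsert": 2, "tNumInsert": 3, "blockCount": 4},
    {"tName": "chr7", "matches": 900, "misMatches": 5,
     "qNumInsert": 0, "tNumInsert": 1, "blockCount": 2},
]
best = best_alignment(hits)
exon_estimate = best["blockCount"]  # assumed proxy for the number of exons
```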

Evolutionary conserved regions

Genomic conservation is used to score mapped regions of transcripts. This value is calculated using phastCons [118], a multi-species conservation algorithm. The values used were based on the mm9 mouse model trained on 30 vertebrate species available from the UCSC server [105]. The conservation scores taken from each individual base pair in the mapped regions are used to calculate the mean conservation score across all exons, the proportion of the transcript with conservation, and the number of exon blocks with conservation.
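The three conservation features can be sketched in Python as below. The text does not state what counts as a base "with conservation", so the cut-off of 0.5 used here is purely an assumption, as are the function name and the example scores.

```python
def conservation_features(exon_scores, threshold=0.5):
    """Summarise per-base conservation scores, given one list per exon block.

    A base is treated as conserved when its score is >= threshold
    (the threshold value is an assumption of this sketch).
    """
    all_scores = [s for exon in exon_scores for s in exon]
    if not all_scores:
        return {"mean_score": 0.0, "conserved_fraction": 0.0, "conserved_blocks": 0}
    conserved = sum(1 for s in all_scores if s >= threshold)
    return {
        "mean_score": sum(all_scores) / len(all_scores),
        "conserved_fraction": conserved / len(all_scores),
        "conserved_blocks": sum(
            1 for exon in exon_scores if any(s >= threshold for s in exon)
        ),
    }


# Hypothetical per-base scores for a transcript with three exon blocks.
summary = conservation_features([[0.9, 0.8, 0.1, 0.2], [0.0, 0.1], [1.0, 1.0]])
```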

Histone expressed regions

Similar to the evolutionary conserved regions, mapped regions can be used to calculate scores based on any signal that can be mapped to the reference genome. We apply this method using signals derived from ChIP-Seq profiles of histone H3 lysine 4 trimethylation on an adult mouse liver library [108]. The score is calculated as the number of aligned tags from the ChIP-Seq experiment divided by the overall length of the transcript.

5.2 Classification

From the set of features extracted from a sequence, classification is performed to ultimately predict the class of that sequence.


5.2.1 Support vector machine

We used LIBSVM [14] under the WEKA [43] WLSVM implementation [29]. Features were

extracted in the same way for both the training and testing datasets. Missing values were replaced using weka.filters.unsupervised.attribute.ReplaceMissingValues, and all entries were normalised to values ranging from -1 to 1.
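The two preprocessing steps can be illustrated for a single numeric feature column with the Python sketch below. It mirrors the general idea (mean imputation followed by min-max scaling to the range -1 to 1) rather than reproducing the WEKA filters; the function names are hypothetical, and mapping a constant column to the midpoint of the range is an assumption.

```python
def replace_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    return [mean if v is None else v for v in column]


def scale_to_range(column, low=-1.0, high=1.0):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: place every value at the midpoint (assumed).
        return [(low + high) / 2 for _ in column]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]


raw = [2.0, None, 4.0]
filled = replace_missing_with_mean(raw)
scaled = scale_to_range(filled)
```

In practice the mean, minimum and maximum would be estimated on the training data and reused unchanged on the test data, so that both sets are transformed identically.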

Cross validation

All cross validation experiments used five folds with the following settings: S = 0, K = 2,

D = 3, G = 0.0, R = 0.0, N = 0.5, M = 40.0, C = 1.0, E = 0.0010, P = 0.1, i, B.

Full contig classification

To classify the entire contig set, an SVM model was trained using a balanced subset of

contigs that mapped to Ensembl protein coding and the fRNAdb non-coding sequences

using 0.8 as the threshold cutoff. The SVM model was created using the same settings as

above. The resulting SVM model was applied to the feature set of all contigs using the

settings: p = 0, distribution.

Feature ranking

For feature ranking, the information gain ranking filter InfoGainAttributeEval with setting

x = 10 was used with search method Ranker with settings: $T = 1.7976931348623157 \times 10^{308}$,

N = 10.
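The quantity behind information gain ranking can be sketched in Python as follows. This is a textbook calculation on already discretised feature values, not the WEKA evaluator itself; the feature names and the four toy samples are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: one feature separates the classes perfectly, the other not at all.
labels = ["coding", "coding", "non-coding", "non-coding"]
features = {
    "orf_length_bin": ["long", "long", "short", "short"],
    "gc_bin": ["high", "low", "high", "low"],
}
ranking = sorted(features, key=lambda name: information_gain(features[name], labels),
                 reverse=True)
```

Features are then ranked from the highest to the lowest gain, which is the ordering a ranker-style search reports.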

Chapter 6

Experimental results

This chapter describes our experimental results. We first assess the performance of our

Sequence-Structure-Genome Classifier (SSGC) for the application reported in the literature:

binary coding vs. non-coding classification using sequences in annotated databases. We

compare the performance of our classifier and an alternative program, PORTRAIT [3].

We then report the results of our extensions. We extend binary classification to multiclass

classification of different types of non-coding RNA, and show that our classifier is potentially

useful in this setting. We present our findings for RNA-Seq data and de novo transcriptome

assembly, using seven datasets from a range of mouse tissues and developmental stages. We

quantify the expression level of annotated transcripts of a range of Ensembl biotypes using

a reads-to-genome mapping procedure, then determine how many annotated transcripts of

which biotypes map to assembled contigs, using a range of mapping score thresholds. We

note that the relatively low expression level of many types of non-coding RNAs may prevent

them from being efficiently sampled by de novo assembly. Our classifier takes as input a

collection of contigs, as well as the database elements that map to the contigs. From our cross

validation experiments, the performance of our classifier on these inputs is comparable to

its performance on biotype-annotated transcripts in public databases. SSGC is also applied

to the entire, mostly unlabelled, contig set and based on the p-value confidence scores, we

explore the ability to classify contigs and to find potentially novel coding and non-coding

entities. As we used more feature types than previously published binary classifiers, we

conclude the chapter by briefly evaluating our feature sets and examining which features

are important for binary and multiclass classification.


6.1 Coding and non-coding databases

To assess the performance of the classifier, we obtained protein coding mRNA transcript

sequences and non-coding RNA sequences from a number of public databases. We com-

pared the effectiveness of our classifier with the competing method PORTRAIT [3], a high

performing classification method that computes features and uses a similar classification

model using LIBSVM.

In this section we present the results of our classifier applied to various sequence databases.

We then present our classifier results using multiple classes of non-coding RNA types. This is done by performing all pairwise comparisons of non-coding RNA types and then performing a multiclass classification.

6.1.1 EMBL and Swissprot vs. non-coding

Preparation

For this dataset, protein coding mRNA sequences were obtained through a long sequence of steps first described in Arrial et al. [3]. 241,242 protein coding sequences were obtained from Swissprot [11] release 51.0 31-October-2006. To reduce the number of similar and over-represented protein sequences, we used CD-HIT [73] with the cutoff set to 0.7, resulting in 118,398 entities. The Swissprot sequence IDs were used to obtain mRNA sequences from EMBL [69] using the EBI DBIfetch tool. To further reduce the number of similar sequences, BLASTCLUST [26] was run using the arguments p = F, S = 0.5, L = 0.5, W = 18. To

ensure compatibility with PORTRAIT, data sequences were restricted to lengths within 80

to 65,535 bp, resulting in a total of 53,834 mRNA sequences.

Non-coding RNAs were obtained from three databases, Rfam [36], RNADB [99], and

NONCODE [46]. Combined, there were 763,842 sequences. BLASTCLUST [26] was ap-

plied on the sequences with the same setting as for the protein coding sequences. Entries

outside the 80 to 65,535 bp range were removed. The total number of non-coding sequences

remaining was 60,849.

Classification

We compare our classifier with PORTRAIT [3], testing performance with the EMBL [69] and Swissprot [11] derived dataset as the protein coding set and the combination of Rfam [36], RNADB [99], and NONCODE [46] as the non-coding set. Table 6.1 summarises the results. For the EMBL test set, PORTRAIT outperforms our classifier, scoring higher in accuracy, precision and recall. From this result we conclude that PORTRAIT is a better classifier for this dataset.

            SSGC                              PORTRAIT
Size        Accuracy  Precision  Recall      Accuracy  Precision  Recall
1000        91.6      0.92       0.916       95.6      0.952      0.960
10000       93.16     0.93       0.935       96.6      0.960      0.973
50000       93.81     0.94       0.941       96.7      0.963      0.972
Wt. Ave.    93.7      0.93       0.940       96.7      0.962      0.972

Table 6.1: SSGC performance compared with PORTRAIT for the dataset composed of Swiss-prot and EMBL for protein coding set, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.

6.1.2 Ensembl protein coding vs. non-coding

Preparation

To simulate the full length mRNAs found in transcriptome studies, we also look to mm9 mR-

NAs obtained from Ensembl v60 [55]. From the range of biotypes available from Ensembl,

we consider sequences with the biotype protein coding, consisting of 88,186 sequences. In

the same manner as in the EMBL dataset in the previous section, we performed BLAST-

CLUST [26] using the same arguments and restricted the sequences to the same size ranges,

resulting in 46,261 total sequences.

The same non-coding RNA dataset consisting of 60,849 sequences explained in the pre-

vious section was used.

Classification

SSGC was compared with PORTRAIT [3] using Ensembl v60 [55] protein coding transcripts

as the protein coding set, and the same non-coding RNA set as in section 6.1.1. The results

are summarised in Table 6.2. In this case, SSGC outperforms PORTRAIT in terms of

accuracy, precision and recall. The difference in performance between this dataset and the last is striking. As the same non-coding set is used, and transcripts are clustered and size-selected for both, the likely difference between the inputs is that the EMBL sequences contain purely the ORF-containing portion of the mRNA while the Ensembl set contains the full mRNA sequence including the UTRs. For the purpose of contig classification in the transcriptome, we expect to see full-length mRNAs that include UTR sequences and so resemble those in the Ensembl dataset.

            SSGC                              PORTRAIT
Size        Accuracy  Precision  Recall      Accuracy  Precision  Recall
1000        93.4      0.93       0.938       87.3      0.892      0.848
10000       92.28     0.92       0.932       89.0      0.905      0.870
50000       92.92     0.92       0.937       89.3      0.909      0.873
Wt. Ave.    92.8      0.92       0.936       89.2      0.908      0.872

Table 6.2: SSGC performance compared with PORTRAIT for the dataset composed of Ensembl protein coding, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.

6.1.3 Ensembl vs. fRNAdb

Preparation

To test the ability to distinguish a range of different non-coding RNA types, we look to

fRNAdb [63] for mouse mm9 sequences, downloaded March 1st, 2010. fRNAdb has in total

83,826 sequences divided into nine RNA types: fly-smallRNA, mat-miRNA, misc, piRNA,

pre-miRNA, rRNA, snoRNA, snRNA, and tRNA, containing 1664, 651, 31532, 48550, 597,

17, 735, 67, and 18 elements, respectively.

Protein coding sequences are made up of Ensembl v60 [55] transcripts with biotype protein coding, as before. To allow comparison with the smaller non-coding RNAs found in fRNAdb, no filtering was performed based on similarity or size.

Classification

The previous sections presented our findings for the binary ‘coding vs. non-coding’ class problem using exclusively de novo features. In this section we extend our experiments in two ways: we compare the performance of the complete feature set (which includes genome based features and the de novo feature set) with that of the de novo features alone, and we investigate the multiclass problem by including several non-coding RNA types in our classification. Our investigation is performed using datasets from Ensembl [55] protein coding and the multiple non-coding RNA types from fRNAdb [63].

We investigate our classifier performance using the entire feature set and the de novo

feature set for the binary class using Ensembl and fRNAdb. Table 6.3 presents the per-

formance of the classification. Using the full feature set results in a slightly better overall

performance.

Table 6.4 presents the results for the pairwise binary classification between Ensembl

protein coding elements and each non-coding element found in fRNAdb, using both all

features and only de novo features. The resulting accuracies are high for each pair of RNA

elements; the misc class has the lowest performance in classification.

Features    Accuracy  Precision [nc]  Recall [nc]
all         96.3      0.966           0.976
de novo     95.6      0.961           0.97

Table 6.3: Binary classification performance between Ensembl protein coding with all fRNAdb non-coding sequences. The first row represents the experiment where all features are used. The second row represents the experiment where only the de novo features were used.

In addition to the pairwise binary classification between protein coding sequences and

all non-coding RNA types, pairwise binary classification was performed on each pair of

non-coding RNA. Table 6.5 presents the result of our tests using all features, and Table 6.6

presents the tests using strictly de novo features. The number of samples per class varies

and likely causes fluctuations in the precision and recall but overall, the feature sets used

are promising in this binary classification problem.

In addition to the binary pairwise classification experiments, we performed multiclass classifications between non-coding RNAs both with and without protein coding sequences. Table 6.7 presents the confusion matrix of the multiclass classification for the nine non-coding RNA types found in fRNAdb. The high counts along the shaded diagonal cells, the true positives, indicate the potential of our classifier for use on multiple non-coding RNA types. However, we do observe a skew in predictions towards RNA types that are heavily represented in fRNAdb. Having small test sets for some RNA elements alongside very large test sets indicates potential limitations in our current multiclass methodology.


Features  Class 1      Class 2       Elements  Accuracy  Precision [nc]  Recall [nc]
all       prot-coding  fly-smallRNA  49789     99.9      0.980           0.996
          prot-coding  mat-miRNA     48776     100.0     0.986           0.985
          prot-coding  misc          79657     94.7      0.924           0.943
          prot-coding  piRNA         96675     99.7      0.996           0.999
          prot-coding  pre-miRNA     48722     99.7      0.911           0.807
          prot-coding  rRNA          48142     100.0     1.000           0.765
          prot-coding  snoRNA        48860     99.5      0.884           0.761
          prot-coding  snRNA         48192     99.9      0.968           0.448
          prot-coding  tRNA          48143     100.0     1.000           0.889
          Average                    57440     99.3      0.961           0.844
de novo   prot-coding  fly-smallRNA  49789     99.9      0.980           0.996
          prot-coding  mat-miRNA     48776     99.9      0.964           0.991
          prot-coding  misc          79657     93.8      0.915           0.931
          prot-coding  piRNA         96675     99.7      0.996           0.999
          prot-coding  pre-miRNA     48722     99.6      0.862           0.762
          prot-coding  rRNA          48142     100.0     1.000           0.765
          prot-coding  snoRNA        48860     99.4      0.867           0.710
          prot-coding  snRNA         48192     99.9      0.972           0.522
          prot-coding  tRNA          48143     100.0     1.000           0.889
          Average                    57440     99.1      0.951           0.841

Table 6.4: Pairwise classification performance between Ensembl protein coding elements vs. each RNA type found in fRNAdb. The first half represents the results where all features are used. The second half represents the results where only de novo features were used, thereby excluding genome mapped information such as the number of exons and cross-species conservation scores.


Class 1       Class 2    Elements  Accuracy  Precision [nc]  Recall [nc]
fly-smallRNA  mat-miRNA  2315      94.4      0.960           0.963
fly-smallRNA  misc       33196     99.9      0.991           0.996
fly-smallRNA  piRNA      50214     99.2      0.877           0.890
fly-smallRNA  pre-miRNA  2261      100.0     1.000           1.000
fly-smallRNA  rRNA       1681      99.9      0.999           1.000
fly-smallRNA  snoRNA     2399      99.8      0.999           0.999
fly-smallRNA  snRNA      1731      99.9      0.999           1.000
fly-smallRNA  tRNA       1682      100.0     1.000           1.000
mat-miRNA     misc       32183     99.9      0.973           0.983
mat-miRNA     piRNA      49201     99.5      0.815           0.823
mat-miRNA     pre-miRNA  1248      100.0     1.000           1.000
mat-miRNA     rRNA       668       99.9      0.998           1.000
mat-miRNA     snoRNA     1386      99.9      0.997           1.000
mat-miRNA     snRNA      718       99.9      0.998           1.000
mat-miRNA     tRNA       669       100.0     1.000           1.000
misc          piRNA      80082     99.6      0.998           0.993
misc          pre-miRNA  32129     99.4      0.996           0.998
misc          rRNA       31549     100.0     1.000           1.000
misc          snoRNA     32267     99.1      0.994           0.997
misc          snRNA      31599     99.9      0.999           1.000
misc          tRNA       31550     100.0     1.000           1.000
piRNA         pre-miRNA  49147     100.0     1.000           1.000
piRNA         rRNA       48567     100.0     1.000           1.000
piRNA         snoRNA     49285     99.9      1.000           1.000
piRNA         snRNA      48617     100.0     1.000           1.000
piRNA         tRNA       48568     100.0     1.000           1.000
pre-miRNA     rRNA       614       99.5      0.995           1.000
pre-miRNA     snoRNA     1332      95.7      0.958           0.946
pre-miRNA     snRNA      664       98.6      0.990           0.995
pre-miRNA     tRNA       615       99.5      0.995           1.000
rRNA          snoRNA     752       98.9      1.000           0.529
rRNA          snRNA      84        95.2      1.000           0.765
rRNA          tRNA       35        97.1      0.944           1.000
snoRNA        snRNA      802       97.3      0.975           0.996
snoRNA        tRNA       753       99.3      0.996           0.997
snRNA         tRNA       85        97.6      0.985           0.985
Average                  18629     99.1      0.984           0.968

Table 6.5: Pairwise classification performance using the complete feature set for fRNAdb non-coding RNA. Precision and recall are only shown for the second class.


Class 1       Class 2    Elements  Accuracy  Precision [nc]  Recall [nc]
fly-smallRNA  mat-miRNA  2315      93.3      0.953           0.954
fly-smallRNA  misc       33196     99.9      0.991           0.997
fly-smallRNA  piRNA      50214     98.3      0.814           0.635
fly-smallRNA  pre-miRNA  2261      99.9      0.999           1.000
fly-smallRNA  rRNA       1681      99.9      0.999           1.000
fly-smallRNA  snoRNA     2399      99.7      0.998           0.999
fly-smallRNA  snRNA      1731      99.9      0.999           1.000
fly-smallRNA  tRNA       1682      100.0     1.000           1.000
mat-miRNA     misc       32183     99.9      0.976           0.983
mat-miRNA     piRNA      49201     99.4      0.936           0.582
mat-miRNA     pre-miRNA  1248      100.0     1.000           1.000
mat-miRNA     rRNA       668       99.9      0.998           1.000
mat-miRNA     snoRNA     1386      99.6      0.992           1.000
mat-miRNA     snRNA      718       99.9      0.998           1.000
mat-miRNA     tRNA       669       100.0     1.000           1.000
misc          piRNA      80082     99.6      0.998           0.993
misc          pre-miRNA  32129     99.1      0.994           0.997
misc          rRNA       31549     100.0     1.000           1.000
misc          snoRNA     32267     98.7      0.990           0.997
misc          snRNA      31599     99.9      0.999           1.000
misc          tRNA       31550     100.0     1.000           1.000
piRNA         pre-miRNA  49147     99.9      0.999           0.999
piRNA         rRNA       48567     100.0     1.000           1.000
piRNA         snoRNA     49285     99.6      0.997           0.999
piRNA         snRNA      48617     99.9      0.999           1.000
piRNA         tRNA       48568     100.0     1.000           1.000
pre-miRNA     rRNA       614       99.0      0.990           1.000
pre-miRNA     snoRNA     1332      92.7      0.924           0.913
pre-miRNA     snRNA      664       97.1      0.975           0.993
pre-miRNA     tRNA       615       99.7      0.997           1.000
rRNA          snoRNA     752       98.9      1.000           0.529
rRNA          snRNA      84        98.8      1.000           0.941
rRNA          tRNA       35        97.1      0.944           1.000
snoRNA        snRNA      802       96.5      0.968           0.995
snoRNA        tRNA       753       99.3      0.996           0.997
snRNA         tRNA       85        97.6      0.985           0.985
Average                  18629     99.0      0.984           0.958

Table 6.6: Pairwise classification performance using de novo feature set for fRNAdb non-coding RNA, similar to Table 6.5. Precision and recall are only shown for the second class.


Despite this, the results suggest that our method is a good initial step in classifying among different non-coding RNA sets. This limitation is a possible subject of further study.

Classified as (prediction)                                     Class (actual)
    a     b      c      d     e     f     g     h     i
  959    46      0    659     0     0     0     0     0        a
   66   320      0    265     0     0     0     0     0        b
    2     1  31206    233    18     0    72     0     0        c
  207    73     63  48204     0     0     3     0     0        d
    0     0    109      6   460     0    22     0     0        e
    0     0     11      3     0     1     2     0     0        f
    0     0    226     58    30     0   419     2     0        g
    0     0     30     10     1     0    11    15     0        h
    0     0      2      3     1     0     0     0    12        i

Table 6.7: Confusion matrix for the multiclass classification using fRNAdb RNA types, using the entire feature set. The cells represent the number of predictions for each type, the shaded cells represent the number of true positives. Each RNA type is labelled from a to i, representing in order: fly-smallRNA, mat-miRNA, misc, piRNA, pre-miRNA, rRNA, snoRNA, snRNA and tRNA.

6.2 The RNA-Seq dataset

Classification was performed on data derived from transcriptome sequencing experiments,

using contig sets created using the Trans-ABySS [110] pipeline.

In our analysis, we first examine how well coding and non-coding RNA transcripts are represented by RNA-Seq reads. This is done using two methods: a genome mapping procedure that measures read coverage on annotated locations of Ensembl and fRNAdb elements, then a direct mapping from assembled contig to annotation using a range of mapping thresholds. Our results ultimately show that there are non-coding RNAs represented as contigs, but that there are too few non-coding RNA types represented to support multiclass classification. We continue our investigation on contig classification using the binary 'protein coding vs. non-coding' classes.


6.2.1 Contig preparation

Contig sets were generated from six RNA-Seq libraries: MM0490, MM0564, MM0566, MM0570, MM0571, and MM0581. Each library consists of 50 bp paired-end poly(A)+ RNA reads as described in Robertson et al. [110]. These six libraries represent various developmental stages and tissue types of C57BL/6J mouse. Table 6.8 lists the libraries along with their tissue of origin, age, and the number of transcriptome reads sequenced.

Library  Tissue                           Age    Reads
MM0490   Liver                            E14.5  157M
MM0564   Heart-Atrioventricular-Cushions  E12.5  229M
MM0566   Heart-Atrioventricular-Cushions  E11.5  257M
MM0570   Dorsal Aorta                     E11.5  217M
MM0571   U and V Aorta                    E14.5  235M
MM0581   Endoderm-Definitive              E8.5   250M

Table 6.8: Six seven-lane RNA-Seq mouse libraries were examined.

6.2.2 Transcriptome reads mapped to the genome

We map the transcriptome reads to the mouse mm9 genome and calculate the read coverage

using the coordinates of each annotated element. This is done by mapping each read using

BWA [70] and SAMtools [71] to a modified mouse genome, one that contains pre-spliced

junctions between possible exon pairs as described in Morin et al. [90]. For this study,

these steps are taken for the Ensembl [55] v60 annotation for the mouse. Exon-exon junc-

tion coordinates are defined from Ensembl [55], Refseq [103] and UCSC known gene [53]

annotations.

The transcriptome reads are mapped to the genome and the coverage is calculated for

each annotation in Ensembl v60. Figures 6.1 and 6.2 show the breakdown of read coverage

for a set of non-coding RNA-related biotype annotations using MM0564 reads. Protein

coding annotations are well expressed, as expected, but the non-coding annotations have

varying amounts of coverage. Assembling transcripts de novo from an RNA-Seq experiment

requires higher read coverage than reference based methods [110]. From this mapping

experiment alone it is unclear what fraction of different non-coding biotypes will be available

as assembled contigs.


[Figure 6.1: plot panels for the biotypes protein_coding, lincRNA, miRNA, misc_RNA, pseudogene, rRNA, snoRNA and snRNA. The ECDF panels plot Fn(x) against x; the histogram panels are titled "Distribution of transcript coverages for library MM0564".]

Figure 6.1: Read coverage for Ensembl broken down to biotypes, for RNA-Seq reads from library MM0564. Each biotype is represented as an ECDF and as a distribution of log10 read coverage.


[Figure 6.2 plot: ECDF of Ensembl v60 transcript read coverage for RNA-Seq library MM0564. x axis: read coverage (1e-03 to 1e+05); y axis: cumulative fraction (0.0 to 1.0). Legend: protein_coding, lincRNA, miRNA, misc_RNA, pseudogene, rRNA, snoRNA, snRNA.]

Figure 6.2: Empirical cumulative distribution function representing the read coverage for a select number of Ensembl biotypes mapped to the mm9 reference genome from Figure 6.1.


6.2.3 Contig assembly and merging

Each RNA-Seq library was assembled and merged using Trans-ABySS [110], assembling the reads for every even k-mer length from 26 to 50 and producing a set of contigs for each library. One issue with de Bruijn graph based assemblers is that, depending on the coverage and the k-mer length k, the assembly can yield highly fragmented and overlapping contigs. We therefore processed the contig sets using the contig merging method of [110]. To prevent the potential exclusion of non-coding RNAs from the merged dataset, we examined merged contig sets with filtering turned both on and off. The resulting contig sets are summarised in Table 6.9.

                                   Number       Min   Max     Ave.   Med.
Filter  Library  Reads             of contigs   size  size    size   size  N50
yes     MM0490   157,441,166       5,701,316    25    71,739  86.4   46    91
        MM0564   229,499,055       2,450,369    25    58,854  228.4  59    1,249
        MM0566   257,298,896       2,742,649    25    63,266  210.3  57    1,121
        MM0570   217,279,470       3,806,318    25    21,614  127.5  55    318
        MM0571   235,143,912       2,402,290    25    21,519  173.5  61    636
        MM0581   249,969,333       4,090,155    25    54,048  212.3  54    967
no      MM0490   157,441,166       36,277,159   26    71,739  53.7   35    44
        MM0564   229,499,055       20,198,978   26    63,440  75.2   35    163
        MM0566   257,298,896       23,935,262   26    63,266  71.7   36    94
        MM0570   217,279,470       37,449,860   26    21,614  51.8   37    45
        MM0571   235,143,912       32,104,934   26    21,646  52.8   38    45
        MM0581   249,969,333       29,613,817   26    56,733  91.7   37    407

Table 6.9: Six seven-lane RNA-Seq libraries were assembled and merged to create the contig sets. These contigs were used as input for the classifier.
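As an illustration of how the summary columns of Table 6.9 are defined, the sketch below computes the contig count, minimum, maximum, average and median size, and N50 from a list of contig lengths. It uses the standard definition of N50 (the largest length L such that contigs of length at least L contain half of all assembled bases); the function names and toy lengths are hypothetical.

```python
from statistics import mean, median


def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty contig set")


def summarise(lengths):
    """Summary statistics of the kind reported per library in Table 6.9."""
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
    }


# Toy contig lengths in bp (hypothetical, not thesis data).
stats = summarise([2, 3, 4, 5, 6])
```

A median far below the N50, as in most rows of Table 6.9, indicates that a large number of very short contigs coexists with a small number of long ones.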

6.2.4 Contig to annotation mapping

The unfiltered contig set from each library was mapped to known protein coding mRNAs and non-coding RNAs found in the Ensembl and fRNAdb databases using a range of thresholds from 0.7 to 1.0. Figure 6.3 shows the number of contigs that map to annotated protein coding and non-coding elements at different thresholds, for filtered and unfiltered contigs respectively. From this figure we make a number of observations. First, fRNAdb non-coding elements are mapped in lower numbers than Ensembl elements, but the counts are still in the order of hundreds and are likely sufficient for classification experiments. Second, a comparison between filtered and unfiltered contigs shows that filtering appears to affect the non-coding RNA sequences in fRNAdb but not the Ensembl sequences. Third, as the mapping threshold increases, the number of annotated contigs drops quite uniformly for both coding and non-coding transcripts; it is therefore not obvious whether a single threshold value is practical for all our mapping, which is a possible topic of future work.
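The threshold sweep behind Figure 6.3 can be sketched as follows. The sketch assumes that each contig has a single best mapping score in [0, 1] (for example, the fraction of the contig covered by its best BLAT alignment); the exact score definition is an assumption here, and the contig identifiers and scores are hypothetical. The step of 0.02 matches the thresholds listed in Tables 6.10 to 6.12.

```python
def thresholds(start=0.70, stop=1.00, step=0.02):
    """Evenly spaced mapping thresholds, rounded to avoid float drift."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


def count_mapped(best_scores, cutoffs):
    """For each cutoff, count contigs whose best mapping score reaches it.

    best_scores: dict of contig id -> best mapping score (assumed in [0, 1]).
    """
    return {t: sum(1 for s in best_scores.values() if s >= t) for t in cutoffs}


# Hypothetical best scores for four contigs.
scores = {"contig_a": 0.95, "contig_b": 0.71, "contig_c": 1.0, "contig_d": 0.5}
counts = count_mapped(scores, [0.7, 0.9, 1.0])
```

Because the counts can only fall as the cutoff rises, any single threshold trades the number of labelled contigs against the quality of their labels, which is the difficulty noted in the text.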

We further investigate both coding and non-coding annotation sets by breaking down

individual biotypes (Figure 6.4) and non-coding RNA families (Figure 6.5). From these two

figures, it is evident that not all types are represented in this mapping, indicating that either

their transcripts are not mapped well with the contig set, or are not present at high enough

levels in the RNA-Seq library, given the protocol and sequencing depth. From Figure 6.5,

for thresholds between 0.7 and 1.0, there are not enough individual RNA types found in the

fRNAdb dataset to perform pairwise or multiclass RNA classification as was done in section

6.1.3. For classification using contig sets, we focus on the binary coding vs. non-coding

classification problem.

6.2.5 Contig cross validation

Feature values were computed from contig sequences in the same manner as for the database annotated sequences in earlier sections. We performed the mapping, feature extraction and classification on all six transcriptome libraries. All resulted in similar findings and performance; in the interest of space and to avoid repetition, we do not include all of the results in this thesis.

Contigs were mapped to protein coding or non-coding sequences using the mapping criteria in section 6.2.4, resulting in sets of contigs labelled as Ensembl protein coding RNAs and fRNAdb non-coding RNAs for a range of mapping thresholds from 0.7 to 1.0. We performed binary classification between the labelled contigs. The top half of Table 6.10 summarises the classification performance for the labelled contigs derived from library MM0564. We also performed the same classification using the original annotation sequence that each contig represented, presented as ‘DB elements’ in the lower half of the table. For these experiments, the total accuracies are quite consistent for both sets. However, the precision and recall for the non-coding sequences are low relative to the accuracy. This is most likely caused by the difference in sample size, as there are more coding contigs than non-coding contigs.
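The effect of class imbalance on these measures can be seen in a small worked example. The confusion counts below are hypothetical (they are not thesis results) and treat non-coding as the positive class, as in the [nc] columns of Table 6.10.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall, with non-coding as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical test set: 1000 coding and 100 non-coding contigs.
# 75 non-coding contigs are recovered, 25 are missed, and 15 coding
# contigs are wrongly called non-coding.
accuracy, precision, recall = binary_metrics(tp=75, fp=15, fn=25, tn=985)
```

Here the accuracy stays above 96% because the large coding class dominates the total, while a quarter of the non-coding contigs are missed; the minority class contributes little to the accuracy but fully determines the recall.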

To avoid the effect on performance due to differences in sample sizes between the two

classes, a stratified test set is made so that each class is equal in size. Table 6.11 shows


[Figure 6.3 plots: four panels, (a) Ensembl / filtered mapped, (b) Ensembl / unfiltered mapped, (c) fRNAdb / filtered mapped, (d) fRNAdb / unfiltered mapped. x axis: BLAT alignment thresholds (0.70 to 1.00); y axis: number of annotations. Legend: MM0490, MM0564, MM0566, MM0570, MM0571, MM0581.]

Figure 6.3: Number of unique contigs that map to the sequence annotation databases fRNAdb and Ensembl using a range of mapping thresholds for all six mouse libraries. (a) and (c) represent the filtered contig set mappings, (b) and (d) represent the unfiltered contig set mappings.


Source        Threshold  Elements  Accuracy  Precision  Recall
                                   [nc]      [nc]       [nc]
Contigs       0.70       20981     95.4      0.857      0.754
              0.72       19984     95.6      0.855      0.752
              0.74       19047     95.9      0.865      0.758
              0.76       18102     96.1      0.869      0.751
              0.78       17160     96.5      0.874      0.759
              0.80       16299     96.7      0.881      0.760
              0.82       15322     96.8      0.883      0.737
              0.84       14418     96.9      0.885      0.732
              0.86       13442     97.0      0.886      0.715
              0.88       12371     96.8      0.872      0.662
              0.90       11126     97.0      0.865      0.646
              0.92       9765      97.3      0.855      0.669
              0.94       8175      97.8      0.852      0.707
              0.96       6243      98.1      0.875      0.757
              0.98       3638      98.7      0.879      0.836
              1.00       77        97.4      0.974      1.00
              Average    12884     96.9      0.877      0.750
DB elements   0.70       21087     96.3      0.868      0.840
              0.72       20084     96.4      0.867      0.835
              0.74       19139     96.6      0.869      0.838
              0.76       18191     96.8      0.871      0.832
              0.78       17243     97.0      0.867      0.837
              0.80       16381     97.0      0.866      0.831
              0.82       15400     97.1      0.870      0.815
              0.84       14491     97.3      0.868      0.814
              0.86       13504     97.4      0.865      0.810
              0.88       12427     97.5      0.857      0.801
              0.90       11176     97.8      0.865      0.811
              0.92       9812      98.0      0.868      0.813
              0.94       8210      98.1      0.861      0.797
              0.96       6277      98.1      0.878      0.778
              0.98       3655      98.1      0.891      0.687
              1.00       91        100.0     1.00       1.00
              Average    12948     97.5      0.877      0.821

Table 6.10: Classification performance using the contigs from the library MM0564, using the full feature set. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to.


[Figure 6.4 plots: two panels, (a) Ensembl filtered MM0564 contigs and (b) Ensembl unfiltered MM0564 contigs. x axis: mapping threshold (0.70 to 1.00); y axis: number of annotations (1 to 10000). Legend: snRNA, snoRNA, rRNA, pseudogene, misc_RNA, miRNA, lincRNA, protein_coding.]

Figure 6.4: Ensembl transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual biotypes.

the performance of the classifier on this stratified set for the same contigs. In comparison

to Table 6.10 it is evident that the accuracy decreases slightly, but, at the same time, the

precision and recall rise to comparable levels with the accuracy.
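One simple way to build an equal-sized test set of this kind is to subsample every class down to the size of the smallest class. The sketch below assumes random subsampling with a fixed seed; the text does not state how the stratified set was drawn, so this is an illustration rather than the procedure used for Table 6.11. All names and items are hypothetical.

```python
import random


def balance_classes(items, label_of, seed=0):
    """Randomly subsample each class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    by_label = {}
    for item in items:
        by_label.setdefault(label_of(item), []).append(item)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    return balanced


# Hypothetical labelled contigs: ten coding and three non-coding.
contigs = [("c%d" % i, "coding") for i in range(10)]
contigs += [("n%d" % i, "non-coding") for i in range(3)]
balanced = balance_classes(contigs, label_of=lambda item: item[1])
```

The balanced set is always an even number of elements for two classes, which is consistent with the element counts listed in Table 6.11.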

The underlying cause of the difference in classification performance across threshold values is not immediately clear: the trend could be a result of the rising threshold itself or simply of the decrease in the number of elements tested. However, we note that accuracy increases for the contigs as the threshold increases, while the accuracy for the database elements does not change to the same degree. This suggests that the number of elements in the test set is not responsible for the difference in performance. The only remaining difference is the quality of the contig sequences, as determined by the threshold values. Comparing the performance between the contigs and the database elements shows that they converge to approximately the same accuracy as the threshold approaches 1.0 (both to 96% in Table 6.11). Lower thresholds produce lower classification results. This suggests that higher thresholds force the mapped contigs to resemble real coding and non-coding sequences, improving the performance of the classifier. But at the same time as the threshold increases there are fewer elements to train


[Figure 6.5 plots: two panels, (a) fRNAdb filtered MM0564 contigs and (b) fRNAdb unfiltered MM0564 contigs. x axis: mapping threshold (0.70 to 1.00); y axis: number of annotations (1 to 500). Legend: flysmallRNA, matmiRNA, misc, piRNA, premiRNA, rRNA, snoRNA, snRNA, tRNA.]

Figure 6.5: fRNAdb transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual RNA types.

and test the classifier. These observations again show the difficulty of choosing a suitable value, or set of values, for the threshold. This is a major issue that must be considered in order to classify raw contig sequences.

PORTRAIT was also applied to the contig sets and the database annotations in the same way as our classifier. Feature computation was not possible for the contig sets due to software errors. However, we were able to extract the features from the database elements mapped to the contigs. The results on the database elements for SSGC and PORTRAIT are compared in Table 6.12. The accuracy is comparable for both methods on the unbalanced set but differs considerably on the stratified set, where the numbers of protein coding and non-coding elements are equal (on average 95.7% for SSGC and 90.5% for PORTRAIT). This again illustrates the effect of unbalanced class sizes in our dataset.

6.2.6 Full contig set classification

The cross-validation experiments in the previous sections were applied to labelled data sets.

From the tens of millions of contigs produced in the assembly, only tens of thousands were


Source           Threshold  Elements  Accuracy  Precision  Recall
                                      [nc]      [nc]       [nc]
Contigs - Strat  0.70       5226      93.4      0.933      0.935
                 0.72       4738      92.8      0.927      0.929
                 0.74       4308      94.1      0.942      0.939
                 0.76       3870      93.3      0.933      0.933
                 0.78       3462      93.8      0.936      0.940
                 0.80       3124      94.2      0.943      0.940
                 0.82       2734      94.0      0.936      0.944
                 0.84       2438      94.4      0.944      0.943
                 0.86       2120      94.7      0.950      0.943
                 0.88       1814      94.2      0.940      0.945
                 0.90       1484      94.6      0.947      0.945
                 0.92       1196      95.3      0.956      0.950
                 0.94       860       93.6      0.937      0.935
                 0.96       668       94.6      0.954      0.937
                 0.98       330       96.1      0.987      0.933
                 1.00       4         -         -          -
                 Average    2558      94.2      0.944      0.939
DB - Strat       0.70       5398      95.3      0.946      0.960
                 0.72       4900      95.1      0.943      0.960
                 0.74       4462      95.6      0.949      0.963
                 0.76       4018      95.7      0.951      0.964
                 0.78       3602      95.3      0.946      0.960
                 0.80       3262      95.2      0.945      0.961
                 0.82       2864      95.6      0.948      0.964
                 0.84       2558      96.2      0.955      0.971
                 0.86       2222      96.4      0.959      0.970
                 0.88       1904      95.9      0.950      0.967
                 0.90       1566      95.9      0.957      0.962
                 0.92       1274      95.5      0.956      0.954
                 0.94       916       95.2      0.956      0.948
                 0.96       722       96.1      0.961      0.961
                 0.98       358       96.4      0.956      0.972
                 1.00       4         -         -          -
                 Average    2502      95.7      0.952      0.962

Table 6.11: Classification performance for the stratified contigs from library MM0564, using the full feature set. In contrast to Table 6.10, the number of elements in each class is equal. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to. Note that for thresholds at 1.0, there are not enough elements to perform classification.


                                 SSGC                     PORTRAIT
Type   Threshold  Elements       Acc.   Prec   Recall     Acc.   Prec   Recall
All    0.70       21087/20669    96.3   0.868  0.840      96.2   0.969  0.989
       0.72       20084/19689    96.4   0.867  0.835      96.4   0.970  0.990
       0.74       19139/18765    96.6   0.869  0.838      96.5   0.971  0.990
       0.76       18191/17827    96.8   0.871  0.832      96.6   0.973  0.991
       0.78       17243/16901    97.0   0.867  0.837      96.8   0.974  0.992
       0.80       16381/16062    97.0   0.866  0.831      97.1   0.976  0.993
       0.82       15400/15111    97.1   0.870  0.815      97.1   0.975  0.994
       0.84       14491/14224    97.3   0.868  0.814      97.2   0.975  0.995
       0.86       13504/13259    97.4   0.865  0.810      97.3   0.976  0.996
       0.88       12427/12197    97.5   0.857  0.801      97.3   0.976  0.996
       0.90       11176/10967    97.8   0.865  0.811      97.6   0.978  0.997
       0.92       9812/9621      98.0   0.868  0.813      97.5   0.977  0.997
       0.94       8210/8060      98.1   0.861  0.797      97.9   0.980  0.998
       0.96       6277/6124      98.1   0.878  0.778      98.2   0.983  0.999
       0.98       3655/3570      98.1   0.891  0.687      98.6   0.987  0.999
       1.00       91/6           100.0  1.000  1.000      100.0  1.000  1.000
       Average                   97.5   0.877  0.821      97.4   0.978  0.995
Strat  0.70       5398/4584      95.3   0.946  0.960      91.4   0.908  0.920
       0.72       4900/4132      95.1   0.943  0.960      91.2   0.902  0.925
       0.74       4462/3728      95.6   0.949  0.963      91.3   0.908  0.918
       0.76       4018/3304      95.7   0.951  0.964      91.3   0.909  0.918
       0.78       3602/2926      95.3   0.946  0.960      91.0   0.903  0.919
       0.80       3262/2630      95.2   0.945  0.961      91.1   0.905  0.919
       0.82       2864/2292      95.6   0.948  0.964      91.0   0.900  0.921
       0.84       2558/2030      96.2   0.955  0.971      91.1   0.901  0.924
       0.86       2222/1738      96.4   0.959  0.970      89.8   0.895  0.902
       0.88       1904/1450      95.9   0.950  0.967      89.4   0.886  0.905
       0.90       1566/1154      95.9   0.957  0.962      89.2   0.879  0.908
       0.92       1274/894       95.5   0.956  0.954      87.8   0.867  0.893
       0.94       916/616        95.2   0.956  0.948      88.6   0.881  0.893
       0.96       722/416        96.1   0.961  0.961      89.4   0.890  0.899
       0.98       358/188        96.4   0.956  0.972      93.1   0.926  0.936
       1.00       4/4
       Average                   95.7   0.952  0.962      90.5   0.897  0.913

Table 6.12: Classification performance for the database sequences mapped by the unfiltered contig sets from MM0564; each classification is compared with PORTRAIT. The precision and recall are only shown for the non-coding class. We were not able to compare the classification accuracies for the actual contig sets themselves. Note that the number of elements (listed as SSGC/PORTRAIT) is lower for PORTRAIT due to the size restrictions on its input.


used in the cross validation experiments. In this section, we investigate the use of SSGC

applied on the full contig set. From the unannotated contig sequences, we attempt to use

the classifier predictions to find potential novel non-coding and protein coding transcripts

in the data.

We created an SVM model from 3124 annotated contig sequences that represent both

classes, in equal proportions, from the mouse library MM0564 using 0.8 as the mapping

threshold. The SVM model was applied on the entire contig set to obtain a class prediction

and a confidence value, the p-value (Figure 4.3).

Each contig is assigned a p-value in [0,1], where a value below 0.5 is classified as protein coding and a value above 0.5 is classified as non-coding. Figure 6.6 shows the distribution of contig predictions and of the p-values. The p-values are heavily skewed towards very high values, suggesting that the vast majority of the assembled contigs are strongly predicted to be non-coding. Figure 6.7 shows the mapping scores and sizes of contigs at either extreme of the p-value distribution, which are therefore likely to be protein coding or non-coding. We looked closely at possible novel non-coding and protein coding contigs by examining sequences with p-values above 0.95 or below 0.05 that do not align, using BLAT, to any known mm9 mouse fRNAdb or Ensembl protein coding sequence.
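The selection of candidate novel contigs described above amounts to a simple filter, sketched below. The cut-offs follow the text and Figure 6.7 (p-value ≤ 0.05 or ≥ 0.95, with no known alignment). The function names, the record layout and the treatment of a p-value of exactly 0.5 are assumptions; the two named contigs are the examples discussed in this section, and the remaining records are hypothetical.

```python
def predicted_class(p_noncoding):
    """Class call from the non-coding p-value. The text leaves a value of
    exactly 0.5 undefined; it is assigned to protein coding here."""
    return "non-coding" if p_noncoding > 0.5 else "protein coding"


def novel_candidates(contigs, low=0.05, high=0.95):
    """Split unmapped, confidently predicted contigs into two candidate lists.

    contigs: iterable of (contig id, non-coding p-value, has known alignment).
    Returns (protein coding candidates, non-coding candidates).
    """
    coding, noncoding = [], []
    for contig_id, p_value, has_alignment in contigs:
        if has_alignment:
            continue  # already annotated, so not a novel candidate
        if p_value >= high:
            noncoding.append(contig_id)
        elif p_value <= low:
            coding.append(contig_id)
    return coding, noncoding


contigs = [
    ("k50:177614", 1.0, False),   # example from the text (Figure 6.8)
    ("k29:3267973", 0.0, False),  # example from the text (Figure 6.9)
    ("toy:1", 0.99, True),        # hypothetical: confident but already mapped
    ("toy:2", 0.60, False),       # hypothetical: unmapped but not confident
]
coding, noncoding = novel_candidates(contigs)
```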

Our analysis of potential non-coding contigs shows that many are found in intronic and UTR regions of known genes. Figure 6.8 shows one such contig, k50:177614, in the UCSC Genome Browser [62]; it has a p-value of 1.0 and no BLAT alignments to any fRNAdb or Ensembl protein coding sequence. It is likely that this sequence is located within a novel polyadenylation tail of the gene Fstl4. Although there is no evidence that the sequence is functional, its location in the 3′ tail suggests that the classifier was correct in classifying the contig as non-coding.

Figure 6.9 represents the alignment of contig k29:3267973 to the mm9 mouse genome.

The aligned RNA-Seq reads show pileups that resemble a spliced gene. In addition, the

exonic regions are highly conserved across some species. Figure 6.10 shows the contig with

the mouse sequence coordinate lifted from the mouse mm9 genome to the human hg18

genome using the UCSC LiftOver tool [62]. From the viewer, it is evident that one of the

exons is aligned to the AceView Gene Model glertee.aApr07. This suggests that the classifier

was correct in classifying the contig as protein coding.

Our analysis shows that many potential novel protein coding contigs are aligned to


[Figure 6.6 plots: four panels. (a) MM0564 contig predictions: number of contigs predicted as protein coding and as non-coding. (b) Histogram of the non-coding RNA p-value (0.0 to 1.0) against number of contigs for all contigs. (c) Contigs with no alignments: histogram of the non-coding RNA p-value against number of contigs. (d) Contigs ≥ 500bp: histogram of the non-coding RNA p-value against number of contigs.]

Figure 6.6: Each contig in the full MM0564 contig set is classified by the SVM model and assigned a probability. Contigs with p-values below 0.5 are classified as protein coding, while contigs with p-values above 0.5 are classified as non-coding. (a) is the class prediction for all contigs. (b) is the p-value distribution of all the contigs, (c) is the p-value of contigs with no alignments to any known non-coding transcripts. (d) is the p-value for all contigs 500bp and larger.


[Figure 6.7 plots: four panels. (a) Contigs / p-value ≤ 0.05: protein coding mapping scores (0.0 to 2.0) against number of contigs (log). (b) Contigs / p-value ≥ 0.95: non-coding RNA mapping scores (0.0 to 2.0) against number of contigs (log). (c) p-value ≤ 0.05 and no mapping: contig size (bp, 0 to 60000) against number of contigs (log). (d) p-value ≥ 0.95 and no mapping: contig size (bp, 0 to 60000) against number of contigs (log).]

Figure 6.7: Mapping scores and sizes of contigs strongly predicted as protein coding (p-value ≤ 0.05) and non-coding (p-value ≥ 0.95). a,b) Distribution of mapping scores with the best-aligned a) protein-coding Ensembl sequence, b) non-coding fRNAdb sequence. c,d) Distribution of contig sizes (white). In (c), the red regions represent strongly protein coding (p-value ≤ 0.05) which do not map to any known sequences in Ensembl or fRNAdb. In (d), the orange regions represent strongly non-coding (p-value ≥ 0.95) which do not map to any known sequences.


[Figure 6.8 image: UCSC Genome Browser view of mouse chr11, approximately 52,995,000 to 53,015,000, showing the BLAT alignment of contig k50:177614, the MM0564 contig track, read pileups for libraries MM0490, MM0564, MM0566, MM0570, MM0571 and MM0581, gene annotation tracks (including Fstl4), mammalian conservation, SNPs and RepeatMasker.]

Figure 6.8: Contig k50:177614 aligned in the mouse mm9 genome. The top track represents the multiple contigs that are mapped to this location. The second set of tracks are the pileups for the RNA-Seq read alignments for the six mouse transcriptome libraries. Below the contig track is the gene track and the conservation track. This contig has a p-value of 1.0 and does not map to any known non-coding or protein coding sequences.


transcripts that are similar to previously known protein coding sequences, which are not yet

labelled as protein coding in the Ensembl database.

From these two simple examples, we demonstrate the ability of SSGC to detect potential novel coding and non-coding contigs in the full contig set. On manual inspection, sequences at either extreme of the p-value distribution do resemble real non-coding and protein coding elements. However, while SSGC is potentially useful as a gene finder, especially for novel sequences, its ability is currently limited. For practical use, it would be desirable to distinguish a real transcript from an assembly artifact, and to distinguish functional from non-functional non-coding RNAs.

6.3 Feature ranking

We also investigate the effectiveness of the features used in the classification experiments by ranking features under different conditions. Table 6.13 shows the top twenty ranked features for the classification experiments between Ensembl protein coding and fRNAdb non-coding sequences, as in section 6.1.3.

The first two columns represent the ranked features used in the binary classification between coding and non-coding sequences. ORF-related features are prevalent in the list, which is understandable as non-coding sequences are not expected to contain ORFs. We also see the importance of the trigrams TAG and TAA in the first four columns. These are two of the three stop codons that terminate an ORF. We can also observe that a number of features not available in the de novo set are important for this binary classification. This is again understandable, as we would expect the number of exons to be important in identifying non-coding RNAs. Conservation is also represented, further supporting the notion that protein coding sequences are much better conserved than non-coding sequences in the genome.
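To make the sequence-derived features in this discussion concrete, the sketch below computes an ORF proportion and k-mer frequencies (which include trigrams such as TAG and TAA) for a single sequence. The exact feature definitions used by SSGC are not given in this section; the sketch assumes that the ORF proportion is the length of the longest ATG-to-stop ORF on the forward strand divided by the sequence length, and all function names are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in bases (stop codon included) of the longest ATG-to-stop ORF
    found in any of the three forward reading frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_proportion(seq):
    """Fraction of the sequence covered by its longest forward-strand ORF."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer in the sequence."""
    seq = seq.upper()
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


toy = "CCATGAAATAGGG"  # hypothetical sequence with one ORF: ATG AAA TAG
proportion = orf_proportion(toy)
trigrams = kmer_frequencies("TAGTAG", 3)
```

A fragmented contig can lose the start or stop codon of a real ORF, which would lower this feature for coding contigs relative to their full-length database sequences.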

The multiclass experiments are shown in the middle and the last pair of columns. We observe that once protein coding sequences are removed from the classifier (last two columns), new features emerge in the list, notably length and secondary structure. Length is a key feature for distinguishing some of the smaller non-coding RNAs from the larger ones. The secondary structure based feature ‘Total energy’ likely plays a larger role as some RNA types are known to have very distinct conformations.

We also examine the effectiveness in classification using subsets of the top-ranked features

using the information gain ranking filter. Table 6.14 represents the performance for the


[Figure 6.9 image: UCSC Genome Browser view of mouse chr7, approximately 20,045,000 to 20,070,000, showing the BLAT alignment of contig k29:3267973, the MM0564 contig track, read pileups for libraries MM0490, MM0564, MM0566, MM0570, MM0571 and MM0581, gene annotation tracks (including Mark4), mammalian conservation, SNPs and RepeatMasker.]

Figure 6.9: Contig k29:3267973 aligned in the mouse mm9 genome. Similar to Figure 6.8, the tracks represent the assembled contigs, RNA-Seq read pileups, the contig, known gene annotations, and conservation. This contig has a p-value of 0 and does not map to any known non-coding or protein coding sequences.


[Figure 6.10 image: UCSC Genome Browser view of human chr19, approximately 50,425,000 to 50,450,000, showing the mm9 LiftOver coordinate and BLAT alignment of contig k29:3267973, gene annotation tracks (including EXOC3L2 and MARK4), AceView gene models (including glertee.aApr07), ENCODE histone mark, DNaseI hypersensitivity and transcription factor ChIP-seq tracks, and mammalian conservation.]

Figure 6.10: Contig k29:3267973 (from Figure 6.9) represented in the human hg18 genome, using the LiftOver tool from the UCSC Genome Browser [62]. The tracks represent the contig coordinate (from the LiftOver), the contig BLAT alignment, known human gene models, histone modification tracks, and the conservation.


      Coding vs. non-coding                 Multiclass (Prot + RNA)                         Multiclass (RNA)
Rank  All features         de novo          All features                   de novo          All features            de novo
1     ORF proportion       ORF proportion   ORF proportion                 ORF proportion   conserv-Num-bases-cov   length-(bp)
2     ORF-size             ORF-size         length-(bp)                    length-(bp)      histones-Num-bases-cov  Total-energy
3     TAG                  TAG              Conserved areas with coverage  TAG              length-(bp)             length
4     ORF score            ORF score        TAG                            ORF-size         conserv-Num-bases       ORF proportion
5     Number of exons (h)  CG               Histones with coverage         TA               Total-energy            TG
6     Number of exons (c)  TA               ORF-size                       ORF score        length                  GA
7     CG                   CGA              Bases with conservation        T                ORF proportion          GT
8     Conserved exons      TTA              TA                             TT               TG                      GC-content
9     TA                   TAA              ORF score                      CG               GA                      G
10    Conservation score   aaD              T                              Total-energy     GT                      A
11    CGA                  TTT              TT                             GC-content       GC-content              T
12    TTA                  TT               CG                             GA               G                       AT
13    TAA                  CCG              Total-energy                   TAA              A                       AG
14    aaD                  T                GC-content                     TTA              T                       TGA
15    TTT                  CGG              GA                             TTT              AT                      TC
16    TT                   GTA              TAA                            GTT              AG                      C
17    CCG                  GGA              TTA                            GGA              TGA                     ORF end
18    T                    GTT              TTT                            GC               TC                      AC
19    CGG                  GAC              GTT                            G                C                       CT
20    GTA                  TCG              GGA                            GTA              ORF end                 CA

Table 6.13: The top twenty ranked features based on classification effectiveness from the Ensembl and fRNAdb datasets. The first pair of columns lists the most effective features from the binary class experiments, coding versus non-coding. The second pair of columns lists the features for the multiclass experiments considering RNA types and proteins. The last pair of columns is from the multiclass experiments using only RNA types. Both the complete feature set and the de novo feature set are considered in each of the three experiment types.


binary classification experiment between Ensembl protein coding and fRNAdb non-coding RNAs. Starting with the top ranked feature, ‘ORF proportion’, we run the classifier, then add features one at a time in order of their rank and classify at each step. Performance rises quickly as the first features are added and then levels off: the accuracy reaches 95.2% with the top 13 features and is 94.8% with the top 20. The complete feature set achieved an accuracy of 96.3%.

Feats  Features added        Accuracy  Precision  Recall
1      ORF proportion        74.8      0.649      0.675
2      ORF-size              91.5      0.936      0.823
3      TAG                   92.3      0.912      0.872
4      ORF prediction score  92.8      0.925      0.872
5      Number of exons (h)   94.0      0.944      0.889
6      Number of exons (c)   94.0      0.944      0.887
7      CG                    94.6      0.949      0.900
8      Conserved exons       94.5      0.949      0.896
9      TA                    94.5      0.948      0.900
10     Conservation score    94.8      0.950      0.906
11     CGA                   95.0      0.948      0.914
12     TTA                   95.1      0.948      0.916
13     TAA                   95.2      0.948      0.919
14     aaD                   95.0      0.942      0.919
15     TTT                   94.9      0.941      0.918
16     TT                    94.9      0.941      0.918
17     CCG                   95.0      0.941      0.920
18     T                     94.8      0.939      0.916
19     CGG                   94.8      0.939      0.916
20     GTA                   94.8      0.938      0.918

Table 6.14: Classification performance for the binary classifier when the top twenty ranked features from the Ensembl and fRNAdb datasets are added incrementally. As more features are added, there is a steady rise in the accuracy, precision and recall. The full model containing all features has an accuracy of 96.3%, precision of 0.966, and recall of 0.976 as shown in Table 6.3.
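The idea behind an information gain ranking can be sketched in a few lines. This is a simplified illustration, not the ranking filter used to produce Tables 6.13 and 6.14: it assumes each numeric feature is split at one fixed, hypothetical threshold, whereas practical filters choose the discretisation from the data. The feature values and labels are toy data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Reduction in label entropy after splitting a feature at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


def rank_features(table, labels, split_points):
    """Feature names ordered from highest to lowest information gain."""
    gains = {name: information_gain(values, labels, split_points[name])
             for name, values in table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Toy data: two coding ("c") and two non-coding ("n") sequences.
labels = ["c", "c", "n", "n"]
table = {
    "gc_content": [0.4, 0.6, 0.4, 0.6],      # uninformative in this toy set
    "orf_proportion": [0.9, 0.8, 0.1, 0.2],  # separates the classes perfectly
}
ranking = rank_features(table, labels, {"gc_content": 0.5, "orf_proportion": 0.5})
```

The incremental experiment in Table 6.14 then corresponds to training the classifier on the first feature of such a ranking, then the first two, and so on.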

Chapter 7

Conclusion and future work

Over a short period, our understanding of non-coding RNA has increased dramatically. RNA is no longer viewed as just an intermediate in protein synthesis; non-coding RNAs have been shown to be involved in numerous roles in cell biology. At the same time, advancements in transcriptome studies using RNA-Seq have continued to provide a platform for new research. Our work explored non-coding RNA prediction using an RNA-Seq approach.

7.1 Summary

In this thesis, we present a method and software for classifying transcript sequences as

protein coding vs non-coding, and extend this to distinguish different non-coding RNA

families, which has not been reported in the literature. We also propose a method for

classifying de novo transcriptome contigs from short read RNA-Seq data.

Our results show that the performance of our classifier is comparable to, or in most cases

surpasses, what is reported in the current literature, and suggest that machine learning

based methods can be used to discriminate between different families of non-coding RNA.

The software tools generated in this work are designed to be modular and to be modified

to suit particular needs.

As the number of transcriptome studies continues to increase, especially de novo, non-reference-based studies, we expect more methods to emerge for handling their sometimes noisy output sequences. Our investigation into assembled contigs indicates that classifiers can be expected to contribute to such studies. With improvements in our

understanding of non-coding RNAs, and in the quality of non-coding RNA databases, transcriptome experiments and assembly algorithms, we expect that machine learning approaches to such problems will continue to improve.

7.2 Future work

Here, we outline a number of areas for improving the calculations described, and directions

that we have yet to explore.

• In our investigation of the full contig set, we found many elements that seem to be neither functional protein-coding nor non-coding sequences, e.g. fragmented contigs and transcript runoffs in intronic and UTR regions of genes. Depending on the assembly used, we have seen many fragmented contigs that cannot be merged. These fragmented contigs may have features that could be used to assign them to an alternative class of non-functional non-coding RNAs.

• In a true de novo setting, classification would be applied to a species without a well-annotated genome sequence, and we cannot expect database-annotated coding and non-coding sequences to be available for every species. To assess a strictly de novo classifier, we must also explore building models on one training species and testing on another.

• Using relative RNA-Seq read coverage as a classifier feature has been shown to be

effective [30, 59, 77]. While this could be done for transcripts and de novo contigs,

our initial focus was on de novo methodologies, and we did not assess this. A quick

follow up could add the RNA-Seq read coverage for each transcript or contig.

• In our collaboration with the Trans-ABySS group, we also assessed the detection of polyadenylation sites within both transcript and contig sequences [110]. This could serve as a source of information when inferring the direction of a transcript, as could a search for the polyadenylation signals found in some 3′ UTRs. Currently, some features in the feature extraction are not optimised for reverse-complement inputs; this is a topic for further study.

• We assessed only one contig assembly program, ABySS [120], in the de novo setting. De novo assembly requires higher coverage than reference-based methods for

reconstructing the transcriptome. It is possible that reference-based methods [40, 126] could increase the sensitivity of transcript detection, although they are also known to increase false positive results. Evaluating the performance of our classifier with reference-based assembly may also be of interest.

• Our study, along with many others that utilise RNA-Seq, uses protocols designed primarily for sequencing protein-coding transcripts. Alternative sequencing protocols are available that allow the detection of many small non-coding sequences such as miRNAs. As many non-coding RNAs are small, investigating these protocols may provide a more informative framework in which to test our classifier.

• This thesis investigated different non-coding RNA types and families, and for that task we focussed mainly on the types found in fRNAdb. Rfam is another database annotated by RNA family. However, in our experience it was difficult to work with, as many families had very few entries and some entries belonged to many families. Given its strong growth over the years, we do not want to abandon this resource because of these factors, and feel that it should be investigated again.

Bibliography

[1] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell. Garland Science, 270 Madison Avenue, New York, New York, 5th edition, 2008.

[2] Paulo P. Amaral, Michael B. Clark, Dennis K. Gascoigne, Marcel E. Dinger, and John S. Mattick. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Research, 39(suppl 1):D146–D151, 01 2011.

[3] Roberto Arrial, Roberto Togawa, and Marcelo Brigido. Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis. BMC Bioinformatics, 10(1):239, 2009.

[4] Yan W. Asmann, Michael B. Wallace, and E. Aubrey Thompson. Transcriptome profiling using next-generation sequencing. Gastroenterology, 135(5):1466–1468, 11 2008.

[5] Courtney C. Babbitt, Olivier Fedrigo, Adam D. Pfefferle, Alan P. Boyle, Julie E. Horvath, Terrence S. Furey, and Gregory A. Wray. Both noncoding and protein-coding RNAs contribute to gene expression evolution in the primate brain. Genome Biology and Evolution, 2010(0):67–79, 2010.

[6] JH Badger and GJ Olsen. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol, 16(4):512–524, 1999.

[7] Asa Ben-Hur, Cheng Soon Ong, Soren Sonnenburg, Bernhard Scholkopf, and Gunnar Ratsch. Support vector machines and kernels for computational biology. PLoS Comput Biol, 4(10):e1000173, 10 2008.

[8] E. Birney, J.A. Stamatoyannopoulos, A. Dutta, R. Guig, T.R. Gingeras, E.H. Margulies, Z. Weng, M. Snyder, and E.T. Dermitzakis. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 06 2007.

[9] Inanc Birol, Shaun D. Jackman, Cydney B. Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. De novo transcriptome assembly with ABySS. Bioinformatics, 25(21):2872–2877, 11 2009.

[10] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, Cambridge CB3 0FB, U.K., 2006.

[11] Brigitte Boeckmann, Amos Bairoch, Rolf Apweiler, Marie-Claude Blatter, Anne Estreicher, Elisabeth Gasteiger, Maria J. Martin, Karine Michoud, Claire O’Donovan, Isabelle Phan, Sandrine Pilbout, and Michel Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 1 2003.

[12] Dario Boffelli, Jon McAuliffe, Dmitriy Ovcharenko, Keith D. Lewis, Ivan Ovcharenko, Lior Pachter, and Edward M. Rubin. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299(5611):1391–1394, 02 2003.

[13] George A. Calin, Chang-gong Liu, Manuela Ferracin, Terry Hyslop, Riccardo Spizzo, Cinzia Sevignani, Muller Fabbri, Amelia Cimmino, Eun Joo Lee, Sylwia E. Wojcik, Masayoshi Shimizu, Esmerina Tili, Simona Rossi, Cristian Taccioli, Flavia Pichiorri, Xiuping Liu, Simona Zupo, Vlad Herlea, Laura Gramantieri, Giovanni Lanza, Hansjuerg Alder, Laura Rassenti, Stefano Volinia, Thomas D. Schmittgen, Thomas J. Kipps, Massimo Negrini, and Carlo M. Croce. Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell, 12(3):215–229, Sep 2007.

[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a Library for Support Vector Machines. National Taiwan University, 2001.

[15] F. Chiaromonte, R. J. Weber, K. M. Roskin, M. Diekhans, W. J. Kent, and D. Haussler. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harbor Symposia on Quantitative Biology, 68:245–254, 01 2003.

[16] Liam Childs, Zoran Nikoloski, Patrick May, and Dirk Walther. Identification and classification of ncRNA molecules using graph properties. Nucleic Acids Research, 37(9):e66, 05 2009.

[17] Rebecca Chodroff, Leo Goodstadt, Tamara Sirey, Peter Oliver, Kay Davies, Eric Green, Zoltan Molnar, and Chris Ponting. Long noncoding RNA genes: conservation of sequence and brain expression among diverse amniotes. Genome Biology, 11(7):R72, 2010.

[18] Michele Clamp, Ben Fry, Mike Kamal, Xiaohui Xie, James Cuff, Michael F. Lin, Manolis Kellis, Kerstin Lindblad-Toh, and Eric S. Lander. Distinguishing protein-coding and noncoding genes in the human genome. Proceedings of the National Academy of Sciences, 104(49):19428–19433, 12 2007.

[19] Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 12 2002.

[20] Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428(6982):493–521, 04 2004.

[21] Gregory M. Cooper, Michael Brudno, Eric A. Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Research, 14(4):539–548, 04 2004.

[22] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 09 1995.

[23] Jennifer Couzin. Breakthrough of the year: Small RNAs Make Big Splash. Science, 298(5602):2296–2297, 2002.

[24] Teresa Creanza, David Horner, Annarita D’Addabbo, Rosalia Maglietta, Flavio Mignone, Nicola Ancona, and Graziano Pesole. Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements. BMC Bioinformatics, 10(Suppl 6):S2, 2009.

[25] Marcel E. Dinger, Ken C. Pang, Tim R. Mercer, and John S. Mattick. Differentiating protein-coding and noncoding RNA: Challenges and ambiguities. PLoS Comput Biol, 4(11):e1000176, 11 2008.

[26] I. Dondoshansky. Blastclust (NCBI software development toolkit), 6.1 edition, 2002.

[27] Sean R. Eddy. Non-coding RNA genes and the modern RNA world. Nat Rev Genet, 2(12):919–929, 12 2001.

[28] Sean R. Eddy and Richard Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079–2088, 06 1994.

[29] Yasser EL-Manzalawy and Vasant Honavar. WLSVM: Integrating LibSVM into Weka Environment, 2005.

[30] Florian Erhard and Ralf Zimmer. Classification of ncRNAs using position and size information in deep sequencing data. Bioinformatics, 26(18):i426–i432, 09 2010.

[31] N. Erho and K. Wiese. An exploration of individual RNA structural elements in RNA gene finding. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2010 IEEE Symposium on, pages 1–9, 2-5 May 2010.

[32] Noah Fahlgren, Miya D. Howell, Kristin D. Kasschau, Elisabeth J. Chapman, Christopher M. Sullivan, Jason S. Cumbie, Scott A. Givan, Theresa F. Law, Sarah R. Grant, Jeffery L. Dangl, and James C. Carrington. High-throughput sequencing of Arabidopsis microRNAs: Evidence for frequent birth and death of MIRNA genes. PLoS ONE, 2(2):e219, 2007.

[33] Alistair R. R. Forrest, Rehab F. Abdelhamid, and Piero Carninci. Annotating non-coding transcription using functional genomics strategies. Briefings in Functional Genomics & Proteomics, 8(6):437–443, 11 2009.

[34] Kelly A. Frazer, Lior Pachter, Alexander Poliakov, Edward M. Rubin, and Inna Dubchak. VISTA: computational tools for comparative genomics. Nucleic Acids Research, 32(suppl 2):W273–W279, 07 2004.

[35] Masaaki Furuno, Ken C Pang, Noriko Ninomiya, Shiro Fukuda, Martin C Frith, Carol Bult, Chikatoshi Kai, Jun Kawai, Piero Carninci, Yoshihide Hayashizaki, John S Mattick, and Harukazu Suzuki. Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding. PLoS Genet, 2(4):e37, 04 2006.

[36] Paul P. Gardner, Jennifer Daub, John G. Tate, Eric P. Nawrocki, Diana L. Kolbe, Stinus Lindgreen, Adam C. Wilkinson, Robert D. Finn, Sam Griffiths-Jones, Sean R. Eddy, and Alex Bateman. Rfam: updates to the RNA families database. Nucleic Acids Research, pages gkn766, 10 2008.

[37] G.B. Golding. Simple sequence is abundant in eukaryotic proteins. PRS, 8(06):1358–1361, 1999.

[38] Sam Griffiths-Jones, Russell J. Grocock, Stijn van Dongen, Alex Bateman, and Anton J. Enright. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(suppl 1):D140–144, 1 2006.

[39] Mitchell Guttman, Ido Amit, Manuel Garber, Courtney French, Michael F. Lin, David Feldser, Maite Huarte, Or Zuk, Bryce W. Carey, John P. Cassady, Moran N. Cabili, Rudolf Jaenisch, Tarjei S. Mikkelsen, Tyler Jacks, Nir Hacohen, Bradley E. Bernstein, Manolis Kellis, Aviv Regev, John L. Rinn, and Eric S. Lander. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458(7235):223–227, 03 2009.

[40] Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander, and Aviv Regev. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotech, 28(5):503–510, 05 2010.

[41] Brian J Haas and Michael C Zody. Advancing RNA-Seq analysis. Nat Biotech, 28(5):421–423, 05 2010.

[42] Michael Hackenberg, Martin Sturm, David Langenberger, Juan Manuel Falcon-Perez, and Ana M. Aransay. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research, 37(suppl 2):W68–W76, 07 2009.

[43] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations Newsletter, 11(1), June 2009.

[44] Ross C. Hardison, John Oeltjen, and Webb Miller. Long human–mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Research, 7(10):959–966, 10 1997.

[45] Artemis G. Hatzigeorgiou, Petko Fiziev, and Martin Reczko. DIANA-EST: a statistical analysis. Bioinformatics, 17(10):913–919, 10 2001.

[46] Shunmin He, Changning Liu, Geir Skogerbo, Haitao Zhao, Jie Wang, Tao Liu, Baoyan Bai, Yi Zhao, and Runsheng Chen. NONCODE v2.0: decoding the non-coding. Nucl. Acids Res., page gkm1011, 2007.

[47] David Hendrix, Michael Levine, and Weiyang Shi. miRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data. Genome Biology, 11(4):R39, 2010.

[48] Michael Hiller, Sven Findeiß, Sandro Lein, Manja Marz, Claudia Nickel, Dominic Rose, Christine Schulz, Rolf Backofen, Sonja J. Prohaska, Gunter Reuter, and Peter F. Stadler. Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Research, 19(7):1289–1300, 07 2009.

[49] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte fur Chemie / Chemical Monthly, 125(2):167–188, 02 1994.

[50] I. L. Hofacker, B. Priwitzer, and P. F. Stadler. Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20(2):186–190, 1 2004.

[51] Ivo L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429–3431, 7 2003.

[52] Yair Horesh, Ydo Wexler, Ilana Lebenthal, Michal Ziv-Ukelson, and Ron Unger. RNAslider: a faster engine for consecutive windows folding and its application to the analysis of genomic folding asymmetry. BMC Bioinformatics, 10(1):76, 2009.

[53] Fan Hsu, W. James Kent, Hiram Clawson, Robert M. Kuhn, Mark Diekhans, and David Haussler. The UCSC Known Genes. Bioinformatics, 22(9):1036–1046, 05 2006.

[54] Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. J. Mach. Learn. Res., 7:85–115, December 2006.

[55] T. J. P. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle, and P. Flicek. Ensembl 2009. Nucleic Acids Research, 37(suppl 1):D690–D697, 01 2009.

[56] A. M. Hughes. Oxford English Dictionary. Isis, 99(3):586, Sep 2008.

[57] D. E. Janes, C. Chapus, Y. Gondo, D. F. Clayton, S. Sinha, C. A. Blatti, C. L. Organ, M. K. Fujita, C. N. Balakrishnan, and S. V. Edwards. Reptiles and mammals have differentially retained long conserved noncoding sequences from the amniote ancestor. Genome Biology and Evolution, 3:102–113, 01 2011.

[58] Hui Jia, Maureen Osak, Gireesh K. Bogu, Lawrence W. Stanton, Rory Johnson, and Leonard Lipovich. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA, 16(8):1478–1487, 08 2010.

[59] Chol-Hee Jung, Martin Hansen, Igor Makunin, Darren Korbie, and John Mattick. Identification of novel non-coding RNAs using profiles of short sequence reads from next generation sequencing data. BMC Genomics, 11(1):77, 2010.

[60] Manolis Kellis, Nick Patterson, Matthew Endrizzi, Bruce Birren, and Eric S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937):241–254, 05 2003.

[61] W. James Kent. BLAT: the BLAST-like alignment tool. Genome Research, 12(4):656–664, 04 2002.

[62] W. James Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom H. Pringle, Alan M. Zahler, and David Haussler. The human genome browser at UCSC. Genome Research, 12(6):996–1006, 06 2002.

[63] Taishin Kin, Kouichirou Yamada, Goro Terai, Hiroaki Okida, Yasuhiko Yoshinari, Yukiteru Ono, Aya Kojima, Yuki Kimura, Takashi Komori, and Kiyoshi Asai. fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Research, 35(suppl 1):D145–148, 1 2007.

[64] Lei Kong, Yong Zhang, Zhi-Qiang Ye, Xiao-Qiao Liu, Shu-Qi Zhao, Liping Wei, and Ge Gao. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Research, 35(suppl 2):W345–349, 7 2007.

[65] Jack Kyte and Russell F. Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1):105–132, 1982.

[66] S. Sai Lakshmi and Shipra Agrawal. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Research, 36(suppl 1):D173–D177, 01 2008.

[67] David Langenberger, Clara Bermudez-Santana, Jana Hertel, Steve Hoffmann, Philipp Khaitovich, and Peter F. Stadler. Evidence for human microRNA-offset RNAs in small RNA sequencing data. Bioinformatics, 25(18):2298–2301, 2009.

[68] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, 11 2007.

[69] Rasko Leinonen, Ruth Akhtar, Ewan Birney, James Bonfield, Lawrence Bower, Matt Corbett, Ying Cheng, Fehmi Demiralp, Nadeem Faruque, Neil Goodgame, Richard Gibson, Gemma Hoad, Christopher Hunter, Mikyung Jang, Steven Leonard, Quan Lin, Rodrigo Lopez, Michael Maguire, Hamish McWilliam, Sheila Plaister, Rajesh Radhakrishnan, Siamak Sobhany, Guy Slater, Petra Ten Hoopen, Franck Valentin, Robert Vaughan, Vadim Zalunin, Daniel Zerbino, and Guy Cochrane. Improvements to services at the European Nucleotide Archive. Nucleic Acids Research, 38(suppl 1):D39–D45, 01 2010.

[70] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 07 2009.

[71] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 08 2009.

[72] Jiong-Tang Li, Yong Zhang, Lei Kong, Qing-Rong Liu, and Liping Wei. Trans-natural antisense transcripts including noncoding RNAs in 10 species: implications for expression regulation. Nucleic Acids Research, 36(15):4833–4844, 09 2008.

[73] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 7 2006.

[74] Jinfeng Liu, Julian Gough, and Burkhard Rost. Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS Genet, 2(4):e29, 04 2006.

[75] G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288(5463):136–140, 04 2000.

[76] C. Lottaz, C. Iseli, C. V. Jongeneel, and P. Bucher. Modeling sequencing errors by combining Hidden Markov models. Bioinformatics, 19(suppl 2):ii103–112, 9 2003.

[77] Zhi John Lu, Kevin Y. Yip, Guilin Wang, Chong Shou, LaDeana W. Hillier, Ekta Khurana, Ashish Agarwal, Raymond Auerbach, Joel Rozowsky, Chao Cheng, Masaomi Kato, David M. Miller, Frank Slack, Michael Snyder, Robert H. Waterson, Valerie Reinke, and Mark Gerstein. Prediction and characterization of non-coding RNAs in C. elegans by integrating conservation, secondary structure and high throughput sequencing and array data. Genome Research, 10.1101/gr.110189.110, December 2010.

[78] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. J Comput Biol, 7(3-4):409–427, 2000.

[79] Ariane Machado-Lima, Hernando del Portillo, and Alan Durham. Computational methods in noncoding RNA research. Journal of Mathematical Biology, 56(1):15–49, 01 2008.

[80] J.R. Manak, S. Dike, V. Sementchenko, P. Kapranov, F. Biemar, J. Long, J. Cheng, I. Bell, S. Ghosh, A. Piccolboni, and T.R. Gingeras. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 06 2007.

[81] Samuel Marguerat, Brian T. Wilhelm, and Jurg Bahler. Next-generation sequencing: applications beyond genomes. Biochemical Society Transactions, 36(Pt 5):1091–1096, October 2008.

[82] Elliott H. Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, and Eric D. Green. Identification and characterization of multi-species conserved sequences. Genome Research, 13(12):2507–2518, 12 2003.

[83] Anthony Mathelier and Alessandra Carbone. MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics, 26(18):2226–2234, 09 2010.

[84] Pedro P. Medina, Mona Nolde, and Frank J. Slack. OncomiR addiction in an in vivo model of microRNA-21-induced pre-B-cell lymphoma. Nature, 467(7311):86–90, 09 2010.

[85] Tim R. Mercer, Marcel E. Dinger, and John S. Mattick. Long non-coding RNAs: insights into functions. Nat Rev Genet, 10(3):155–159, 03 2009.

[86] Michael L. Metzker. Sequencing technologies – the next generation. Nat Rev Genet, 11(1):31–46, 01 2010.

[87] Flavio Mignone, Anna Anselmo, Giacinto Donvito, Giorgio Maggi, Giorgio Grillo, and Graziano Pesole. Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes. BMC Genomics, 9(1):277, 2008.

[88] Tarjei S. Mikkelsen, Manching Ku, David B. Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, William Brockman, Tae-Kyung Kim, Richard P. Koche, William Lee, Eric Mendenhall, Aisling O’Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf Jaenisch, Chad Nusbaum, Eric S. Lander, and Bradley E. Bernstein. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, 08 2007.

[89] The modENCODE Consortium, Sushmita Roy, Jason Ernst, Peter V. Kharchenko, Pouya Kheradpour, Nicolas Negre, Matthew L. Eaton, Jane M. Landolin, Christopher A. Bristow, Lijia Ma, Michael F. Lin, Stefan Washietl, Bradley I. Arshinoff, Ferhat Ay, Patrick E. Meyer, Nicolas Robine, Nicole L. Washington, Luisa Di Stefano, Eugene Berezikov, Christopher D. Brown, Rogerio Candeias, Joseph W. Carlson, Adrian Carr, Irwin Jungreis, Daniel Marbach, Rachel Sealfon, Michael Y. Tolstorukov, Sebastian Will, Artyom A. Alekseyenko, Carlo Artieri, Benjamin W. Booth, Angela N. Brooks, Qi Dai, Carrie A. Davis, Michael O. Duff, Xin Feng, Andrey A. Gorchakov, Tingting Gu, Jorja G. Henikoff, Philipp Kapranov, Renhua Li, Heather K. MacAlpine, John Malone, Aki Minoda, Jared Nordman, Katsutomo Okamura, Marc Perry, Sara K. Powell, Nicole C. Riddle, Akiko Sakai, Anastasia Samsonova, Jeremy E. Sandler, Yuri B. Schwartz, Noa Sher, Rebecca Spokony, David Sturgill, Marijke van Baren, Kenneth H. Wan, Li Yang, Charles Yu, Elise Feingold, Peter Good, Mark Guyer, Rebecca Lowdon, Kami Ahmad, Justen Andrews, Bonnie Berger, Steven E. Brenner, Michael R. Brent, Lucy Cherbas, Sarah C. R. Elgin, Thomas R. Gingeras, Robert Grossman, Roger A. Hoskins, Thomas C. Kaufman, William Kent, Mitzi I. Kuroda, Terry Orr-Weaver, Norbert Perrimon, Vincenzo Pirrotta, James W. Posakony, Bing Ren, Steven Russell, Peter Cherbas, Brenton R. Graveley, Suzanna Lewis, Gos Micklem, Brian Oliver, Peter J. Park, Susan E. Celniker, Steven Henikoff, Gary H. Karpen, Eric C. Lai, David M. MacAlpine, Lincoln D. Stein, Kevin P. White, and Manolis Kellis. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330(6012):1787–1797, 12 2010.

[90] Ryan D. Morin, Matthew Bainbridge, Anthony Fejes, Martin Hirst, Martin Krzywinski, Trevor J. Pugh, Helen McDonald, Richard Varhol, Steven J.M. Jones, and Marco A. Marra. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques, 45(1):81–94, July 2008.

[91] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth, 5(7):621–628, 07 2008.

[92] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, 320(5881):1344–1349, 06 2008.

[93] Marcelo A. Nobrega, Yiwen Zhu, Ingrid Plajzer-Frick, Veena Afzal, and Edward M. Rubin. Megabase deletions of gene deserts result in viable mice. Nature, 431(7011):988–993, 10 2004.

[94] Kirt Noel. Examining stem-loops as a sequence signal for identifying structural RNA genes. Master’s thesis, Simon Fraser University, April 2005.

[95] Karl J. V. Nordstrom, Majd A. I. Mirza, Markus Sallman Almen, David E. Gloriam, Robert Fredriksson, and Helgi B. Schloth. Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics, 94(3):169–176, 9 2009.

[96] David L. Olson and Dursun Delen. Advanced Data Mining Techniques. Springer Publishing Company, Incorporated, 1st edition, 2008.

[97] Ulf Andersson Ørom, Thomas Derrien, Malte Beringer, Kiranmai Gumireddy, Alessandro Gardini, Giovanni Bussotti, Fan Lai, Matthias Zytnicki, Cedric Notredame, Qihong Huang, Roderic Guigo, and Ramin Shiekhattar. Long noncoding RNAs with enhancer-like function in human cells. Cell, 143(1):46–58, 10 2010.

[98] Ken C. Pang, Martin C. Frith, and John S. Mattick. Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends in Genetics, 22(1):1–5, 1 2006.

[99] Ken C. Pang, Stuart Stephen, Marcel E. Dinger, Par G. Engstrom, Boris Lenhard, and John S. Mattick. RNAdb 2.0–an expanded database of mammalian non-coding RNAs. Nucleic Acids Research, 35(suppl 1):D178–182, 1 2007.

[100] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748–9753, 08 2001.

[101] Elisabetta Pizzi and Clara Frontali. Low-complexity regions in Plasmodium falciparum proteins. Genome Research, 11:218–229, 2001.

[102] Vasilis J. Promponas, Anton J. Enright, Sophia Tsoka, David P. Kreil, Christophe Leroy, Stavros Hamodrakas, Chris Sander, and Christos A. Ouzounis. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16(10):915–922, 10 2000.

[103] Kim D. Pruitt, Tatiana Tatusova, and Donna R. Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, pages gkl842, 11 2006.

[104] Matteo Re, Graziano Pesole, and David Horner. Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics. BMC Bioinformatics, 10(1):282, 2009.

[105] Brooke Rhead, Donna Karolchik, Robert M. Kuhn, Angie S. Hinrichs, Ann S. Zweig,Pauline A. Fujita, Mark Diekhans, Kayla E. Smith, Kate R. Rosenbloom, Brian J.Raney, Andy Pohl, Michael Pheasant, Laurence R. Meyer, Katrina Learned, Fan Hsu,Jennifer Hillman-Jackson, Rachel A. Harte, Belinda Giardine, Timothy R. Dreszer,Hiram Clawson, Galt P. Barber, David Haussler, and W. James Kent. The UCSCGenome Browser database: update 2010. Nucleic Acids Research, 38(suppl 1):D613–D619, 01 2010.

[106] Peter Rice, Ian Longden, and Alan Bleasby. EMBOSS: The European MolecularBiology Open Software Suite. Trends in Genetics, 16(6):276 – 277, 2000.

[107] E. Rivas and S. R. Eddy. Noncoding RNA gene detection using comparative sequenceanalysis. BMC bioinformatics, 2(1):8+, 2001.

[108] A. Gordon Robertson, Mikhail Bilenky, Angela Tam, Yongjun Zhao, Thomas Zeng,Nina Thiessen, Timothee Cezard, Anthony P. Fejes, Elizabeth D. Wederell, RebeccaCullum, Ghia Euskirchen, Martin Krzywinski, Inanc Birol, Michael Snyder, Pamela A.Hoodless, Martin Hirst, Marco A. Marra, and Steven J. M. Jones. Genome-widerelationship between histone H3 lysine 4 mono- and tri-methylation and transcriptionfactor binding. Genome Research, 18(12):1906–1917, 12 2008.

[109] Gordon Robertson, Martin Hirst, Matthew Bainbridge, Misha Bilenky, Yongjun Zhao,Thomas Zeng, Ghia Euskirchen, Bridget Bernier, Richard Varhol, Allen Delaney, NinaThiessen, Obi L Griffith, Ann He, Marco Marra, Michael Snyder, and Steven Jones.Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipita-tion and massively parallel sequencing. Nat Meth, 4(8):651–657, 08 2007.

[110] Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, MatthewField, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny QQian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron SButterfield, Richard Newsome, Simon K Chan, Rong She, Richard Varhol, BaljitKamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard A Moore, MartinHirst, Marco A Marra, Steven J M Jones, Pamela A Hoodless, and Inanc Birol.De novo assembly and analysis of RNA-seq data. Nature Methods, advance onlinepublication, October 2010.

[111] Brid M. Ryan, Ana I. Robles, and Curtis C. Harris. Genetic variation in microRNAnetworks: the implications for cancer research. Nat Rev Cancer, 10(6):389–402, 062010.

[112] R. Salari, C. Aksay, E. Karakoc, P. J. Unrau, I. Hajirasouliha, S. C. Sahinalp, and S. Maas. smyRNA: A Novel Ab Initio ncRNA Gene Finder. PLoS ONE, 4:5433, May 2009.

[113] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 12 1977.

[114] Kengo Sato, Michiaki Hamada, Kiyoshi Asai, and Toutai Mituyama. CentroidFold: a web server for RNA secondary structure prediction. Nucleic Acids Research, 37(suppl 2):W277–W280, 07 2009.

[115] Bruce A Shapiro, Yaroslava G Yingling, Wojciech Kasprzak, and Eckart Bindewald. Bridging the gap in RNA structure prediction. Current Opinion in Structural Biology, 17(2):157–165, 2007. Theory and simulation / Macromolecular assemblages.

[116] Kana Shimizu, Jun Adachi, and Yoichi Muraoka. ANGLE: a sequencing errors resistant program for predicting protein coding regions in unfinished cDNA. Journal of Bioinformatics and Computational Biology, 4(3):649–64, June 2006.

[117] Christian Honer zu Siederdissen and Ivo L. Hofacker. Discriminatory power of RNA family models. Bioinformatics, 26(18):i453–i459, 09 2010.

[118] Adam Siepel, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W. Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson, Richard A. Gibbs, W. James Kent, Webb Miller, and David Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8):1034–1050, 08 2005.

[119] Tulio C. Silva, Pedro A. Berger, Roberto T. Arrial, Roberto C. Togawa, Marcelo M. Brigido, and Maria Emilia M. T. Walter. SOM-PORTRAIT: Identifying Non-coding RNAs Using Self-Organizing Maps, volume 5676/2009 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009.

[120] Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones, and Inanc Birol. ABySS: A parallel assembler for short read sequence data. Genome Research, 19:1117–1123, February 2009.

[121] G.S.C. Slater. Algorithms for the Analysis of Expressed Sequence Tags. PhD thesis, University of Cambridge, Cambridge, 2000.

[122] Tomasz Smolinski, Mariofanna Milanova, Aboul-Ella Hassanien, Kirt Noel, and Kay Wiese. Considering Stem-Loops as Sequence Signals for Finding Ribosomal RNA Genes, volume 151, pages 337–357. Springer Berlin / Heidelberg, 2008.

[123] MJ Solomon, PL Larsen, and A Varshavsky. Mapping protein-DNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell, 53(6):937–947, 06 1988.

[124] Jason E. Stajich, David Block, Kris Boulez, Steven E. Brenner, Stephen A. Chervitz, Chris Dagdigian, Georg Fuellen, James G.R. Gilbert, Ian Korf, Hilmar Lapp, Heikki Lehvaslaiho, Chad Matsalla, Chris J. Mungall, Brian I. Osborne, Matthew R. Pocock, Peter Schattner, Martin Senger, Lincoln D. Stein, Elia Stupka, Mark D. Wilkinson, and Ewan Birney. The Bioperl toolkit: Perl modules for the life sciences. Genome Research, 12:1611–1618, 2002.

[125] The FANTOM Consortium. The transcriptional landscape of the mammalian genome. Science, 309(5740):1559–1563, 9 2005.

[126] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech, 28(5):511–515, 05 2010.

[127] Huei-Hun H. Tseng, Zasha Weinberg, Jeremy Gore, Ronald R. Breaker, and Walter L. Ruzzo. Finding non-coding RNAs through genome-scale clustering. Journal of Bioinformatics and Computational Biology, 7(2):373–388, April 2009.

[128] Andrew Uzilov, Joshua Keegan, and David Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(1):173, 2006.

[129] Bjorn Voß. Structural analysis of aligned RNAs. Nucleic Acids Research, 34(19):5471–5481, 2006.

[130] Bjorn Voß, Jens Georg, Verena Schon, Susanne Ude, and Wolfgang Hess. Biocomputational prediction of non-coding RNAs in model cyanobacteria. BMC Genomics, 10(1):123, 2009.

[131] Jiayi Wang, Xiangfan Liu, Huacheng Wu, Peihua Ni, Zhidong Gu, Yongxia Qiao, Ning Chen, Fenyong Sun, and Qishi Fan. CREB up-regulates long non-coding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Research, 38(16):5366–5383, 09 2010.

[132] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, 01 2009.

[133] Stefan Washietl, Ivo L. Hofacker, and Peter F. Stadler. Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102(7):2454–2459, 2005.

[134] Zasha Weinberg, Jonathan Perreault, Michelle M. Meyer, and Ronald R. Breaker. Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis. Nature, 462(7273):656–659, 12 2009.

[135] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.

[136] Adam Woolfe, Martin Goodson, Debbie K Goode, Phil Snell, Gayle K McEwen, Tanya Vavouri, Sarah F Smith, Phil North, Heather Callaway, Krys Kelly, Klaudia Walter, Irina Abnizova, Walter Gilks, Yvonne J. K Edwards, Julie E Cooke, and Greg Elgar. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol, 3(1):e7, 11 2004.

[137] Jing Wu. Testing the coding potential of conserved short genomic sequences. Advances in Bioinformatics, Article ID 287070, 8 pages, 2010.

[138] Jun Xie, Ming Zhang, Tao Zhou, Xia Hua, LiSha Tang, and Weilin Wu. scaRNAbase: a curated database for small nucleolar RNAs and cajal body-specific RNAs. Nucleic Acids Research, 35(suppl 1):D183–D187, 2006.

[139] Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6(1):310, 2005.

[140] Zizhen Yao, Zasha Weinberg, and Walter L. Ruzzo. CMfinder: a covariance model based RNA motif finding algorithm. Bioinformatics, 22(4):445–452, 2006.

[141] Ying Zhang, Dao-Gang Guan, Jian-Hua Yang, Peng Shao, Hui Zhou, and Liang-Hu Qu. ncRNAimprint: A comprehensive database of mammalian imprinted noncoding RNAs. RNA, pages –, 08 2010.

[142] Michael Zuker and David Sankoff. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46(4):591–621, 07 1984.

[143] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1 1981.
