CLASSIFICATION OF CODING AND NON-CODING
RNA IN RNA-SEQ DATA
by
Hisanaga Mark Okada
B.Sc., Simon Fraser University, 2008
a Thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science
in the School
of
Computing Science
© Hisanaga Mark Okada 2011
SIMON FRASER UNIVERSITY
Spring 2011
All rights reserved. However, in accordance with the Copyright Act of
Canada, this work may be reproduced without authorization under the
conditions for Fair Dealing. Therefore, limited reproduction of this
work for the purposes of private study, research, criticism, review and
news reporting is likely to be in accordance with the law, particularly
if cited appropriately.
APPROVAL
Name: Hisanaga Mark Okada
Degree: Master of Science
Title of Thesis: Classification of coding and non-coding RNA in RNA-Seq
data
Examining Committee: Dr. Anoop Sarkar
Associate Professor, Computing Science
Simon Fraser University
Chair
Dr. Martin Ester
Professor, Computing Science
Simon Fraser University
Senior Supervisor
Dr. Cenk Sahinalp
Professor, Computing Science
Simon Fraser University
Supervisor
Dr. Kay Wiese
Associate Professor, Computing Science
Simon Fraser University
Examiner
Date Approved: February 28, 2011
Last revision: Spring 09
Declaration of Partial Copyright Licence The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the “Institutional Repository” link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author’s written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
While licensing SFU to permit the above uses, the author retains copyright in the thesis, project or extended essays, including the right to change the work for subsequent purposes, including editing and publishing the work in whole or in part, and licensing other parties, as the author may desire.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Abstract
Recently, the coverage of non-protein-coding RNA in the scientific literature has expanded
dramatically. While the functions for many are unknown, strong interest in this aspect
of cellular biology is driving development of methods for detecting non-coding genes and
transcripts.
During the same period, RNA sequencing's high throughput and high spatial resolution
have established it as the preferred method for characterising transcriptomes. Many groups
are now sequencing transcriptomes. De novo transcriptome assembly methods are being
developed to address cases for which no reference genome is available.
We propose a methodology that is compatible with de novo transcriptome assembly,
that uses sequence, structural and genomic features to classify transcripts as non-coding vs.
protein-coding RNA, and to classify different non-coding RNA types. We have applied our
technique on a variety of known RNA sequences and have explored its use on contigs from
the Trans-ABySS assembly pipeline for RNA-Seq data from normal mouse tissues.
To family and friends
“As iron sharpens iron, so one man sharpens another”
— Proverbs 27:17
Acknowledgments
I wish to express my deepest gratitude to the many individuals whose support and assistance
made the work described in this thesis possible.
I thank my senior supervisor, Martin Ester, for giving me the academic and personal
guidance I needed. I am grateful for his patience and encouragement during the entire
length of my research. I wish to also thank the members of my committee for their invaluable
counsel. I wish to thank the members of the Data Mining Lab, and the Computing Science
Department at Simon Fraser University for providing the environment I needed to perform
this research. Thanks especially to Phuong Dao for his expertise in countless matters.
This work was possible because of our collaboration with the Michael Smith Genome
Sciences Centre (GSC). I thank Inanc Birol, Jacqueline Schein, Pamela Hoodless and especially
Gordon Robertson for providing me with so many opportunities, and for going above
and beyond their supervisory roles. I gratefully acknowledge the GSC for making available the
seven mouse transcriptome datasets generated in the Genome Canada MORGEN project. I
would like to acknowledge in particular: Sam Lee for generating the RNA reagents; Yongjun
Zhao, who manages the library construction teams; Nina Thiessen and An He, who apply the
GSC's production WTSS pipeline; and Shaun Jackman, Readman Chiu, Rong She, Jenny
Qian and Karen Mungall for de novo contig data from ABySS and Trans-ABySS.
This work was funded by the Canadian Institute of Health Research / Michael Smith
Foundation for Health Research Bioinformatics Training Program. I am extremely grateful
that they have provided such a supportive community for bioinformatics research. I wish
to acknowledge in particular Marco Marra, Steve Jones and Sharon Ruschkowski.
Lastly, I wish to thank my family and friends for their unconditional love and support.
As chaotic as it seemed at times, they kept me grounded. Thanks to them, I will always
look back fondly at this time.
Contents
Approval
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures
1 Introduction
1.1 Significance of non-coding RNA classification
1.2 Contributions
1.3 How this thesis is organised
2 Biological background
2.1 Second generation sequencing and transcriptomics
2.2 Central dogma of molecular biology
2.3 Non-coding RNA
3 Related work
3.1 Discovery of non-coding RNAs
3.1.1 Sequence based approaches
3.1.2 Secondary structure based approaches
3.1.3 Comparative Genomics based approaches
3.1.4 Genome scanning / mapping approaches
3.2 RNA databases
4 Classification
4.1 Preprocessing reads
4.1.1 Assembly
4.1.2 Mapping to RNA database
4.2 Feature extraction
4.2.1 Sequence based features
4.2.2 Secondary structure based features
4.2.3 Genomic map based features
4.3 Classification
4.3.1 Support vector machines
4.3.2 Performance evaluation
4.3.3 Cross validation evaluation
4.3.4 Full contig prediction
4.3.5 Feature set ranking
5 Implementation
5.1 Feature extraction
5.1.1 Sequence based feature extraction
5.1.2 Secondary structure feature extraction
5.1.3 Genomic map based feature extraction
5.2 Classification
5.2.1 Support vector machine
6 Experimental results
6.1 Coding and non-coding databases
6.1.1 EMBL and Swissprot vs. non-coding
6.1.2 Ensembl protein coding vs. non-coding
6.1.3 Ensembl vs. fRNAdb
6.2 The RNA-Seq dataset
6.2.1 Contig preparation
6.2.2 Transcriptome reads mapped to the genome
6.2.3 Contig assembly and merging
6.2.4 Contig to annotation mapping
6.2.5 Contig cross validation
6.2.6 Full contig set classification
6.3 Feature ranking
7 Conclusion and future work
7.1 Summary
7.2 Future work
Bibliography
List of Tables
4.1 Features available from the prediction model. Sequence and secondary structure based features make up the de novo set of features. The concepts of the features are described in section 4.2, and the implementation in section 5.1.
4.2 Confusion matrix (or coincidence matrix) for a two-class classification problem. The correct predictions, true positives and true negatives, are shaded while the erroneous predictions, false positives and false negatives, are not.
6.1 SSGC performance compared with PORTRAIT for the dataset composed of Swiss-prot and EMBL for the protein coding set, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.
6.2 SSGC performance compared with PORTRAIT for the dataset composed of Ensembl protein coding, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.
6.3 Binary classification performance between Ensembl protein coding and all fRNAdb non-coding sequences. The first row represents the experiment where all features are used. The second row represents the experiment where only the de novo features were used.
6.4 Pairwise classification performance between Ensembl protein coding elements vs. each RNA type found in fRNAdb. The first half represents the results where all features are used. The second half represents the results where only de novo features were used, thereby excluding genome mapped information such as the number of exons and cross-species conservation scores.
6.5 Pairwise classification performance using the complete feature set for fRNAdb non-coding RNA. Precision and recall are only shown for the second class.
6.6 Pairwise classification performance using the de novo feature set for fRNAdb non-coding RNA, similar to Table 6.5. Precision and recall are only shown for the second class.
6.7 Confusion matrix for the multiclass classification using fRNAdb RNA types, using the entire feature set. The cells represent the number of predictions for each type; the shaded cells represent the number of true positives. Each RNA type is labelled from a to i, representing in order: fly-smallRNA, mat-miRNA, misc, piRNA, pre-miRNA, rRNA, snoRNA, snRNA and tRNA.
6.8 Six seven-lane RNA-Seq mouse libraries were examined.
6.9 Six seven-lane RNA-Seq libraries were assembled and merged to create the contig sets. These contigs were used as input for the classifier.
6.10 Classification performance using the contigs from the library MM0564, using the full feature set. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to.
6.11 Classification performance for the stratified contigs from library MM0564, using the full feature set. In comparison to Table 6.10, the number of elements in each class is equal. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to. Note that for thresholds at 1.0, there are not enough elements to perform classification.
6.12 Classification performance for the database sequences mapped by the unfiltered contig sets from MM0564; each classification is compared with PORTRAIT. Precision and recall are only shown for the non-coding class. We were not able to compare the classification accuracies for the actual contig sets themselves. Note the number of elements is lower for PORTRAIT due to the size restrictions for their input.
6.13 The top twenty ranked features based on classification effectiveness from the Ensembl and fRNAdb datasets. The first pair of columns lists the most effective features from binary class experiments, coding versus non-coding. The second pair of columns lists the features for the multiclass considering RNA types and proteins. The last pair of columns is from the multiclass using only RNA types. Both the complete feature set and the de novo feature sets are considered in each of the three experiment types.
6.14 Classification performance using, incrementally, the top twenty ranked features from the Ensembl and fRNAdb datasets, for the binary classifier. As more features are added, there is a steady rise in the accuracy, precision and recall. The full model containing all features has an accuracy of 96.3%, precision of 0.966, and recall of 0.976 as shown in Table 6.3.
List of Figures
1.1 A top level structure of our approach from the short read sequence down to the classification of RNA transcripts. We are interested both in using reads and contigs as part of the input and in the potential to classify different non-coding RNA families.
2.1 The Central Dogma of molecular biology. On the left are the typical transcription and translation steps for a given gene. The end product is a translated amino acid sequence that eventually forms a protein. On the right is the transcription of a non-coding RNA, the 3-D structure consisting of its secondary structure.¹
4.1 Overview of the contig assembly and labelling procedure. From short transcriptome reads, contigs are assembled and merged. Contigs are mapped individually to protein coding and non-coding RNA datasets. Contigs inherit the labels of the database elements with the best matched mapping score, which must be above a set threshold. For each mapping score, there are two threshold values, one for the contig and one for the annotation. The labelled contigs are used as training and testing sequences for the classifier.
4.2 The classification approach starting from the sequence reads down to the testing of RNA transcripts. We propose a classifier that draws on three categories of features based on sequence, secondary structure, and genome mapped data, which we name the Sequence-Structure-Genome Classifier (SSGC). For de novo experiments, we only consider sequence and secondary structure based features.
4.3 Contig prediction procedure for the full contig set. A subset of contigs mapped to protein coding and non-coding sequences from Ensembl and fRNAdb, respectively, is used to train an SVM model. The SVM model is used to classify the entire contig set, predicting the class and p-value for each contig. The p-value allows the contigs to be ranked, from strongly protein coding (0) to non-coding (1).
6.1 Read coverage for Ensembl broken down by biotype, for RNA-Seq reads from library MM0564. Each biotype is represented as an ECDF and as a distribution of log10 read coverage.
6.2 Empirical cumulative distribution function representing the read coverage for a select number of Ensembl biotypes mapped to the mm9 reference genome from Figure 6.1.
6.3 Number of unique contigs that map to the sequence annotation databases fRNAdb and Ensembl using a range of mapping thresholds for all six mouse libraries. (a) and (c) represent the filtered contig set mappings, (b) and (d) represent the unfiltered contig set mappings.
6.4 Ensembl transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual biotypes.
6.5 fRNAdb transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual RNA types.
6.6 The full MM0564 contig set is classified by the SVM model, and each contig is assigned a probability. Contigs with p-values below 0.5 are classified as protein coding, while contigs with p-values above 0.5 are classified as non-coding. (a) is the class prediction for all contigs. (b) is the p-value distribution of all the contigs. (c) is the p-value of contigs with no alignments to any known non-coding transcripts. (d) is the p-value for all contigs 500bp and larger.
6.7 Mapping scores and sizes of contigs strongly predicted as protein coding (p-value ≤ 0.05) and non-coding (p-value ≥ 0.95). a,b) Distribution of mapping scores with the best-aligned a) protein-coding Ensembl sequence, b) non-coding fRNAdb sequence. c,d) Distribution of contig sizes (white). In (c), the red regions represent strongly protein coding contigs (p-value ≤ 0.05) which do not map to any known sequences in Ensembl or fRNAdb. In (d), the orange regions represent strongly non-coding contigs (p-value ≥ 0.95) which do not map to any known sequences.
6.8 Contig k50:177614 aligned in the mouse mm9 genome. The top track represents the multiple contigs that are mapped to this location. The second set of tracks are the pileups for the RNA-Seq read alignments for the six mouse transcriptome libraries. Below the contig track is the gene track and the conservation track. This contig has a p-value of 1.0 and does not map to any known non-coding or protein coding sequences.
6.9 Contig k29:3267973 aligned in the mouse mm9 genome. Similar to Figure 6.8, the tracks represent the assembled contigs, RNA-Seq read pileups, the contig, known gene annotations, and conservation. This contig has a p-value of 0 and does not map to any known non-coding or protein coding sequences.
6.10 Contig k29:3267973 (from Figure 6.9) represented in the human hg18 genome, using the LiftOver tool from the UCSC Genome Browser [62]. The tracks represent the contig coordinate (from the LiftOver), the contig BLAT alignment, known human gene models, histone modification tracks, and the conservation.
Chapter 1
Introduction
1.1 Significance of non-coding RNA classification
The Central Dogma of molecular biology states that the flow of genetic information is from
deoxyribonucleic acid (DNA) to ribonucleic acid (RNA), and from RNA to protein [1]. Although
exceptions to this rule were known, for example transfer RNA and ribosomal RNA, diverse
types of non-coding RNA, i.e. transcribed RNA elements that do not code for proteins [85],
are increasingly recognised as widespread and functionally important. Non-coding RNA may
help resolve what has puzzled many researchers since the initial discoveries in genomics:
the number of genes is not significantly different between species, and both protein coding and
non-coding RNAs appear to be important in the variability between species [5].
The importance of the non-coding transcriptome has motivated work to develop methods
of identifying and classifying non-coding genes and transcripts. There are three approaches
in the literature. The first uses a set of computationally inexpensive features that can be
mined for patterns that distinguish sequences as coding or non-coding genes; many methods
use variations of clustering or a support vector machine (SVM). Widely used features include
GC content, contig length, open reading frame (ORF) length, stop codon quantity, nucleic
acid composition, amino acid composition, and protein complexity. These sequence-based
features are properties calculated from the sequence at hand. Second, because non-coding
RNA transcripts have functional secondary structures, a number of approaches assume that
the secondary structures predicted for a sequence can be used to calculate the likelihood
of the transcript being non-coding. Examples include RNApfold and RNAz, both of which
are a part of or rely on the Vienna RNA package [49, 50]. As structural features are
more computationally expensive, attempts have emerged that use sliding windows. Finally,
a number of approaches use genomically-mapped evidence like transcript and expressed
sequence tag (EST) alignments, chromatin profiles and evolutionarily conserved regions.
For this project, we used mapped RNA-Seq reads, mapped de novo contigs, chromatin
profiles, and conservation data.
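As an illustration of the first, sequence-based category, the following minimal sketch computes a few of the features named above (length, GC content, longest ORF length and stop codon count) directly from a nucleotide sequence. It is not the feature extractor used in this thesis: the function names are hypothetical, only the three forward reading frames are scanned, and an ORF is assumed to run from an ATG to the first in-frame stop codon.

```python
# Illustrative sketch only; names and simplifications are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF in the three
    forward frames, counting the stop codon. Returns 0 if none is found."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def stop_codon_count(seq, frame=0):
    """Number of stop codons read in the given forward frame."""
    seq = seq.upper()
    return sum(1 for i in range(frame, len(seq) - 2, 3)
               if seq[i:i + 3] in STOP_CODONS)


def sequence_features(seq):
    """Collect the simple sequence-based features into one record."""
    return {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "longest_orf": longest_orf_length(seq),
        "stop_codons": stop_codon_count(seq),
    }
```

For example, `sequence_features("ATGAAATGA")` reports a length of 9, a longest ORF of 9 nucleotides and one stop codon in frame 0. A fuller treatment would also scan the reverse complement and add composition features.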
RNA-Seq, based on second generation deep sequencing technologies, is an effective tool
for quantifying the expression levels of the transcriptome using short sequence reads originating
from fragmented transcripts [132]. Although RNA-Seq has been primarily used to
detect transcripts of protein coding RNA, the technology has increasingly been applied
to detect non-coding RNAs [32, 67, 59].
For this thesis, we introduce the Sequence-Structure-Genome Classifier (SSGC). Using
SSGC, we investigate the transcript classification problem using short-read sequencing data.
Existing studies on non-coding RNAs using RNA-Seq have relied on mapping reads to a reference
genome; we investigate the classification problem using contigs from a non-reference
based approach, de novo transcriptome assembly. By introducing assembly to
non-coding RNA classification, we make it possible to work in de novo settings. In our
investigation, we also take into consideration the large sizes and noisy nature of these datasets.
We demonstrate the effectiveness of the various feature sets under an assortment of test
conditions.
The major steps in building and running SSGC are:
1. Create contigs
- RNA-Seq assembly
2. Build classifier
- label contigs as protein coding or non-coding RNAs
- train SVM model using labelled contigs
3. Run contigs on classifier
- include genomically mapped evidence as attributes
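The labelling step in the list above can be sketched as follows. This is a simplified illustration rather than the SSGC implementation: the `(contig_id, label, score)` tuples, the function name and the single score threshold are all assumptions made for the example (the procedure described later uses separate thresholds for the contig and the annotation). Each contig inherits the label of its best-scoring database hit, and contigs with no hit at or above the threshold stay unlabelled.

```python
# Illustrative sketch only; the hit format and threshold are hypothetical.
def label_contigs(hits, threshold=0.9):
    """Assign each contig the label of its best-scoring database hit.

    hits: iterable of (contig_id, label, score) tuples, where label is
          e.g. "coding" or "non-coding" and score is a mapping score.
    Returns a dict mapping contig_id to label for contigs whose best hit
    scores at or above the threshold.
    """
    best = {}
    for contig_id, label, score in hits:
        if score < threshold:
            continue  # hit too weak to transfer a label
        if contig_id not in best or score > best[contig_id][1]:
            best[contig_id] = (label, score)
    return {contig_id: label for contig_id, (label, _) in best.items()}
```

The labelled contigs would then serve as training and testing examples for the SVM, while unlabelled contigs remain candidates for prediction.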
[Figure 1.1 diagram labels: Reads; Contigs; Protein coding genes; Non-coding RNAs (miRNA, pre-miRNA, tRNA, piRNA, lincRNA, lncRNA, rRNA, snoRNA)]
Figure 1.1: A top level structure of our approach from the short read sequence down to the classification of RNA transcripts. We are interested both in using reads and contigs as part of the input and in the potential to classify different non-coding RNA families.
1.2 Contributions
In this thesis, we extend existing work in which transcript sequences from public databases
were classified into two groups, i.e. protein coding vs non-coding. First, we extend the
classification to discriminate between non-coding RNA families (Figure 1.1). Then, we
apply the classifier to RNA-Seq data, and to de novo transcriptome assembly that uses such
short-read data to generate contigs [110]. De novo assembly can be used with non-model
species for which a reference genome is not available, and can detect chimeric transcripts
that are not represented by a reference genome's gene models but can be important in disease
(ref). We show that non-coding RNA family types can be identified in RNA-Seq data, and in
de novo transcriptome contigs. We outline potential constraints, related to expression level
and sequencing depth, in comprehensively characterising non-coding RNA in sequence data.
The software developed for this thesis is available for use with high-throughput RNA-Seq
and de novo transcriptome assembly pipelines.
1.3 How this thesis is organised
The first three chapters give background material: Chapter 2 briefly summarises the bio-
logical concepts, and Chapter 3 summarises published work related to the thesis. The next
two chapters describe the classifier: Chapter 4 explains concepts, and Chapter 5 provides
details on the tools and methods used. Chapter 6 explains the results of using the classi-
fier on database sequences and de novo transcriptome contigs from real biological samples.
Chapter 7 concludes with final remarks and possible future directions.
Chapter 2
Biological background
Bioinformatics is an interdisciplinary field, and a wide variety of topics are covered in this
thesis. This chapter acts as a primer on the biological terms and concepts that are used in
this thesis.
2.1 Second generation sequencing and transcriptomics
DNA sequencing has existed since the beginning of molecular biology. The Sanger method [113]
is the well known and revolutionary first generation technology based on dideoxy chain termination;
first generation technology has been used to read sequence lengths on the order
of several hundred base-pairs. Second generation sequencing technologies emerged decades
later, towards the end of the first human genome project. The dominant platforms, Illu-
mina, Roche 454, and ABI SOLiD, have high throughput but generate shorter sequence
reads [86].
Transcription is the synthesis of RNA from ribonucleotides using polymerase and a DNA sequence
as the template. Transcriptome studies have been an important part of molecular
biology and bioinformatics research as expressed RNA is often a precursor for protein synthesis [1].
RNA-Seq, or whole transcriptome shotgun sequencing, is a recently developed
method that uses second generation sequencing on a transcriptome to survey the RNA expression
landscape [92, 91]. RNA-Seq is performed by capturing RNA transcripts by their
poly-A tail, converting the RNA sequence to double stranded DNA using reverse transcriptase,
fragmenting, and sequencing using second generation technology [132]. RNA-Seq has been
shown to be effective in profiling the expression level of transcripts [132, 81, 4, 92], as well
as identifying novel transcription events [110, 41, 126, 40].
2.2 Central dogma of molecular biology
Molecular biology is the study of the formation, organisation and activity of macromolecules
essential to life [56]. This is encapsulated by the Central Dogma, which states that the
flow of genetic information in cells is from DNA to RNA to protein [1]. For a given gene, this
can be broken down into two steps: transcription and translation (Figure 2.1). Transcription
is the process of synthesising a chain of RNA oligonucleotides from the sequence of a DNA
template. The resulting oligonucleotide chain, or transcript, is known as the messenger
RNA (mRNA).
Translation is the process of synthesising amino acid polymers by reading the open
reading frame (ORF) found within the transcript sequence. The ORF of a transcript is
the segment of the transcript that is used to encode the amino acid sequence. It is the
chemical properties of the amino acid, or peptide, sequence that give it its structure and
function. The regions outside the ORF of a transcript are called the untranslated regions
(UTRs). Transcripts, like DNA, have a direction of synthesis.
A transcript begins at the 5′ end and terminates at the 3′ end. Relative to
the sequence of the DNA template, mature transcripts gain a 5′ cap containing
a modified guanine nucleotide and a poly-adenylation (poly-A) tail on the 3′ end consisting
of a long run of adenosine nucleotides [1].
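The two steps described in this section can be made concrete with a short sketch. It is a deliberately simplified model under stated assumptions: transcription is represented as rewriting the coding (sense) strand with U in place of T, splicing, capping and poly-adenylation are ignored, the ORF is taken to run from the first AUG to the first in-frame stop codon, and the input is assumed to contain only A, C, G and T. The function names are illustrative; the codon table is the standard genetic code.

```python
# Illustrative sketch only; a simplified model of transcription and translation.
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(coding_strand):
    """Return the mRNA (5' to 3') for a DNA coding strand: T becomes U."""
    return coding_strand.upper().replace("T", "U")


def find_orf(mrna):
    """Return (start, end) of the ORF from the first AUG to the first
    in-frame stop codon (end is exclusive and includes the stop codon),
    or None if no complete ORF exists."""
    start = mrna.find("AUG")
    if start == -1:
        return None
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        if CODON_TABLE[codon] == "*":
            return start, i + 3
    return None


def translate(mrna):
    """Translate the ORF of an mRNA into a peptide string (stop excluded)."""
    orf = find_orf(mrna)
    if orf is None:
        return ""
    start, end = orf
    codons = (mrna[i:i + 3].replace("U", "T") for i in range(start, end - 3, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)
```

For the coding strand `GGATGGCCTTTTAAGG`, the mRNA is `GGAUGGCCUUUUAAGG`, the ORF spans positions 2 to 14, and the peptide is `MAF`; the two flanking `GG` dinucleotides correspond to the 5′ and 3′ UTRs.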
2.3 Non-coding RNA
Despite the fundamental significance of the Central Dogma, we have come to realise important
exceptions to this principle. Of the dry weight of RNA extracted from a cell, only
3-5% consists of mRNA, similar to the proportion of genes that make up the genome [1]. In
contrast, as much as 62% of the mouse genome [125], 85% of the fruit fly genome [80], and
93% of the human genome [8] has been estimated to be transcribed.
Non-protein coding, or non-coding, RNAs are RNA products that are not translated to
proteins after transcription (Figure 2.1). Recently there has been an explosion of research on micro-RNA
(miRNA), its critical role in gene regulation [85, 97], and its implications
for tumorigenesis [111, 13, 84, 131]. miRNAs, along with other small RNAs, were once named
the breakthrough of the year by Science magazine [23]. Overall, there are a number of non-
coding RNA types such as those involved in the translation process, ribosomal RNA (rRNA)
and transfer RNA (tRNA); small non-coding RNAs such as micro RNA (miRNA), small
interfering RNA (siRNA), small temporal RNA (stRNA), small nuclear RNA (snRNA),
small nucleolar RNA (snoRNA), and piwi-interacting RNA (piRNA); and the more elusive long
non-coding RNA (lncRNA), which includes long intergenic non-coding RNA (lincRNA).
ORF features
Protein coding mRNAs have characteristics that are well defined, as explained earlier. ORFs
are mostly thought to be unique to protein coding genes. There are exceptions to
this concept, as bifunctional RNAs have been documented to have functioning ORFs [25, 2].
There are, however, controversies surrounding non-coding RNAs, as the function of many
annotated non-coding RNAs is not known. Of the transcript products found in the FANTOM
database [125], there are reports that many are the result of undegraded
protein coding mRNA, undegraded introns, internal priming, or putative protein coding
genes, and some have low conservation across species [95]. There have also been reports in which
large deletions in gene deserts associated with non-coding DNA had no effect on mice [93].
Recently, comparisons of newer RNA-Seq methods with potentially noisier microarrays have shown
that non-coding RNAs may not be transcribed as widely as once thought [91].
[Figure 2.1 diagram labels. Left: genome, transcription, pre-mRNA (exons, introns), mRNA (5′ cap, 5′ UTR, ORF, 3′ UTR, poly-A tail), translation, peptide sequence, protein. Right: genome, transcription, non-coding RNA, folded non-coding RNA.]
Figure 2.1: The Central Dogma of molecular biology. On the left are the typical transcription and translation steps for a given gene. The end product is a translated amino acid sequence that eventually forms a protein. On the right is the transcription of a non-coding RNA, the 3-D structure consisting of its secondary structure. (1)
———————————
(1) 3-D images from PDB (http://www.pdb.org/) and EBI (http://www.ebi.ac.uk/)
Chapter 3
Related work
Many non-coding RNAs have been known for decades [27], though it is only recently that
computational methods to detect these entities have started to emerge. Many attempts,
using a variety of methodologies, have been made to classify, find, validate and store
non-coding RNAs. In this chapter, we summarise these methodologies.
3.1 Discovery of non-coding RNAs
In this section, we review strategies in the literature that find non-coding RNAs by cate-
gorising the methods into groups based on sequence, structure, comparative genomics, and
scanning methods.
3.1.1 Sequence based approaches
Sequence based methods classify entities as non-coding RNAs or protein coding RNA by
using the primary nucleotide sequence as input. The literature shows that many biologically
relevant features can be extracted from the sequence such as GC content, sequence motifs,
and nucleotide usage. The extracted features can be converted to numerical values that can
be fed into a machine learning model.
CRITICA [6] uses two types of features: comparative genomics features that use DNA
alignment from a DNA database (refer to section 3.1.3), and sequence based features that
compute distributions of hexanucleotides in coding frames and take into account dicodon
biases. DIANA-EST [45] uses artificial neural networks to find coding regions from ESTs.
ESTSCAN [76] also finds the coding regions of ESTs, using a Hidden Markov Model. PORTRAIT [3]
and SOM-PORTRAIT [119] both extract sequence and ORF-related features
and perform classification using support vector machines and artificial neural networks.
CONC [74] and CPC [64] use a large collection of simple features, such as length, amino
acid composition, GC content, nucleotide identity, 3-periodicity, and simple thermodynamics,
as input to a machine learning method that performs the classification; a large share of
their information comes from comparative methods using BLASTX. Creanza et al. [24]
and Re et al. [104] also use a large collection of features to perform classification, the most
effective feature reportedly being synonymous nucleotide substitutions. Clamp et al. [18],
Li et al. [72], Jia et al. [58], and Wu et al. [137] use methods to extract the open reading
frame of transcripts. Siederdissen et al. [117] use covariance models, relying only on sequence
information, to distinguish between many non-coding RNA families.
3.1.2 Secondary structure based approaches
Secondary structure based classifiers assume that functional non-coding RNAs have secondary
structures that can be fully or partially predicted and used to extract properties that
distinguish non-coding RNA from other elements. These properties include stem loop related
features such as prevalence, size and GC content [94, 122], while other strategies
estimate fold energies in both global and local contexts. Despite the fact that the 3′ UTRs
of mRNAs also contain secondary structure [25], a number of secondary structure based
methods have been shown to have reliable rates of success. Another major consideration is
that secondary structure prediction is computationally expensive, forcing workarounds such
as local secondary structure prediction. Such methods scan the input sequences
and, for every window, calculate the local secondary structure and the attributes derived from it.
Xue et al. [139] and Noel et al. [94] use a method of extracting local features within
the largest stem loop to classify real and pseudo miRNA precursors. The miRanalyzer
web tool [42] scans the genome using the local secondary structure prediction program
RNAfold [51] and, for every window, extracts features strongly related to folding and loop
energy such as length, stem length, MFE, and GC content. Classification is done using the random
forest scheme found in the WEKA package [43]. Langenberger et al. [67] scan for RNA
folds in a sliding window along mapped reads. Horesh et al. [52] also use a sliding
window along a genome to find locally stable RNA structures,
and investigate dinucleotide biases that have an effect on the minimal free energies. Childs
et al. [16] build a classifier to infer functionality based on a system where each molecule
of an RNA structure is represented as a graph. miRTRAP [47] assesses features derived from
loops of miRNA to identify miRNAs from high throughput sequencing data.
3.1.3 Comparative Genomics based approaches
Another common method of finding non-coding RNA is to use information from several
sources, such as alignment data from related species. This approach is known as comparative
genomics. These methods are especially useful when genomic and transcriptomic information
from related species is known. Many approaches use a combination of existing tools such as
ClustalW [68], consensus structure prediction, sequence alignment properties [28] and aligned
structure analysis [130, 33, 133, 24].
RNAz [133] was one of the first major methods to predict functional non-coding RNA by
using a combination of sequence alignments, secondary structure and SVM classification.
Dynalign [128] detects non-coding RNAs by predicting secondary structures and thermal
energy for multiple aligned RNAs using a combination of methods including RNAz [133]
and QRNA [107]. Mignone et al. [87] compare the genomes of human and mouse to find
conserved sequences, and evaluate protein coding potential using the notion of conserved
sequence tags (CSTs) to produce blocks of BLAST-like high scoring pairs. Voß et al. [130]
predict non-coding RNAs by using the alignment tool ClustalW [68] and the consensus
structure prediction tool RNAlishapes [129]. Weinberg et al. [134] have uncovered non-coding
RNA by using a number of structure and motif based methods such as CMfinder [140].
CentroidFold [114] is a web server for RNA secondary structure prediction that takes
an RNA sequence along with its alignment as input. Mathelier et al. [83] find miRNA using
five parameters that are heavily influenced by fold properties and energies. Tseng et al. [127]
use genome-scale BLAST searches that combine secondary structure and primary sequence,
applying folded-BLAST to intergenic regions.
3.1.4 Genome scanning / mapping approaches
The last category we investigate consists of methods that find non-coding RNA by
scanning the genome to identify new RNAs. These methods use the genomic sequence
as the primary input and use subtle clues to pinpoint locations of possible non-coding RNAs.
Although these are not directly part of this thesis, their goals and strategies are insightful
for our purposes. This category includes strategies that observe motifs and read alignments
from transcriptomes.
Hiller et al. [48] scan the genome for conserved introns to find novel transcripts, focusing
especially on the set of mRNA-like non-coding RNAs. Salari et al. [112] employ a method
that scans for k-mer motifs along a reference genome. Erhard et al. [30]
and Chol et al. [59] both use mapped reads from transcriptome experiments and mainly
use their position and size to find and classify non-coding RNA on the genome. Hofacker et
al. [50] use local RNA folding on a genome wide scale to discover potential RNA structures.
3.2 RNA databases
In response to the expanding set of non-coding RNAs discovered, a number of databases
have emerged to accommodate their unique characteristics. Many cater to specific types
while others are more inclusive.
Although technically a transcriptome database, FANTOM [125] is known to house many
known and unknown EST sequences including non-coding RNAs. RNAdb [99], fRNAdb [63],
NONCODE [46], and RFam [36] are databases that have their own set of classifications or
family types, and all have a publicly available user interface on their servers. RFam [36]
is a database of published non-coding RNAs that uses various tools, from covariance models
to WU-BLAST, to categorise entries into its extensive set of families. RNAdb [99] is
a database that specifically covers mammalian non-coding RNAs, combining several
sources. fRNAdb [63] is a database that aims to categorise functional RNA candidates and
includes tools for structure motif analysis and EST support evaluation. NONCODE [46]
examines a number of non-coding RNA family types (excluding tRNAs and rRNAs) and
categorises these non-coding RNAs into nine biologically related categories.
The following are databases that are specific to a special niche. miRbase [38] is a
database specifically for miRNAs and lists detailed information on both pre and mature
miRNA structures along with a target prediction pipeline. piRNABank [66] is a database
specifically for PIWI interacting RNAs. Sno/scaRNAbase [138] is a curated database for
small nucleolar RNAs and Cajal body-specific RNAs. NRED [25] is a database containing only long
non-coding RNAs, 200 nucleotides or larger, taken from microarray and in situ hybridisation
experiments for mouse and human. ncRNAimprint [141] is a database of mammalian
non-coding RNAs that are imprinted. lncRNAdb [2] is a database for long non-coding
RNAs that have biological functions in eukaryotic cells and viruses, including functional
mRNAs.
Chapter 4
Classification
The goal of this thesis is to create a practical, accurate and reliable classifier that can
distinguish different classes of transcript sequences from noisy data in real biological settings.
In particular we classify protein coding from non-protein coding RNA, in data derived from
RNA-Seq experiments, i.e. from short sequence reads. Using de novo assembly we generate
transcript contigs that represent the transcriptional landscape.
This chapter describes the concepts of the various aspects of our classifier, SSGC, which
aims to fulfil these goals. Section 4.1 describes concepts of the RNA-Seq reads and their
pre-processing. Section 4.2 describes the features used to classify input sequences. Section
4.3 describes the concepts of the classification and how its performance can be assessed.
4.1 Preprocessing reads
The output of the RNA-Seq procedure consists of very short fragments of RNA sequences.
As we are interested in working with long sequences that depict transcripts, we utilise the
process of assembly to build contig sequences.
4.1.1 Assembly
Assembly is a process in which contiguous sequences, or contigs, are created by piecing
together smaller sequences. ABySS [120] is a popular assembler program as it has been
successfully demonstrated on transcriptome sequencing [9]. ABySS is based on the de Bruijn
graph model, first introduced by Pevzner et al. [100]. This method fits into the category of
de novo assemblers, i.e. one that uses only the short read sequence information, without
any external data source such as the reference sequence.
De Bruijn graphs using short read sequences rely on a given value k, such that sequencing
reads are chopped up into k-mers, or k length subsequences. Each k-mer is represented in
the graph as a node, directed edges represent k − 1 overlaps between adjacent k-mers, and
the paths traversed along edges represent contiguous sequences or contigs assembled from
sequenced reads. One of the challenges with de Bruijn based assemblers is that, depending
on the coverage and the value of k, assembly can produce a high number of fragmented or
non-contiguous contigs [9], though some fragmentation is unavoidable due to repeats and low
coverage. It is also unclear whether assembly is the sole cause of fragmentation, as it can also
be argued that cDNAs such as those found in the FANTOM database are themselves fragmented
versions of longer transcripts [35].
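The graph construction described above can be illustrated with a minimal sketch. It is not the ABySS implementation: the function names are hypothetical, edges are added only between k-mers that are adjacent in some read, and reverse complements, sequencing errors and coverage are ignored.

```python
from collections import defaultdict


def kmers(read, k):
    """Return every k-length subsequence of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Nodes are k-mers; an edge u -> v is added when v directly follows u
    in some read, so the two k-mers overlap by k - 1 characters."""
    graph = defaultdict(set)
    for read in reads:
        mers = kmers(read, k)
        for u, v in zip(mers, mers[1:]):
            graph[u].add(v)
        for m in mers:
            graph.setdefault(m, set())  # keep k-mers that have no successor
    return dict(graph)


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles, which arise from repeats
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTAC", "GTACGG"]          # toy reads sharing the overlap GTAC
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous(graph, "ACGT")
```

A smaller k makes overlaps easier to find at low coverage but creates more ambiguous branches, which is one way the fragmentation discussed above can arise.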
To reduce the number of fragmented short contigs, a merging technique has been shown
to be successful [110]. This technique is based on the strategy of assembling a large set of
contigs using multiple k-mer values, then removing every contig that is a perfect
subsequence of another contig. This procedure is also accompanied by a filtering step to further
reduce the number of small contigs.
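As a rough sketch of the merging idea, the snippet below pools contigs from several assemblies and keeps only those that are not exact substrings of a longer contig. The function name and the `min_length` filter are assumptions for illustration; the cited method may also handle reverse complements and use a different filtering rule.

```python
def merge_contig_sets(contig_sets, min_length=0):
    """Pool contigs from several assemblies (e.g. different k values), then
    drop any contig that is shorter than min_length or is an exact
    substring of a longer contig that has already been kept."""
    pooled = sorted({c for s in contig_sets for c in s},
                    key=lambda c: (-len(c), c))  # longest first
    kept = []
    for contig in pooled:
        if len(contig) < min_length:
            continue
        if any(contig in longer for longer in kept):
            continue
        kept.append(contig)
    return kept


merged = merge_contig_sets(
    [["ACGTACGG", "CGTA"], ["ACGTACGG", "TTTT", "GG"]],
    min_length=3,
)
```

The substring check is quadratic in the number of contigs, which is acceptable for a sketch but would need an index for a real transcriptome assembly.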
4.1.2 Mapping to RNA database
Our approach is not only to run, but also to train, the classifier using contigs; contigs must
therefore be assigned a label from the class definitions. After assembly, contig sequences are mapped
to protein coding and non-coding RNA databases. Based on the mapping criteria and
threshold set, subsets of contigs inherit the labels of the elements in the databases (Figure
4.1). In the case of multiple mappings, contigs are assigned labels in a greedy manner,
based on mapping score. The resulting set of labelled contigs is used to train and test the
classifier.
To assess the performance of the classifier on contig sequences, we first create class labels
for each contig sequence. This is done by mapping each contig sequence to known protein
coding and non-coding sequences, using the BLAT aligner [61] between the annotated database
entries and the contig set. For each contig-annotation pair, we can choose to accept or
reject the pairing by comparing the BLAT alignment parameters bat_c and bat_a, for contig
and annotation respectively, to threshold values. The parameters are calculated as:

\[ bat_c = \frac{numbases_{match}}{length_{contig}}, \qquad bat_a = \frac{numbases_{match}}{length_{annotation}} \]

To find the best annotation mapping for a given contig, we choose the annotation with the
highest score, calculated as

\[ score = bat_c + bat_a \]

The procedure of assigning contigs to annotations consists of the following steps: set a threshold
between 0 and 1; calculate the score for each contig and annotation pair in which each bat term
is above the threshold; from the highest to the lowest score, label the contig with the annotation
and remove all remaining instances of that contig and annotation from consideration.

[Figure 4.1 diagram labels: RNA-Seq reads, assemble & merge, contigs, map, protein coding mRNA database, non-coding RNA database; each mapping is annotated with a pair of scores, e.g. 0.93 ; 0.95.]
Figure 4.1: Overview of the contig assembly and labelling procedure. From short transcriptome reads, contigs are assembled and merged. Contigs are mapped individually to protein coding and non-coding RNA datasets. Contigs inherit the labels of the database elements with the best matched mapping score, which must be above a set threshold. For each mapping score, there are two threshold values, one for the contig and one for the annotation. The labelled contigs are used as training and testing sequences for the classifier.
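The labelling steps above can be sketched as follows. The input tuple layout and the function name are assumptions made for illustration (they do not correspond to BLAT's output format), and a single threshold is applied to both bat terms.

```python
def label_contigs(hits, threshold):
    """Greedy one-to-one labelling of contigs.

    hits: iterable of (contig_id, annotation_id, matched_bases,
                       contig_length, annotation_length).
    Returns {contig_id: annotation_id}.
    """
    scored = []
    for contig, annot, matched, len_c, len_a in hits:
        bat_c = matched / len_c   # fraction of the contig that matched
        bat_a = matched / len_a   # fraction of the annotation that matched
        if bat_c > threshold and bat_a > threshold:
            scored.append((bat_c + bat_a, contig, annot))
    scored.sort(key=lambda t: t[0], reverse=True)  # highest score first

    labels, used_annotations = {}, set()
    for _, contig, annot in scored:
        if contig in labels or annot in used_annotations:
            continue              # contig or annotation already assigned
        labels[contig] = annot
        used_annotations.add(annot)
    return labels


hits = [
    ("c1", "mRNA_A", 90, 100, 100),
    ("c1", "ncRNA_B", 80, 100, 90),
    ("c2", "mRNA_A", 85, 100, 100),
    ("c2", "ncRNA_B", 81, 100, 90),
    ("c3", "mRNA_A", 50, 100, 100),   # fails the threshold
]
labels = label_contigs(hits, threshold=0.75)
```

In this toy input, c1 takes mRNA_A first because that pair has the highest score, so c2 falls back to ncRNA_B even though it also maps well to mRNA_A.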
4.2 Feature extraction
Given a set of sequences, the classifier attempts to separate the set into classes, whether
that be protein coding and non-coding, or non-coding RNA family types. This is done by
extracting features, or properties obtained from the sequence. This section describes the
features used by the classifier. The features are categorised as sequence based features,
structure based features, and genomic map based features, represented in Figure 4.2 and
further expanded in Table 4.1. The following sections describe the features at a conceptual
level, and section 5.1 provides further details on the implementation.
4.2.1 Sequence based features
Various methods found in the literature have explored features directly computed from the
sequence itself. The functional unit of a protein is the peptide folded in three dimensions,
while the functional unit of many non-coding RNAs is their secondary structure.
The selection pressures on these functional units are responsible for many features that
are embedded in the sequence information of coding and non-coding RNA transcripts [117].
This section explains the methods involved in extracting sequence based features from a given
sequence.
Nucleotide usage
From the four nucleotides that make up the alphabet used in RNA, there are reports of
biases in the nucleotide composition of certain transcript types. One way to measure
the composition is to compute the distribution of unigrams, bigrams, and trigrams over the
entire length of the transcript. This yields 84 features, one for each possible
word: 64 possible trigrams, 16 possible bigrams, and 4 possible unigrams. An
alternative is to compute a single feature, GC content (essentially the merging of two
bins, C and G, divided by the total number of nucleotides), which has been used in the
past to distinguish coding from non-coding transcripts [67, 104]. These approaches rely on the
observation that the GC content of protein coding sequences is approximately 50%, statistically
distinct from that of intergenic sequences [79, 24].

[Figure 4.2 diagram labels: Reads; Contigs (training); Contigs (testing); Reference Genome; feature categories Sequence (GC, Length, ORF, Nuc. comp., …), Secondary structure (Loop length, Bulges, Stem length, Loop GC, …), Genome mapped (Exon coverage, Conservation, Chromatin, …); SVM; Model.]
Figure 4.2: The classification approach starting from the sequence reads down to the testing of RNA transcripts. We propose a classifier that draws on three categories of features based on sequence, secondary structure, and genome mapped data, which we name the Sequence-Structure-Genome Classifier (SSGC). For de novo experiments, we only consider sequence and secondary structure based features.

Category             Feature name                     Number of features
Sequence             GC content                        1
                     Length                            1
                     Nuc. composition (1,2,3-mers)    84
Sequence - ORF       ORF size                          1
                     framefinder                       6
                     Comp Entropy                      1
                     Isoelectric point                 1
                     Mean hydropathy                   1
                     a.a. composition                 20
Secondary            Total MFE                         1
                     Best MFE window                   1
                     Min. stem energy                  1
                     Stem length                       1
                     Stem GC                           1
                     Stem loop GC                      1
                     Stem bulge asym                   1
                     Stem bulge sym                    1
                     Stem bulge total                  1
                     Stem max bulge                    1
                     Triplet-SVM feats                32
Genomic              Num exons                         1
Genomic - Conserv    Exons conserved                   1
                     Total score                       1
                     Bases conserved                   1
                     Bases conserved with coverage     1
                     Mean coverage                     1
Genomic - Histone    Exons conserved                   1
                     Total score                       1
                     Bases conserved                   1
                     Bases conserved with coverage     1
                     Mean coverage                     1
Total                                                169
Table 4.1: Features available from the prediction model. Sequence and secondary structure based features make up the de novo set of features. The concepts of the features are described in section 4.2, and the implementation in section 5.1.
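A minimal sketch of the nucleotide composition and GC content features follows. The normalisation (overlapping word counts divided by the number of windows of that word length) is an assumption; the thesis does not specify it at this point.

```python
from itertools import product

ALPHABET = "ACGU"
# 4 unigrams + 16 bigrams + 64 trigrams = 84 words
WORDS = ["".join(p) for n in (1, 2, 3) for p in product(ALPHABET, repeat=n)]


def nucleotide_composition(seq):
    """Return 84 relative frequencies, one per word in WORDS."""
    seq = seq.upper().replace("T", "U")
    features = []
    for word in WORDS:
        n = len(word)
        windows = len(seq) - n + 1
        if windows <= 0:
            features.append(0.0)
            continue
        count = sum(1 for i in range(windows) if seq[i:i + n] == word)
        features.append(count / windows)
    return features


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


feats = nucleotide_composition("ACGU")
gc = gc_content("GGCCAU")
```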
Length
Among the non-coding RNA families, two classes, tRNAs and miRNAs, stand out as they
have a well defined structure and length [1]. As such, mining for these particular non-coding
RNAs in a large dataset has been shown to be possible by restricting the length of the transcript
and/or the secondary structure [67, 47]. Non-coding RNAs can vary greatly in length:
transcripts smaller than 200 nucleotides are often associated with microRNAs,
PIWI-associated RNAs, and endogenous small interfering RNAs [25], whereas RNAs in the long
non-coding RNA class have transcripts of the same order of magnitude as protein coding genes, with
some transcripts as large as a hundred kilobases in length [99].
ORF features
Protein coding mRNAs have characteristics that are well defined: they have a 5′ cap, 5′
and 3′ untranslated regions, an open reading frame and a polyadenylated tail [1] (refer to
Figure 2.1). The portion of RNA that becomes translated to a peptide sequence is called
the open reading frame (ORF), and this is mostly thought to be unique to protein
coding genes; exceptions to this rule are bifunctional RNAs, which are documented to have
functioning ORFs [25, 2].
A crude way to detect ORFs within a transcript sequence is to search the six reading
frames for the longest stretch that begins with a start codon
and ends with a stop codon. Better and more robust methods have been proposed by
Slater et al. [121] and Shimizu et al. [116], which use machine learning methods that take into
account erroneous input sequences and frameshifts.
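The crude six-frame search can be sketched as below. This is only the naive approach, not the method of Slater et al. or Shimizu et al.; it assumes ATG as the only start codon and the three standard stop codons, and it does no error or frameshift handling.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return the longest ATG...stop stretch (stop codon included) found in
    any of the six reading frames, or "" if there is none."""
    seq = seq.upper().replace("U", "T")
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                    # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if i + 3 - start > len(best):
                        best = strand[start:i + 3]
                    start = None                 # look for the next ORF
    return best


orf = longest_orf("CCATGAAATGGTAGCC")
```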
Once an ORF is predicted, we can investigate protein coding biases such as the
log-odds score, compositional entropy, amino acid composition, isoelectric point, and
mean hydropathy. A drawback is that, if a protein coding gene’s ORF
is mis-predicted, the features derived from it will likely yield poor results.
The amino acid composition is the makeup of amino acids used in the peptide sequence;
it can be measured as a histogram of amino acid unigrams. This is a crude way
to distinguish a real peptide from the assumed random peptide sequence expected from a
non-coding RNA. The log-odds score is an effective and often used measure of the likelihood that a
given sequence is not from a random source. It makes use of the fact that, of the 64
possible codon triplets, there are heavy biases in the usage found in nature. By measuring
the in-frame nucleotide usage, the log-odds score gives a measure of the quality of the
sequence [137].
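One common way to write such a score is as a sum of per-codon log ratios between a coding model and a background model. The sketch below assumes a uniform background of 1/64 per codon and a user-supplied table of coding codon frequencies; both are illustrative assumptions, and the exact scoring used in [137] or in this thesis may differ.

```python
import math


def codon_log_odds(orf, coding_freq, background=1 / 64, floor=1e-6):
    """Sum of log2(P_coding(codon) / P_background(codon)) over in-frame codons.

    coding_freq: hypothetical dict mapping codon -> frequency in known
    coding sequences; codons missing from the table get `floor`.
    """
    orf = orf.upper().replace("U", "T")
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        p = max(coding_freq.get(orf[i:i + 3], 0.0), floor)
        score += math.log2(p / background)
    return score


toy_table = {"ATG": 1 / 32}            # toy value, twice the background rate
score = codon_log_odds("ATGATG", toy_table)
```

A positive score means the in-frame codons are, on balance, more typical of the coding table than of the background; a negative score means the opposite.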
Compositional entropy describes the degree of low-complexity regions
that can occur in the peptide sequence of the ORF. Low-complexity regions are repetitive
or homopolymeric stretches, often of Ser, Asn, Gln, Asp, Glu and Thr residues [37], found
in peptide sequences that code for nonglobular domains.
Although their function is not known, this is a
well documented trait found in many protein coding genes [101].
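As an illustration, one simple proxy for compositional entropy is the Shannon entropy of the amino acid composition of the whole peptide. This definition is an assumption; low-complexity detection tools typically compute entropy over local windows rather than over the full sequence.

```python
import math
from collections import Counter


def compositional_entropy(peptide):
    """Shannon entropy (bits) of the residue composition of a peptide.
    Low values indicate a repetitive, low-complexity sequence."""
    total = len(peptide)
    if total == 0:
        return 0.0
    counts = Counter(peptide)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


homopolymer = compositional_entropy("SSSSSSSS")   # a single residue type
alternating = compositional_entropy("SNSNSNSN")   # two residue types, equal use
```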
The isoelectric point of a protein is the pH at which it has no net charge. By examining
the amino acid side chains of a peptide, the buffering characteristics can be determined at
different pH levels. Since living systems have very narrow ranges of pH, it is expected that
peptide sequences would also have a narrow range of isoelectric points to be useful in a
living organism [1].
Hydropathy is used here to measure how hydrophobic regions of a peptide are, i.e.
whether they are polar or non-polar depending on the side chains of the amino acids used.
Kyte and Doolittle [65] proposed a method to calculate the hydropathy character of a
protein. Here we use the mean hydropathy across the entire length of the peptide sequence,
which may be problematic due to residues hidden in globular pockets in a folded protein
structure.
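A minimal sketch of the mean hydropathy feature, using the Kyte and Doolittle scale values as commonly tabulated. Skipping residues that are not among the 20 standard amino acids is an assumption made for illustration.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes)
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide):
    """Average hydropathy over all standard residues in the peptide."""
    values = [KYTE_DOOLITTLE[aa] for aa in peptide.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


hydrophobic = mean_hydropathy("AAAA")
balanced = mean_hydropathy("IR")     # 4.5 and -4.5 cancel out
```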
4.2.2 Secondary structure based features
RNA secondary structure
Some non-coding RNA types are known to have secondary structures that are key to their
function, such as ribosomal RNAs and tRNAs. Here we assume that there are no significant
secondary structures associated with protein coding RNAs. In a long chain of ribonucleotides,
secondary structures result from segments of intramolecular base pairing, giving
distinguishable structures such as stems, loops and bulges. Given a ribonucleotide
sequence, the most likely secondary structure would be the one with the lowest free energy
among all candidate structures. However, computing all possible candidates is infeasible
due to the sheer number of possible structures [142]. Lyngsø and Pedersen [78] show that
prediction of secondary structures taking into account pseudo-knots is NP-complete.
Zuker and Stiegler [143] describe an O(n^3) dynamic programming algorithm that
assumes a simplistic thermodynamic model and disregards pseudo-knots.
The Vienna package [50] contains an implementation of this global secondary structure
prediction, in addition to an O(nl^2) local secondary structure prediction that only considers
sub-structures within a sliding window of size l of the input sequence. It has been shown that
non-coding RNAs can be reliably detected solely by using local structures such as hairpins and
stemloops [31].
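To give a feel for the O(n^3) dynamic programming recurrence, the sketch below implements the simpler Nussinov-style base-pair maximisation rather than the Zuker and Stiegler energy model or the Vienna package implementation. It is pseudo-knot free, counts pairs instead of minimising free energy, and the minimum loop size of 3 is an assumption.

```python
def can_pair(a, b):
    """Watson-Crick and G-U wobble pairs."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested (pseudo-knot free) base pairs in seq."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                    # case: j is unpaired
            for k in range(i, j - min_loop):          # case: j pairs with k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


hairpin_pairs = max_base_pairs("GGGAAACCC")   # three G-C pairs around an AAA loop
```

The three nested loops over span, i and k give the cubic running time that makes global folding of long transcripts expensive.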
We examine RNA folding ability for each of the transcripts by predicting the pseudo-knot
free secondary structure. Given their success in distinguishing miRNA and pre-miRNA,
we focus on the quality of stem loops as shown in Xue et al. [139] and Hackenberg et
al. [42]. By extracting the longest stem loops, these methods are able to extract features
based on the length, GC content, number of symmetric and asymmetric bulges and structure
motifs, and feed them to a machine learning program to make their predictions. In addition
to these features, we also extract the triplet-SVM features proposed by Xue et al. [139].
Given a secondary structure represented by an alphabet of brackets and dots, together with
the ribonucleotide sequence, we compute the occurrence of each of the eight possible
structure trigrams (combinations of dots and brackets) for each of the four RNA bases that can
occupy the middle position, giving 32 features. The eight trigrams are: (((, ((., (.(, (.., .((, .(., ..(, and ... .
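The 32 triplet features can be sketched as follows. Treating both "(" and ")" as "paired" follows the description in Xue et al.; counting over the whole sequence and normalising to frequencies are simplifying assumptions, and the names are hypothetical.

```python
from itertools import product

BASES = "ACGU"
TRIPLETS = ["".join(p) for p in product("(.", repeat=3)]   # 8 structure trigrams
FEATURE_NAMES = [b + t for b in BASES for t in TRIPLETS]    # 4 * 8 = 32


def triplet_features(seq, structure):
    """Frequency of each (middle base, structure trigram) combination.

    structure is dot-bracket notation of the same length as seq."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    seq = seq.upper().replace("T", "U")
    paired = structure.replace(")", "(")      # only paired vs unpaired matters
    counts = dict.fromkeys(FEATURE_NAMES, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:                      # skips non-ACGU bases
            counts[key] += 1
    total = sum(counts.values())
    return [counts[name] / total if total else 0.0 for name in FEATURE_NAMES]


feats = triplet_features("GCAUGC", "((..))")
```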
There is clearly a potential in investigating secondary structures but at the same time
a limitation of exclusively examining dynamic programming solutions. One of the major
drawbacks is that dynamic programming solutions work to get the minimum free energy
structure; however, the biologically functional RNA product is not always the candidate
structure with the minimum free energy [115].
Another practical issue is that computing structural motifs will be very computationally
expensive. It is expected that many large transcripts will significantly increase the running
time. In that case, we have two alternative options, either to only compute small contigs
below a certain size cutoff, or to run only localised structure predictions in a sliding
window. Both strategies can potentially limit the structures predicted, and can additionally
be affected by the selection of size thresholds and step sizes. Our approach utilises the
sliding-window approach in the experimentation.
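A sliding-window scan of this kind can be sketched as below. The window and step values are placeholders, and adding a final window so the tail of the sequence is always covered is an assumption rather than a documented detail of the approach.

```python
def sliding_windows(seq, window, step):
    """Return (start, subsequence) pairs covering seq with fixed-size windows.

    A final window is added if the regular steps would leave the end of
    the sequence uncovered."""
    if len(seq) <= window:
        return [(0, seq)]
    last_start = len(seq) - window
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, seq[s:s + window]) for s in starts]


windows = sliding_windows("ABCDEFGHIJK", window=4, step=3)
```

Each window would then be passed to a local structure predictor, and a feature such as the best window energy taken as the minimum over all windows.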
4.2.3 Genomic map based features
Genomic mapped strategies use data that are mapped onto genome coordinates. With
the ability to map transcripts back to the originating genome, several pieces of information
become available. The two strategies used in this thesis are to observe the splicing patterns
of a transcript and to mine data associated with the bases at a transcript’s
genomic coordinates. As such, we are limited to using data for a species with a known
reference genome, which excludes these features from de novo type experiments.
For this thesis, we focus on extracting features relating to the number of exons predicted
and mapped as well as extracting data from the regions each transcript or contig maps to,
namely scores relating to evolutionary conservation and histone modifications explained in
the subsequent sections.
Evolutionary conservation
Genomic conservation is a tool to measure evolutionary distance between two or more species
at a particular location. Incorporated in our classifier, it is useful for measuring specific
sequences on the genome that are conserved, in order to detect functional regions in the
genomes [44, 75, 12, 60, 82, 136]. Analysing sequenced genomes and data from comparative
genomic studies, it has been shown that large portions of the genome are functional elements
that have not been identified [19, 15, 21, 20, 118, 89].
Two algorithms are often used to measure the conservation between species at a base-by-base
level on a reference genome: VISTA [34] and Phastcons [118]. Phastcons is an HMM
based program that uses phylogeny and genome alignments to calculate conservation between
multiple species, whereas VISTA calculates conservation between pairs of species.
In the context of classification, it is widely accepted that protein coding RNAs are
conserved [1]; however, there are inconsistent reports of conservation levels between protein
coding RNAs and non-coding RNAs. Studies have shown that long non-coding RNAs are
conserved across species to varying degrees [5, 17, 39, 57]. In contrast, it has also been
reported that conservation is expected only in short non-coding RNAs, whereas longer
non-coding RNAs are not expected to be conserved [98].
Histone modification data
The development of next-generation sequencing has not only provided more throughput and
lower costs, it has found its way into many different applications. Chromatin
immunoprecipitation (ChIP) is one such technology that utilises this powerful sequencing technology.
First described by Solomon et al. [123], ChIP uses cross linking between protein and DNA
to produce genome wide maps of where transcription factors bind. ChIP-Seq expands this
method by introducing next-generation sequencing and mapping to rapidly determine a map
of transcription factor binding sites [109].
Using ChIP-Seq technology, discovering sites of histone modifications associated with
gene expression has been shown to be successful in studying their transcription factor
binding [108]. In addition, chromatin state maps [88] have also been used to discover a large
set of long intergenic non-coding RNAs [39]. In this thesis, we investigate the effect of
using chromatin state maps in our classifier for the task of distinguishing protein coding and
non-coding RNAs.
4.3 Classification
The primary goal of the classifier is to accurately detect whether an input RNA sequence
originated from a protein coding or a non-coding gene. The secondary objective is to further
classify a sequence that is predicted to be non-coding into its non-coding RNA family types.
To make the decision, the classifier makes use of features extracted from the three categories
of features described above. We investigate the classifier in two settings: one to assess the
performance by performing cross-validation of all contigs that map to known annotated
protein coding and non-coding sequences, and the other by running the classifier on the full
contig set to create a list of contigs ranked by prediction confidence. In both the training and
testing steps, features are processed and are ultimately fed into a support vector machine
that makes up the classifier model.
4.3.1 Support vector machines
The main engine used in determining the class and family types of RNA is a support vector
machine (SVM), a popular method used in classification, regression and novelty
detection [10]. SVMs have become particularly useful in classification problems in computational
biology due to their high accuracy, robustness with large, high-dimensional data and
flexibility with diverse data sources [7]. SVMs model classification problems by representing data
as points in high dimensional space. Within that space, SVM models learn a hyperplane
which maximally separates the two classes of a training dataset. SVM models are then used
to classify new instances [22, 135].
4.3.2 Performance evaluation
A standard procedure to assess the accuracy of a model consists of splitting a dataset into
training and testing sets; a model is created with the training set and is evaluated with
the test set. Cross validation is an alternative to this approach that uses multiple rounds
of training and testing. This is especially useful when the size of the dataset is limited.
One such type is K-fold cross validation. It is performed by splitting the dataset into K
partitions; an SVM is trained using K − 1 partitions and evaluated with the remaining
partition. This is repeated so that each partition is used once for evaluation [22, 135].
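The K-fold procedure described above can be sketched as follows. This is a minimal illustration only: the function name, the shuffling step and the fixed seed are assumptions made for the example, and the thesis itself relies on the cross validation provided by WEKA rather than on code like this.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross validation."""
    indices = list(range(n_samples))
    # Shuffling before partitioning is an assumption of this sketch.
    random.Random(seed).shuffle(indices)
    # Partition the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Train on the K - 1 folds that are not held out.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Five folds over ten hypothetical samples.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Each sample appears in exactly one test partition, so every sequence is evaluated once by a model that did not see it during training.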
In this thesis, we utilise cross validation to assess the performance of the classifier in
both the binary and multiclass classification problems. As SVMs are binary classifiers that
can only handle two classes, multiclass problems are addressed using strategies that run
multiple rounds of one-against-one or one-against-all classification and combine the
results by voting. For our classifier, we rely on the one-against-one implementation [54].
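A one-against-one scheme with voting can be sketched as below. The pairwise "classifiers" here are hypothetical length thresholds invented purely for the illustration (they are not features or cut-offs from this thesis), and the tie-breaking rule is likewise an assumption; in practice each pairwise decision would come from a trained binary SVM.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, binary_classifiers):
    """Predict a class by majority vote over all pairwise binary classifiers."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        # Each pairwise classifier returns either a or b for the input x.
        votes[binary_classifiers[(a, b)](x)] += 1
    # Ties are broken by the order of `classes` (an assumption of this sketch).
    return max(classes, key=lambda c: votes[c])


# Hypothetical toy rules that use only the sequence length.
classes = ["miRNA", "tRNA", "snoRNA"]
binary_classifiers = {
    ("miRNA", "tRNA"): lambda length: "miRNA" if length < 50 else "tRNA",
    ("miRNA", "snoRNA"): lambda length: "miRNA" if length < 80 else "snoRNA",
    ("tRNA", "snoRNA"): lambda length: "tRNA" if length < 110 else "snoRNA",
}

predictions = [one_vs_one_predict(n, classes, binary_classifiers) for n in (22, 75, 150)]
```

With n classes, this strategy trains n(n − 1)/2 binary models, each on only the samples of its two classes.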
For each classification experiment, the accuracy, precision and recall are calculated.
These are evaluated based on the true counts (TP and TN) and the false counts (FP, FN)
from the confusion matrix (Table 4.2).
Accuracy is the proportion of correct predictions out of the total sample size [96].
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision is the proportion of samples predicted as positive that are actually
positive [96].
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
Recall is the proportion of actually positive samples that were correctly predicted as
positive [96].
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
                                      Predicted Class
                          Positive                     Negative
Actual Class   Positive   True Positive Count (TP)     False Negative Count (FN)
               Negative   False Positive Count (FP)    True Negative Count (TN)

Table 4.2: Confusion matrix (or coincidence matrix) for a two-class classification problem. The correct predictions, true positives and true negatives, are shaded while the erroneous predictions, false positives and false negatives, are not.
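The three measures can be computed directly from the four cells of the confusion matrix. The sketch below uses made-up counts purely as an example; returning 0.0 when a denominator is zero is a convention chosen for this illustration rather than something specified in the thesis.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (a convention of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, not taken from any experiment in this thesis.
accuracy, precision, recall = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

Note that precision and recall depend on which class is treated as positive, which is why the result tables state the class for which they are reported.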
4.3.3 Cross validation evaluation
We evaluate the performance of the classifier on annotated sequences, that is, sequences
whose class is known. This makes it possible to measure the performance of the classifier
under different settings.
Binary coding vs. non-coding classification
SSGC is first applied to binary classification: differentiating coding from non-coding
RNA sequences. Superficially, the two sets of sequences can be similar, as they are
composed of the same alphabet and overlap in sequence length. Using the features of the
SSGC, we demonstrate its ability to predict the class of input sequences. This is performed
using SVMs with cross validation on sequences with known classes or on annotated contigs.
Multiclass RNA family classification
Many strategies found in the literature perform their classification based on the two crude
classes of non-coding RNA and protein coding mRNA. This can be a naive approach, as
non-coding RNAs have many family types that differ in size, structure and function. Our
classifier attempts to distinguish not just non-coding RNA from protein coding RNA, but
also among the multiple non-coding families. The family types that we apply our classifier
to include piRNA, miRNA, pre-miRNA, snoRNA, snRNA, tRNA and rRNA. To solve this multiclass
problem, we look to a one-versus-one implementation of the support vector machine
classifier. In addition to the different classes, we investigate a multi-phase classifier
that performs multiclass classification once protein coding sequences are removed.
4.3.4 Full contig prediction
Applying the classifier to labelled sequences makes it possible to evaluate the classifier.
However, this limits its use to sequences that are already known and classified. In
particular, its application to assembled contigs can only be evaluated for contigs that
map to known sequences. Although the performance cannot be directly determined, we
investigate the ability to predict the class of the entire contig set.
Classification of the entire contig set is achieved by first training an SVM model using
the subset of contigs mapped to known sequences. The model can then be applied to the
entire contig set to predict the class and the confidence of each contig (refer to Figure 4.3).
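The final ranking step can be sketched as follows, assuming each contig has already been assigned a confidence value between 0 (strongly protein coding) and 1 (non-coding) by the trained model. The contig names, the values and the 0.5 decision threshold are all hypothetical and serve only to illustrate the ordering.

```python
def rank_contigs(p_values, threshold=0.5):
    """Sort contigs from strongly protein coding (0) to non-coding (1)."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    # The 0.5 cut-off between the two classes is an assumption of this sketch.
    return [
        (contig, p, "non-coding" if p >= threshold else "protein coding")
        for contig, p in ranked
    ]


# Hypothetical confidence values for three contigs.
ranking = rank_contigs({"contig_3": 0.97, "contig_1": 0.02, "contig_2": 0.41})
```

Contigs at either end of the ranked list are the most confident predictions and are therefore the natural candidates for follow-up.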
4.3.5 Feature set ranking
We also investigate the effectiveness of our feature set. It is possible that some features
will not be available for some datasets. In addition, many features do not apply to all
possible transcript types. Notably, numerous features associated with the ORFs of proteins
do not apply to non-coding RNA and, analogously, secondary structure features do not apply
to protein coding genes. If a transcript can be identified as a protein coding gene, we
would be uninterested in measuring the degree of secondary structure, just as we would be
uninterested in computing ORF features for non-coding RNA. Computing unneeded features can
be a strain on resources.
We investigate the features that are the most effective in our classification experiments.
Once the feature set is assessed, we propose that subsets of features be called upon under
certain conditions. Ultimately, we envision a multi-step classifier, one that will have
multiple feature extraction and classification steps. For this thesis, we are interested in
first separating transcripts representing all genes, and then separating the transcripts
into the multiple classes, as shown in Figure 1.1.

[Figure 4.3 diagram: normalised feature vectors of the full contig set; contigs mapped to protein coding sequences and contigs mapped to non-coding sequences are used to train the SVM model; the SVM model predicts protein coding or non-coding for the full contig set; contig predictions are ranked by p-value.]
Figure 4.3: Contig prediction procedure for the full contig set. A subset of contigs mapped to protein coding and non-coding sequences from Ensembl and fRNAdb, respectively, are used to train an SVM model. The SVM model is used to classify the entire contig set, predicting the class and p-value for each contig. The p-value allows the contigs to be ranked, from strongly protein coding (0) to non-coding (1).
Chapter 5
Implementation
This chapter describes the steps taken to construct the classifier, and to run the experi-
ments. Section 5.1 describes the steps involved in computing the features from a set of
sequences. Section 5.2 describes the steps used to assess the classifier performance, predict
novel transcripts, and to rank the features used.
5.1 Feature extraction
The classifier is designed to distinguish one set of sequences from another using a number
of feature extraction strategies. Feature extraction was designed as a set of modular tools
that can be turned on or off depending on the data available, their effectiveness, and the
time and space constraints of the system used. The central programs are accessible from the
command line and are controlled using a set of arguments as well as a configuration file.
In total, 169 features are configured for the classifier: 159 are de novo and an additional
10 are genome based. Table 4.1 lists the features used by the classifier. These features are
fed to a support vector machine that makes up the core of the model building and decision
making process. The following sections explain in detail each of the components used in
the feature extraction procedure.
5.1.1 Sequence based feature extraction
Programming for sequence based feature extraction was done in Perl in a UNIX environment.
Perl was used to manage the components of the system, to perform some of the feature
extraction calculations, and as the scripting language that invoked the classification
tools.
Perl was used for feature extraction for the following feature types: GC content, length,
nucleotide composition, amino acid composition, ORF analysis and, through the BioPerl
libraries [124], isoelectric point and mean hydropathy. The pK values of the amino acid side
chains used to calculate the isoelectric point were based on the values found in the EMBOSS
toolkit [106]. Mean hydropathy was calculated using a BioPerl implementation of the
method proposed by Kyte and Doolittle [65].
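Two of these features are simple enough to sketch. The code below is an illustrative Python version, not the Perl/BioPerl code used in the thesis: the function names are hypothetical, the hydropathy values are the standard Kyte and Doolittle scale, and averaging over the whole translated sequence (skipping unrecognised residues) is an assumption about how the mean is taken.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_hydropathy(protein):
    """Mean Kyte-Doolittle hydropathy over the recognised residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    # Unrecognised symbols (e.g. X or *) are skipped: an assumption of this sketch.
    return sum(values) / len(values) if values else 0.0
```

For example, `gc_content("GGCCAATT")` is 0.5, and `mean_hydropathy("AI")` is the mean of 1.8 and 4.5.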
To extract the ORF from a transcript or contig sequence, the ESTate package [121] was
used, as it is specially tailored to handle potential sequencing and frameshift errors in
the input data, making it ideal for assembled contigs. The training data was used to extract
the word usage and probabilities, and framefinder was used both to extract the ORF and to
calculate the log-odds score.
Low-complexity regions were detected using the Compositional Bias Detection Algo-
rithm [102] using the default values. The compositional entropy feature was calculated by
taking the number of masked residues divided by the total length of the ORF.
5.1.2 Secondary structure feature extraction
We examine the folding potential of each of the transcripts using tools from the
Vienna package [49, 50]. We have the option of running either full secondary structure
prediction using RNAfold or local secondary structure prediction using RNALfold. In the
interest of running time, we perform all our tests using local secondary structure
prediction, with the span size set to 150 bp.
From the output of these structure prediction programs, we extract the longest stem loop
using a modified version of code available from Xue et al. [139]. This also gives us the
32 triplet-SVM features, which are 3 character motifs from the structure sequence made up
of dots (unpaired bases) and brackets (paired bases) for each of the four possible bases A,
C, G, and U. Once we extract the longest stem loop, we extract features for the stem length,
minimum free energy in hairpin, loop length, loop GC, asymmetric bulges, symmetric bulges,
and the longest bulge.
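The 32 triplet features arise from 8 possible structure triplets combined with 4 bases. The sketch below is one plausible reading of that encoding and is not the modified code used in the thesis: it assumes that opening and closing brackets are both treated as "paired", that the base is taken from the middle position of each triplet, and that counts are normalised to frequencies.

```python
from itertools import product

BASES = "ACGU"
# Eight structure triplets over "(" (paired) and "." (unpaired).
STRUCT_TRIPLETS = ["".join(t) for t in product("(.", repeat=3)]
FEATURE_NAMES = [b + s for b in BASES for s in STRUCT_TRIPLETS]  # 4 x 8 = 32


def triplet_features(seq, structure):
    """Frequency of each (middle base, structure triplet) combination."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    seq = seq.upper().replace("T", "U")
    # Treat ")" like "(": only paired versus unpaired is kept (assumption).
    paired = structure.replace(")", "(")
    counts = dict.fromkeys(FEATURE_NAMES, 0)
    for i in range(1, len(seq) - 1):
        key = seq[i] + paired[i - 1:i + 2]
        if key in counts:  # skips ambiguous bases such as N
            counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# A toy hairpin in dot-bracket notation.
features = triplet_features("GGGAAACCC", "(((...)))")
```

The toy hairpin has seven interior positions, each producing a different triplet, so every observed feature has frequency 1/7.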
5.1.3 Genomic map based feature extraction
For non-de novo experiments, where we have the reference sequence available, we can observe
the splicing patterns of the transcript, and take into account the number of exons as well
as their placement. For assembled contig sequences, genome coordinates are predicted using
BLAT [61] for each contig, mapped to the mouse mm9 (NCBI m37) reference genome. For
multiple genomic candidates, a single coordinate is chosen based on the highest score:
\[ \mathrm{score} = n_{\mathrm{match}} - n_{\mathrm{mismatch}} - n_{\mathrm{query\,inserts}} - n_{\mathrm{target\,inserts}} \]
Using the information from BLAT, the best alignment for each contig sequence can
then be used to predict the number of exons present as well as its coordinates on the
reference genome.
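Selecting the best alignment by this score can be sketched as below. The dictionary keys and the example hits are hypothetical; how they correspond to the columns of BLAT's output, and the use of the alignment block count as an exon estimate, are assumptions of this illustration rather than details stated in the thesis.

```python
def alignment_score(hit):
    """Score an alignment as matches minus mismatches and insert counts."""
    return (
        hit["match"]
        - hit["mismatch"]
        - hit["query_inserts"]
        - hit["target_inserts"]
    )


def best_alignment(hits):
    """Return the candidate alignment with the highest score."""
    return max(hits, key=alignment_score)


# Two hypothetical genomic candidates for one contig.
hits = [
    {"locus": "chr1:1000", "match": 480, "mismatch": 10,
     "query_inserts": 2, "target_inserts": 3, "block_count": 4},
    {"locus": "chr7:5000", "match": 495, "mismatch": 2,
     "query_inserts": 0, "target_inserts": 1, "block_count": 2},
]
best = best_alignment(hits)
# Assumption: each aligned block is counted as one exon.
predicted_exons = best["block_count"]
```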
Evolutionary conserved regions
Genomic conservation is used to score mapped regions of transcripts. This value is calculated
using PhastCons [118], a multi-species conservation algorithm. The values used were
based on the mm9 mouse model trained on 30 vertebrate species, available from the UCSC
server [105]. The conservation scores for each individual base pair in the mapped
regions are used to calculate the mean conservation score across all exons, the proportion
of transcript with conservation, and number of exon blocks with conservation.
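The three conservation features can be sketched from per-base scores grouped by exon block. The thesis does not state here what counts as "with conservation", so the 0.5 score threshold below is purely a placeholder assumption, as are the function name and the example scores.

```python
def conservation_features(exon_scores, threshold=0.5):
    """Summarise per-base conservation scores given as one list per exon block."""
    all_scores = [s for exon in exon_scores for s in exon]
    if not all_scores:
        return {"mean_score": 0.0, "conserved_fraction": 0.0, "conserved_exons": 0}
    # A base is called conserved if its score reaches the (assumed) threshold.
    n_conserved = sum(1 for s in all_scores if s >= threshold)
    return {
        "mean_score": sum(all_scores) / len(all_scores),
        "conserved_fraction": n_conserved / len(all_scores),
        "conserved_exons": sum(
            1 for exon in exon_scores if any(s >= threshold for s in exon)
        ),
    }


# Hypothetical scores for a transcript with three exon blocks.
result = conservation_features([[0.9, 0.8, 0.1, 0.2], [0.0, 0.1], [0.6, 0.4]])
```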
Histone expressed regions
Similar to the evolutionary conserved regions, mapped regions can be used to calculate
scores based on any method that can be mapped to the reference genome. We apply this
method using signals derived from ChIP-Seq profiles of histone H3 lysine 4 trimethylation
in an adult mouse liver library [108]. The score is calculated as the number of aligned
tags from the ChIP-Seq experiment divided by the overall length of the transcript.
5.2 Classification
From the set of features extracted from a sequence, classification is performed to
ultimately predict the class of that sequence.
5.2.1 Support vector machine
We used LIBSVM [14] under the WEKA [43] WLSVM implementation [29]. Features were
extracted in the same way for both the training and testing datasets. Missing values were
replaced using weka.filters.unsupervised.attribute.ReplaceMissingValues, and all entries
were normalised to values ranging from -1 to 1.
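The two preprocessing steps can be illustrated in a few lines. This sketch assumes numeric features, replaces missing values (represented as None) with the column mean, and rescales each column linearly to the range -1 to 1; fitting the parameters on the training rows and reusing them elsewhere is an assumption of the example, and the thesis itself uses the WEKA filters rather than this code.

```python
def fit_scaler(train_rows):
    """Compute per-column (mean, min, max), ignoring missing (None) values."""
    params = []
    for j in range(len(train_rows[0])):
        col = [row[j] for row in train_rows if row[j] is not None]
        mean = sum(col) / len(col) if col else 0.0
        lo, hi = (min(col), max(col)) if col else (0.0, 0.0)
        params.append((mean, lo, hi))
    return params


def transform(rows, params):
    """Fill missing values with the column mean, then scale to [-1, 1]."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (mean, lo, hi) in zip(row, params):
            value = mean if value is None else value
            # A constant column carries no information; map it to 0.
            new_row.append(0.0 if hi == lo else 2 * (value - lo) / (hi - lo) - 1)
        scaled.append(new_row)
    return scaled


# Hypothetical feature vectors with one missing value.
train = [[0.0, 10.0], [None, 20.0], [4.0, 30.0]]
params = fit_scaler(train)
scaled = transform(train, params)
```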
Cross validation
All cross validation experiments used five folds with the following settings: S = 0, K = 2,
D = 3, G = 0.0, R = 0.0, N = 0.5, M = 40.0, C = 1.0, E = 0.0010, P = 0.1, i, B.
Full contig classification
To classify the entire contig set, an SVM model was trained using a balanced subset of
contigs that mapped to Ensembl protein coding and the fRNAdb non-coding sequences
using 0.8 as the threshold cutoff. The SVM model was created using the same settings as
above. The resulting SVM model was applied to the feature set of all contigs using the
settings: p = 0, distribution.
Feature ranking
For feature ranking, the information gain ranking filter InfoGainAttributeEval with setting
x = 10 was used with search method Ranker with settings: T = 1.7976931348623157 × 10^308,
N = 10.
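The idea behind information gain ranking can be sketched for discrete features as below. This is a simplified illustration, not WEKA's implementation: the feature names and values are hypothetical, and numeric features would first need to be discretised before a calculation of this kind applies.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a discrete feature."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Return (feature name, information gain) pairs, best first."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one informative and one uninformative feature.
labels = ["coding", "coding", "non-coding", "non-coding"]
columns = {"uninformative": [1, 0, 1, 0], "has_long_orf": [1, 1, 0, 0]}
ranking = rank_features(columns, labels)
```

A feature that perfectly separates the two classes gains one full bit, while a feature that is independent of the class gains nothing.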
Chapter 6
Experimental results
This chapter describes our experimental results. We first assess the performance of our
Sequence-Structure-Genome Classifier (SSGC) for the application reported in the literature:
binary coding vs. non-coding classification using sequences in annotated databases. We
compare the performance of our classifier and an alternative program, PORTRAIT [3].
We then report the results of our extensions. We extend binary classification to multiclass
classification of different types of non-coding RNA, and show that our classifier is potentially
useful in this setting. We present our findings for RNA-Seq data and de novo transcriptome
assembly, using seven datasets from a range of mouse tissues and developmental stages. We
quantify the expression level of annotated transcripts of a range of Ensembl biotypes using
a reads-to-genome mapping procedure, then determine how many annotated transcripts of
which biotypes map to assembled contigs, using a range of mapping score thresholds. We
note that the relatively low expression level of many types of non-coding RNAs may prevent
them from being efficiently sampled by de novo assembly. Our classifier takes as input a
collection of contigs, as well as the database elements that map to the contigs. From our cross
validation experiments, the performance of our classifier on these inputs is comparable to
its performance on biotype-annotated transcripts in public databases. SSGC is also applied
to the entire, mostly unlabelled, contig set and based on the p-value confidence scores, we
explore the ability to classify contigs and to find potentially novel coding and non-coding
entities. As we used more feature types than previously published binary classifiers, we
conclude the chapter by briefly evaluating our feature sets and examining which features
are important for binary and multiclass classification.
6.1 Coding and non-coding databases
To assess the performance of the classifier, we obtained protein coding mRNA transcript
sequences and non-coding RNA sequences from a number of public databases. We com-
pared the effectiveness of our classifier with the competing method PORTRAIT [3], a high
performing classification method that computes features and uses a similar classification
model using LIBSVM.
In this section we present the results of our classifier applied to various sequence databases.
We then present our classifier results using multiple classes of non-coding RNA types. This
is done by performing all pairwise comparisons of non-coding RNA types and then performing
a multiclass classification.
6.1.1 EMBL and Swissprot vs. non-coding
Preparation
For this dataset, protein coding mRNA sequences were obtained through a long sequence of
steps first described in Arrial et al. [3]. 241,242 protein coding sequences were obtained
from Swissprot [11] release 51.0 (31 October 2006). To reduce the number of similar and
over-represented protein sequences, we used CD-HIT [73] with the cutoff set to 0.7,
resulting in 118,398 entities. The Swissprot sequence IDs were used to obtain mRNA sequences
from EMBL [69] using the EBI DBIfetch tool. To further reduce the number of similar
sequences, BLASTCLUST [26] was run using the arguments p = F, S = 0.5, L = 0.5, W = 18. To
ensure compatibility with PORTRAIT, sequences were restricted to lengths of 80 to
65,535 bp, resulting in a total of 53,834 mRNA sequences.
Non-coding RNAs were obtained from three databases: Rfam [36], RNADB [99], and
NONCODE [46]. Combined, there were 763,842 sequences. BLASTCLUST [26] was applied to
the sequences with the same settings as for the protein coding sequences. Entries
outside the 80 to 65,535 bp range were removed. The total number of non-coding sequences
remaining was 60,849.
Classification
We compare our classifier with PORTRAIT [3], testing performance on the EMBL [69] and
Swissprot [11] datasets as the protein coding set and the combination of Rfam [36],
RNADB [99], and NONCODE [46] as the non-coding set. Table 6.1 summarises the results. For
the EMBL test set, PORTRAIT outperforms our classifier, scoring higher in accuracy,
precision and recall. From this result we conclude that PORTRAIT is a better classifier for
this dataset.
                       SSGC                              PORTRAIT
Size        Accuracy  Precision  Recall      Accuracy  Precision  Recall
1000        91.6      0.92       0.916       95.6      0.952      0.960
10000       93.16     0.93       0.935       96.6      0.960      0.973
50000       93.81     0.94       0.941       96.7      0.963      0.972
Wt. Ave.    93.7      0.93       0.940       96.7      0.962      0.972

Table 6.1: SSGC performance compared with PORTRAIT for the dataset composed of Swiss-prot and EMBL for the protein coding set, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.
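As a worked check of the "Wt. Ave." row, the accuracy figures in Table 6.1 are reproduced by a weighted average if each row is weighted by its Size value. That weighting is an assumption inferred from the numbers rather than something stated in the text, and the function name is hypothetical.

```python
def weighted_average(values, weights):
    """Average of values, each weighted by the matching entry of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Accuracy columns of Table 6.1, weighted by the Size column (assumed weights).
sizes = [1000, 10000, 50000]
ssgc_accuracy = [91.6, 93.16, 93.81]
portrait_accuracy = [95.6, 96.6, 96.7]

ssgc_weighted = weighted_average(ssgc_accuracy, sizes)
portrait_weighted = weighted_average(portrait_accuracy, sizes)
```

Rounded to one decimal place, the two results are 93.7 and 96.7, matching the accuracy entries of the final row.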
6.1.2 Ensembl protein coding vs. non-coding
Preparation
To simulate the full-length mRNAs found in transcriptome studies, we also used mm9 mRNAs
obtained from Ensembl v60 [55]. From the range of biotypes available from Ensembl,
we consider sequences with the biotype protein coding, consisting of 88,186 sequences. In
the same manner as for the EMBL dataset in the previous section, we ran BLASTCLUST [26]
using the same arguments and restricted the sequences to the same size range,
resulting in 46,261 total sequences.
The same non-coding RNA dataset consisting of 60,849 sequences explained in the pre-
vious section was used.
Classification
SSGC was compared with PORTRAIT [3] using Ensembl v60 [55] protein coding transcripts
as the protein coding set, and the same non-coding RNA set as in Section 6.1.1. The results
are summarised in Table 6.2. In this case, SSGC outperforms PORTRAIT in terms of
accuracy, precision and recall. The difference in performance between this dataset and the
last is striking. As the same non-coding set is used, and transcripts are clustered and
size-selected for both, the most likely difference between the inputs is that the EMBL
sequences contain purely the ORF-containing portion of the mRNA, while the Ensembl set
contains the full mRNA sequence including the UTRs. For the purpose of contig
classification in the transcriptome, we expect to see full-length mRNAs that include UTR
sequences, resembling those in the Ensembl dataset.
                       SSGC                              PORTRAIT
Size        Accuracy  Precision  Recall      Accuracy  Precision  Recall
1000        93.4      0.93       0.938       87.3      0.892      0.848
10000       92.28     0.92       0.932       89.0      0.905      0.870
50000       92.92     0.92       0.937       89.3      0.909      0.873
Wt. Ave.    92.8      0.92       0.936       89.2      0.908      0.872

Table 6.2: SSGC performance compared with PORTRAIT for the dataset composed of Ensembl protein coding, and Rfam, RNADB and NONCODE for the non-coding set. Precision and recall are shown for the non-coding class.
6.1.3 Ensembl vs. fRNAdb
Preparation
To test the ability to distinguish a range of different non-coding RNA types, we look to
fRNAdb [63] for mouse mm9 sequences, downloaded March 1st, 2010. fRNAdb has in total
83,826 sequences divided into nine RNA types: fly-smallRNA, mat-miRNA, misc, piRNA,
pre-miRNA, rRNA, snoRNA, snRNA, and tRNA, containing 1664, 651, 31532, 48550, 597,
17, 735, 67, and 18 elements, respectively.
Protein coding sequences are taken from Ensembl v60 [55] with biotype protein coding,
as before. To allow comparison with the smaller non-coding RNAs found in fRNAdb, no
filtering was performed based on similarity or size.
Classification
The previous sections presented our findings for the binary ‘coding vs. non-coding’ class
problem using exclusively de novo features. In this section we extend our methods in two
ways: we compare the performance of the complete feature set (which includes genome based
features and the de novo feature set) with that of the de novo feature set alone, and we
investigate the multiclass problem by including several non-coding RNA types in our
classification. Our investigation is performed using datasets from Ensembl [55] protein
coding and the multiple non-coding RNA types from fRNAdb [63].
We investigate our classifier performance using the entire feature set and the de novo
feature set for the binary class using Ensembl and fRNAdb. Table 6.3 presents the per-
formance of the classification. Using the full feature set results in a slightly better overall
performance.
Table 6.4 presents the results for the pairwise binary classification between Ensembl
protein coding elements and each non-coding RNA type found in fRNAdb, using both all
features and only de novo features. The resulting accuracies are high for each pair of RNA
elements; the misc class has the lowest classification accuracy.
Features    Accuracy   Precision [nc]   Recall [nc]
all         96.3       0.966            0.976
de novo     95.6       0.961            0.97

Table 6.3: Binary classification performance between Ensembl protein coding with all fRNAdb non-coding sequences. The first row represents the experiment where all features are used. The second row represents the experiment where only the de novo features were used.
In addition to the pairwise binary classification between protein coding sequences and
all non-coding RNA types, pairwise binary classification was performed on each pair of
non-coding RNA. Table 6.5 presents the result of our tests using all features, and Table 6.6
presents the tests using strictly de novo features. The number of samples per class varies
and likely causes fluctuations in the precision and recall but overall, the feature sets used
are promising in this binary classification problem.
In addition to the binary pairwise classification experiments, we performed multiclass
classifications between non-coding RNAs both with and without protein coding sequences.
Table 6.7 presents the confusion matrix of the multiclass classification for the nine
non-coding RNA types found in fRNAdb. The higher numbers along the shaded diagonal cells,
the true positives, indicate the potential of our classifier for use on multiple
non-coding RNA types. However, we do observe a skew in predictions towards RNA types that
are heavily represented in fRNAdb. Having small test sets for some RNA elements alongside
very large test sets indicates potential limitations in our current multiclass methodology.
Features   Class 1       Class 2        Elements   Accuracy   Precision [nc]   Recall [nc]
all        prot-coding   fly-smallRNA   49789      99.9       0.980            0.996
           prot-coding   mat-miRNA      48776      100.0      0.986            0.985
           prot-coding   misc           79657      94.7       0.924            0.943
           prot-coding   piRNA          96675      99.7       0.996            0.999
           prot-coding   pre-miRNA      48722      99.7       0.911            0.807
           prot-coding   rRNA           48142      100.0      1.000            0.765
           prot-coding   snoRNA         48860      99.5       0.884            0.761
           prot-coding   snRNA          48192      99.9       0.968            0.448
           prot-coding   tRNA           48143      100.0      1.000            0.889
           Average                      57440      99.3       0.961            0.844
de novo    prot-coding   fly-smallRNA   49789      99.9       0.980            0.996
           prot-coding   mat-miRNA      48776      99.9       0.964            0.991
           prot-coding   misc           79657      93.8       0.915            0.931
           prot-coding   piRNA          96675      99.7       0.996            0.999
           prot-coding   pre-miRNA      48722      99.6       0.862            0.762
           prot-coding   rRNA           48142      100.0      1.000            0.765
           prot-coding   snoRNA         48860      99.4       0.867            0.710
           prot-coding   snRNA          48192      99.9       0.972            0.522
           prot-coding   tRNA           48143      100.0      1.000            0.889
           Average                      57440      99.1       0.951            0.841

Table 6.4: Pairwise classification performance between Ensembl protein coding elements vs. each RNA type found in fRNAdb. The first half represents the results where all features are used. The second half represents the results where only de novo features were used, thereby excluding genome mapped information such as the number of exons and cross-species conservation scores.
Class 1        Class 2      Elements   Accuracy   Precision [nc]   Recall [nc]
fly-smallRNA   mat-miRNA    2315       94.4       0.960            0.963
fly-smallRNA   misc         33196      99.9       0.991            0.996
fly-smallRNA   piRNA        50214      99.2       0.877            0.890
fly-smallRNA   pre-miRNA    2261       100.0      1.000            1.000
fly-smallRNA   rRNA         1681       99.9       0.999            1.000
fly-smallRNA   snoRNA       2399       99.8       0.999            0.999
fly-smallRNA   snRNA        1731       99.9       0.999            1.000
fly-smallRNA   tRNA         1682       100.0      1.000            1.000
mat-miRNA      misc         32183      99.9       0.973            0.983
mat-miRNA      piRNA        49201      99.5       0.815            0.823
mat-miRNA      pre-miRNA    1248       100.0      1.000            1.000
mat-miRNA      rRNA         668        99.9       0.998            1.000
mat-miRNA      snoRNA       1386       99.9       0.997            1.000
mat-miRNA      snRNA        718        99.9       0.998            1.000
mat-miRNA      tRNA         669        100.0      1.000            1.000
misc           piRNA        80082      99.6       0.998            0.993
misc           pre-miRNA    32129      99.4       0.996            0.998
misc           rRNA         31549      100.0      1.000            1.000
misc           snoRNA       32267      99.1       0.994            0.997
misc           snRNA        31599      99.9       0.999            1.000
misc           tRNA         31550      100.0      1.000            1.000
piRNA          pre-miRNA    49147      100.0      1.000            1.000
piRNA          rRNA         48567      100.0      1.000            1.000
piRNA          snoRNA       49285      99.9       1.000            1.000
piRNA          snRNA        48617      100.0      1.000            1.000
piRNA          tRNA         48568      100.0      1.000            1.000
pre-miRNA      rRNA         614        99.5       0.995            1.000
pre-miRNA      snoRNA       1332       95.7       0.958            0.946
pre-miRNA      snRNA        664        98.6       0.990            0.995
pre-miRNA      tRNA         615        99.5       0.995            1.000
rRNA           snoRNA       752        98.9       1.000            0.529
rRNA           snRNA        84         95.2       1.000            0.765
rRNA           tRNA         35         97.1       0.944            1.000
snoRNA         snRNA        802        97.3       0.975            0.996
snoRNA         tRNA         753        99.3       0.996            0.997
snRNA          tRNA         85         97.6       0.985            0.985
Average                     18629      99.1       0.984            0.968

Table 6.5: Pairwise classification performance using the complete feature set for fRNAdb non-coding RNA. Precision and recall are only shown for the second class.
Class 1        Class 2      Elements   Accuracy   Precision [nc]   Recall [nc]
fly-smallRNA   mat-miRNA    2315       93.3       0.953            0.954
fly-smallRNA   misc         33196      99.9       0.991            0.997
fly-smallRNA   piRNA        50214      98.3       0.814            0.635
fly-smallRNA   pre-miRNA    2261       99.9       0.999            1.000
fly-smallRNA   rRNA         1681       99.9       0.999            1.000
fly-smallRNA   snoRNA       2399       99.7       0.998            0.999
fly-smallRNA   snRNA        1731       99.9       0.999            1.000
fly-smallRNA   tRNA         1682       100.0      1.000            1.000
mat-miRNA      misc         32183      99.9       0.976            0.983
mat-miRNA      piRNA        49201      99.4       0.936            0.582
mat-miRNA      pre-miRNA    1248       100.0      1.000            1.000
mat-miRNA      rRNA         668        99.9       0.998            1.000
mat-miRNA      snoRNA       1386       99.6       0.992            1.000
mat-miRNA      snRNA        718        99.9       0.998            1.000
mat-miRNA      tRNA         669        100.0      1.000            1.000
misc           piRNA        80082      99.6       0.998            0.993
misc           pre-miRNA    32129      99.1       0.994            0.997
misc           rRNA         31549      100.0      1.000            1.000
misc           snoRNA       32267      98.7       0.990            0.997
misc           snRNA        31599      99.9       0.999            1.000
misc           tRNA         31550      100.0      1.000            1.000
piRNA          pre-miRNA    49147      99.9       0.999            0.999
piRNA          rRNA         48567      100.0      1.000            1.000
piRNA          snoRNA       49285      99.6       0.997            0.999
piRNA          snRNA        48617      99.9       0.999            1.000
piRNA          tRNA         48568      100.0      1.000            1.000
pre-miRNA      rRNA         614        99.0       0.990            1.000
pre-miRNA      snoRNA       1332       92.7       0.924            0.913
pre-miRNA      snRNA        664        97.1       0.975            0.993
pre-miRNA      tRNA         615        99.7       0.997            1.000
rRNA           snoRNA       752        98.9       1.000            0.529
rRNA           snRNA        84         98.8       1.000            0.941
rRNA           tRNA         35         97.1       0.944            1.000
snoRNA         snRNA        802        96.5       0.968            0.995
snoRNA         tRNA         753        99.3       0.996            0.997
snRNA          tRNA         85         97.6       0.985            0.985
Average                     18629      99.0       0.984            0.958

Table 6.6: Pairwise classification performance using de novo feature set for fRNAdb non-coding RNA, similar to Table 6.5. Precision and recall are only shown for the second class.
Despite this, the results suggest that our method is a good initial step in classifying among
different non-coding RNA sets. This limitation is a possible subject for further study.
                 Classified as (prediction)                           Class
a      b      c       d       e      f      g      h      i           (actual)
959    46     0       659     0      0      0      0      0           a
66     320    0       265     0      0      0      0      0           b
2      1      31206   233     18     0      72     0      0           c
207    73     63      48204   0      0      3      0      0           d
0      0      109     6       460    0      22     0      0           e
0      0      11      3       0      1      2      0      0           f
0      0      226     58      30     0      419    2      0           g
0      0      30      10      1      0      11     15     0           h
0      0      2       3       1      0      0      0      12          i

Table 6.7: Confusion matrix for the multiclass classification using fRNAdb RNA types, using the entire feature set. The cells represent the number of predictions for each type, the shaded cells represent the number of true positives. Each RNA type is labelled from a to i, representing in order: fly-smallRNA, mat-miRNA, misc, piRNA, pre-miRNA, rRNA, snoRNA, snRNA and tRNA.
6.2 The RNA-Seq dataset
Classification was performed on data derived from transcriptome sequencing experiments,
using contig sets created using the Trans-ABySS [110] pipeline.
In our analysis, we first examine how well coding and non-coding RNA transcripts are
represented by RNA-Seq reads. This is done using two methods: a genome
mapping procedure that measures read coverage on annotated locations of Ensembl and
fRNAdb elements, then a direct mapping from assembled contig to annotation using a
range of mapping thresholds. Our results ultimately show that there are non-coding RNAs
represented as contigs, but that there are too few non-coding RNA types represented to
support multiclass classification. We continue our investigation on contig classification using
the binary ‘protein coding vs. non-coding’ classes.
6.2.1 Contig preparation
Contig sets were generated from six RNA-Seq libraries: MM0490, MM0564, MM0566, MM0570,
MM0571, and MM0581. Each library consists of 50 bp paired-end reads of poly(A)+ RNA as
described in Robertson et al. [110]. These six libraries represent various developmental
stages and tissue types of C57BL/6J mouse. Table 6.8 lists the libraries along with their
tissue of origin, age, and the number of reads sequenced.
Library   Tissue                            Age     Reads
MM0490    Liver                             E14.5   157M
MM0564    Heart-Atrioventricular-Cushions   E12.5   229M
MM0566    Heart-Atrioventricular-Cushions   E11.5   257M
MM0570    Dorsal Aorta                      E11.5   217M
MM0571    U and V Aorta                     E14.5   235M
MM0581    Endoderm-Definitive               E8.5    250M

Table 6.8: Six seven-lane RNA-Seq mouse libraries were examined.
6.2.2 Transcriptome reads mapped to the genome
We map the transcriptome reads to the mouse mm9 genome and calculate the read coverage
using the coordinates of each annotated element. This is done by mapping each read using
BWA [70] and SAMtools [71] to a modified mouse genome, one that contains pre-spliced
junctions between possible exon pairs as described in Morin et al. [90]. For this study,
these steps are taken for the Ensembl [55] v60 annotation for the mouse. Exon-exon junc-
tion coordinates are defined from Ensembl [55], Refseq [103] and UCSC known gene [53]
annotations.
The transcriptome reads are mapped to the genome and the coverage is calculated for
each annotation in Ensembl v60. Figures 6.1 and 6.2 show the breakdown of read coverage
for a set of non-coding RNA-related biotype annotations using MM0564 reads. Protein
coding annotations are well expressed, as expected, but the non-coding annotations have
varying amounts of coverage. Assembling transcripts de novo from an RNA-Seq experiment
requires higher read coverage than reference based methods [110]. From this mapping
experiment alone it is unclear what fraction of different non-coding biotypes will be available
as assembled contigs.
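The empirical cumulative distribution function (ECDF) used to summarise read coverage in Figures 6.1 and 6.2 can be sketched as follows. The coverage values are made up for the illustration, and the function name is hypothetical; the figures themselves were not produced with this code.

```python
from bisect import bisect_right


def make_ecdf(values):
    """Return Fn, where Fn(x) is the fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def fn(x):
        # bisect_right counts how many observations are <= x.
        return bisect_right(ordered, x) / n

    return fn


# Hypothetical read coverage values for four annotated transcripts.
coverage_ecdf = make_ecdf([0.5, 2.0, 2.0, 100.0])
fraction_at_most_2 = coverage_ecdf(2.0)
```

Read this way, a biotype whose curve rises early has most of its annotations at low coverage, which is the pattern that makes de novo assembly of those transcripts less likely.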
1e−03 1e+01 1e+05
0.0
0.4
0.8
protein_coding
x
1e−03 1e+01 1e+05
0.0
0.4
0.8
lincRNA
x
Fn(x
)
1e−03 1e+01 1e+05
0.0
0.4
0.8
miRNA
xFn
(x)
1e−03 1e+01 1e+05
0.0
0.4
0.8
misc_RNA
x
Fn(x
)
1e−03 1e+01 1e+05
0.0
0.4
0.8
pseudogene
x
1e−03 1e+01 1e+05
0.0
0.4
0.8
rRNA
x
Fn(x
)
1e−03 1e+01 1e+05
0.0
0.4
0.8
snoRNA
x
Fn(x
)
1e−03 1e+01 1e+05
0.0
0.4
0.8
snRNA
xFn
(x)
protein_coding
−2 0 2 4
010
0025
00
lincRNA
−2 0 2 4
020
40
miRNA
−2 0 2 4
020
6010
0
misc_RNA
−2 0 2 4
05
1525
pseudogene
−2 0 2 4
020
040
0
rRNA
−2 0 2 4
05
1015
snoRNA
−2 0 2 4
020
6010
0
snRNA
−2 0 2 4
020
60
Distribution of transcript coverages for library MM0564
Figure 6.1: Read coverage for Ensembl broken down to biotypes, for RNA-Seq reads fromlibrary MM0564. Each biotype is represented as an ECDF and as a distribution of log10
read coverage.
[Figure 6.2 plot: "ECDF of Ensembl v60 transcript read coverage for RNA-Seq library MM0564". x-axis: read coverage (log scale, 1e-03 to 1e+05); y-axis: cumulative fraction (0.0 to 1.0); one curve per biotype: protein_coding, lincRNA, miRNA, misc_RNA, pseudogene, rRNA, snoRNA, snRNA.]
Figure 6.2: Empirical cumulative distribution function representing the read coverage for a select number of Ensembl biotypes mapped to the mm9 reference genome from Figure 6.1.
6.2.3 Contig assembly and merging
Each RNA-Seq library was assembled and merged using Trans-ABySS [110], assembling the
reads for every even k-mer length from 26 to 50 and producing a set of contigs for each library.
One issue with de Bruijn graph-based assemblers is that, depending on the coverage and
the k-mer length k, the assembly can contain very fragmented and overlapping contigs. We
therefore processed the contig sets using the contig merging method [110]. To prevent the potential
exclusion of non-coding RNAs from the merged dataset, we examined merged contig sets with
filtering turned both on and off. The resulting contig sets are summarised in Table 6.9.
Filter  Library  Reads        Number of contigs  Min size  Max size  Ave. size  Med. size  N50
yes     MM0490   157,441,166  5,701,316          25        71,739    86.4       46         91
        MM0564   229,499,055  2,450,369          25        58,854    228.4      59         1,249
        MM0566   257,298,896  2,742,649          25        63,266    210.3      57         1,121
        MM0570   217,279,470  3,806,318          25        21,614    127.5      55         318
        MM0571   235,143,912  2,402,290          25        21,519    173.5      61         636
        MM0581   249,969,333  4,090,155          25        54,048    212.3      54         967
no      MM0490   157,441,166  36,277,159         26        71,739    53.7       35         44
        MM0564   229,499,055  20,198,978         26        63,440    75.2       35         163
        MM0566   257,298,896  23,935,262         26        63,266    71.7       36         94
        MM0570   217,279,470  37,449,860         26        21,614    51.8       37         45
        MM0571   235,143,912  32,104,934         26        21,646    52.8       38         45
        MM0581   249,969,333  29,613,817         26        56,733    91.7       37         407
Table 6.9: Six seven-lane RNA-Seq libraries were assembled and merged to create the contig sets. These contigs were used as input for the classifier.
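Table 6.9 reports an N50 value for each contig set. As a reminder of what this statistic measures, the sketch below computes N50 under the usual definition (the contig length at which the running total, taken from the longest contig downwards, first reaches half of all assembled bases). The function name `n50` and the example lengths are hypothetical, and this is not the code used to produce the table.

```python
def n50(lengths):
    """Contig length at which the longest contigs first cover half of all bases."""
    if not lengths:
        raise ValueError("n50 needs at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths: 300 bases in total, so the half-way point is 150.
example_lengths = [100, 80, 50, 30, 20, 20]
example_n50 = n50(example_lengths)
```

Because N50 is weighted by bases rather than by contig count, a set dominated by many very short contigs, as in the unfiltered sets, can have a low N50 even when its longest contigs are long.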
6.2.4 Contig to annotation mapping
The unfiltered contig set from each library was mapped to known protein coding mRNAs and
non-coding RNAs found in the databases Ensembl and fRNAdb using a range of thresholds
from 0.7 to 1.0. Figure 6.3 shows the number of contigs that map to annotated protein
coding and non-coding elements at different thresholds, for filtered and unfiltered
contigs respectively. From this figure we make a number of observations. First, fewer
fRNAdb non-coding elements are mapped than Ensembl elements, but their number is
still in the order of hundreds and is likely sufficient for classification experiments. Second,
comparison between filtered and unfiltered contigs shows that filtering appears to affect
non-coding RNA sequences in fRNAdb but not Ensembl sequences. Third, as the mapping
threshold increases, the number of annotated contigs drops quite uniformly for both coding
and non-coding transcripts; it is therefore not obvious whether a single threshold value is
practical for all our mapping, and this is a possible topic of future work.
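The threshold sweep behind Figure 6.3 can be sketched as follows. This is only an illustration under a stated assumption: each contig is taken to have a single best alignment score in [0, 1], and a contig counts as mapped when that score meets the threshold. The actual mapping criteria are those defined earlier in the thesis; the names `count_mapped` and `best_scores` and the scores themselves are hypothetical.

```python
def count_mapped(best_scores, thresholds):
    """For each threshold, count contigs whose best alignment score meets it."""
    return {t: sum(1 for s in best_scores if s >= t) for t in thresholds}


# Thresholds 0.70, 0.72, ..., 1.00, rounded to avoid floating-point drift.
thresholds = [round(0.70 + 0.02 * i, 2) for i in range(16)]

# Hypothetical best alignment scores for five contigs.
best_scores = [0.71, 0.95, 1.0, 0.69, 0.88]

mapped_counts = count_mapped(best_scores, thresholds)
```

Counts produced this way can only stay the same or fall as the threshold rises, which is the trade-off between label quality and sample size discussed in the text.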
We further investigate both coding and non-coding annotation sets by breaking down
individual biotypes (Figure 6.4) and non-coding RNA families (Figure 6.5). From these two
figures, it is evident that not all types are represented in this mapping, indicating that either
their transcripts are not mapped well with the contig set, or are not present at high enough
levels in the RNA-Seq library, given the protocol and sequencing depth. From Figure 6.5,
for thresholds between 0.7 and 1.0, there are not enough individual RNA types found in the
fRNAdb dataset to perform pairwise or multiclass RNA classification as was done in section
6.1.3. For classification using contig sets, we focus on the binary coding vs. non-coding
classification problem.
6.2.5 Contig cross validation
Feature values were computed from contig sequences in the same manner as for the database
annotated sequences in earlier sections. We performed the mapping, feature extraction
and classification on all six transcriptome libraries. All resulted in similar findings and
performance; in the interest of space and to avoid repetition, we do not include
all the results in this thesis.
Contigs were mapped to protein coding or non-coding sequences using the mapping
criteria in section 6.2.4, resulting in sets of contigs labelled as Ensembl protein coding
RNAs and fRNAdb non-coding RNAs, for a range of mapping thresholds from 0.7 to 1.0.
We performed binary classification between the labelled contigs. The top half of Table 6.10
summarises the classification performance for labelled contigs derived from library MM0564.
We also performed the same classification using the original annotation sequence that
each contig represented, presented as ‘DB elements’ in the lower half of the table. For these
experiments, the total accuracies are quite consistent for both sets. However, the precision and
recall of the non-coding sequences are comparatively low. This is most likely caused by the difference in
sample size, as there are more coding contigs than non-coding contigs.
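To see how total accuracy can stay high while precision and recall for the smaller class lag behind, consider the sketch below. The confusion counts are hypothetical (they are not taken from Table 6.10) and `binary_metrics` is an illustrative helper, with the non-coding class treated as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall, with the positive class as the class of interest."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical unbalanced test set: 1,000 non-coding and 19,000 coding elements.
# tp/fn count non-coding elements; fp/tn count coding elements.
accuracy, precision, recall = binary_metrics(tp=750, fp=120, fn=250, tn=18880)
```

In this example a quarter of the non-coding elements are missed, yet the accuracy remains above 98% because the large coding class dominates the total.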
To avoid the effect on performance due to differences in sample sizes between the two
classes, a stratified test set is made so that each class is equal in size. Table 6.11 shows
[Figure 6.3 plots: four panels, (a) "Ensembl / filtered mapped", (b) "Ensembl / unfiltered mapped", (c) "fRNAdb / filtered mapped", (d) "fRNAdb / unfiltered mapped". x-axis: BLAT alignment thresholds (0.70 to 1.00); y-axis: number of annotations; one series per library: MM0490, MM0564, MM0566, MM0570, MM0571, MM0581.]
Figure 6.3: Number of unique contigs that map to the sequence annotation databases fRNAdb and Ensembl using a range of mapping thresholds for all six mouse libraries. (a) and (c) represent the filtered contig set mappings, (b) and (d) represent the unfiltered contig set mappings.
Source       Threshold  Elements  Accuracy  Precision  Recall
                                  [nc]      [nc]       [nc]
Contigs      0.70       20981     95.4      0.857      0.754
             0.72       19984     95.6      0.855      0.752
             0.74       19047     95.9      0.865      0.758
             0.76       18102     96.1      0.869      0.751
             0.78       17160     96.5      0.874      0.759
             0.80       16299     96.7      0.881      0.760
             0.82       15322     96.8      0.883      0.737
             0.84       14418     96.9      0.885      0.732
             0.86       13442     97.0      0.886      0.715
             0.88       12371     96.8      0.872      0.662
             0.90       11126     97.0      0.865      0.646
             0.92       9765      97.3      0.855      0.669
             0.94       8175      97.8      0.852      0.707
             0.96       6243      98.1      0.875      0.757
             0.98       3638      98.7      0.879      0.836
             1.00       77        97.4      0.974      1.00
             Average    12884     96.9      0.877      0.750
DB elements  0.70       21087     96.3      0.868      0.840
             0.72       20084     96.4      0.867      0.835
             0.74       19139     96.6      0.869      0.838
             0.76       18191     96.8      0.871      0.832
             0.78       17243     97.0      0.867      0.837
             0.80       16381     97.0      0.866      0.831
             0.82       15400     97.1      0.870      0.815
             0.84       14491     97.3      0.868      0.814
             0.86       13504     97.4      0.865      0.810
             0.88       12427     97.5      0.857      0.801
             0.90       11176     97.8      0.865      0.811
             0.92       9812      98.0      0.868      0.813
             0.94       8210      98.1      0.861      0.797
             0.96       6277      98.1      0.878      0.778
             0.98       3655      98.1      0.891      0.687
             1.00       91        100.0     1.00       1.00
             Average    12948     97.5      0.877      0.821
Table 6.10: Classification performance using the contigs from the library MM0564, using the full feature set. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to.
[Figure 6.4 plots: two panels, (a) "Ensembl filtered MM0564 contigs" and (b) "Ensembl unfiltered MM0564 contigs". x-axis: BLAT alignment thresholds (0.70 to 1.00); y-axis: number of annotations (log scale, 1 to 10000); one series per biotype: snRNA, snoRNA, rRNA, pseudogene, misc_RNA, miRNA, lincRNA, protein_coding.]
Figure 6.4: Ensembl transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual biotypes.
the performance of the classifier on this stratified set for the same contigs. In comparison
to Table 6.10 it is evident that the accuracy decreases slightly, but, at the same time, the
precision and recall rise to comparable levels with the accuracy.
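One way to build a test set with equal class sizes, which the text calls a stratified set, is to randomly downsample the larger class to the size of the smaller one. The sketch below assumes exactly that; the thesis does not state how the sampling was done, and the names `balanced_sample` and `labelled` as well as the data are hypothetical.

```python
import random


def balanced_sample(items_by_class, seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    smallest = min(len(items) for items in items_by_class.values())
    return {
        label: rng.sample(items, smallest)
        for label, items in items_by_class.items()
    }


# Hypothetical labelled contig identifiers: 100 coding, 10 non-coding.
labelled = {
    "coding": [f"contig_c{i}" for i in range(100)],
    "non-coding": [f"contig_n{i}" for i in range(10)],
}

balanced = balanced_sample(labelled)
```

With equal class sizes, a classifier can no longer score well simply by favouring the majority class, so accuracy, precision and recall become directly comparable.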
The reason for the difference in classification performance across threshold values
is not immediately clear: the trend could result from the rising threshold
values or simply from the decrease in the number of elements tested. However, we note that
accuracy increases for the contigs as the threshold increases, while the accuracy for the database
elements does not change to the same degree. This suggests that the number of elements in the test
set is not responsible for the difference in performance. The only difference between the two
sets is the quality of the sequences, determined by the threshold values. Comparing the
performance between the contigs and the database elements shows that they converge to
approximately the same value as the threshold approaches 1.0 (both to about 96% in Table 6.11). Lower thresholds
produce lower classification results. This suggests that higher thresholds force the mapped
contigs to resemble real coding and non-coding sequences, improving the performance of the
classifier. But at the same time, as the threshold increases there are fewer elements to train
[Figure 6.5 plots: two panels, (a) "fRNAdb filtered MM0564 contigs" and (b) "fRNAdb unfiltered MM0564 contigs". x-axis: BLAT alignment thresholds (0.70 to 1.00); y-axis: number of annotations (log scale, 1 to 500); one series per RNA type: flysmallRNA, matmiRNA, misc, piRNA, premiRNA, rRNA, snoRNA, snRNA, tRNA.]
Figure 6.5: fRNAdb transcripts mapped by filtered (a) and unfiltered (b) MM0564 contigs, broken down into individual RNA types.
and test the classifier. These observations again show the difficulty in choosing
a suitable value or a set of values for the threshold. This is a major issue that must be
considered in order to perform the classification for raw contig sequences.
PORTRAIT was also used on the contig sets and the database annotations in the same
way that our classifier was used. Feature computation was not possible for the contig sets
due to software errors. However, we were able to extract the features from the database
elements mapped to the contigs. The results on the database elements for SSGC and
PORTRAIT are compared in Table 6.12. The accuracy is comparable for both methods on
the unbalanced set but quite different on the stratified set, where the numbers of protein coding and
non-coding elements were equal. This again illustrates the effect of unbalanced class sizes
in our dataset.
6.2.6 Full contig set classification
The cross-validation experiments in the previous sections were applied to labelled data sets.
From the tens of millions of contigs produced in the assembly, only tens of thousands were
Source           Threshold  Elements  Accuracy  Precision  Recall
                                      [nc]      [nc]       [nc]
Contigs - Strat  0.70       5226      93.4      0.933      0.935
                 0.72       4738      92.8      0.927      0.929
                 0.74       4308      94.1      0.942      0.939
                 0.76       3870      93.3      0.933      0.933
                 0.78       3462      93.8      0.936      0.940
                 0.80       3124      94.2      0.943      0.940
                 0.82       2734      94.0      0.936      0.944
                 0.84       2438      94.4      0.944      0.943
                 0.86       2120      94.7      0.950      0.943
                 0.88       1814      94.2      0.940      0.945
                 0.90       1484      94.6      0.947      0.945
                 0.92       1196      95.3      0.956      0.950
                 0.94       860       93.6      0.937      0.935
                 0.96       668       94.6      0.954      0.937
                 0.98       330       96.1      0.987      0.933
                 1.00       4         -         -          -
                 Average    2558      94.2      0.944      0.939
DB - Strat       0.70       5398      95.3      0.946      0.960
                 0.72       4900      95.1      0.943      0.960
                 0.74       4462      95.6      0.949      0.963
                 0.76       4018      95.7      0.951      0.964
                 0.78       3602      95.3      0.946      0.960
                 0.80       3262      95.2      0.945      0.961
                 0.82       2864      95.6      0.948      0.964
                 0.84       2558      96.2      0.955      0.971
                 0.86       2222      96.4      0.959      0.970
                 0.88       1904      95.9      0.950      0.967
                 0.90       1566      95.9      0.957      0.962
                 0.92       1274      95.5      0.956      0.954
                 0.94       916       95.2      0.956      0.948
                 0.96       722       96.1      0.961      0.961
                 0.98       358       96.4      0.956      0.972
                 1.00       4         -         -          -
                 Average    2502      95.7      0.952      0.962
Table 6.11: Classification performance for the stratified contigs from library MM0564, using the full feature set. In contrast to Table 6.10, the number of elements in each class is equal. The contig sets are mapped to protein coding sequences from Ensembl, and non-coding RNA sets from fRNAdb using a series of mapping thresholds. The top half of the table represents the classification results using features extracted from the contig sequences. The lower half represents the classification results using the features extracted from the original sequence from either Ensembl or fRNAdb that each contig mapped to. Note that for thresholds at 1.0, there are not enough elements to perform classification.
                                 SSGC                   PORTRAIT
Type   Threshold  Elements       Acc.   Prec   Recall   Acc.   Prec   Recall
All    0.70       21087/20669    96.3   0.868  0.840    96.2   0.969  0.989
       0.72       20084/19689    96.4   0.867  0.835    96.4   0.970  0.990
       0.74       19139/18765    96.6   0.869  0.838    96.5   0.971  0.990
       0.76       18191/17827    96.8   0.871  0.832    96.6   0.973  0.991
       0.78       17243/16901    97.0   0.867  0.837    96.8   0.974  0.992
       0.80       16381/16062    97.0   0.866  0.831    97.1   0.976  0.993
       0.82       15400/15111    97.1   0.870  0.815    97.1   0.975  0.994
       0.84       14491/14224    97.3   0.868  0.814    97.2   0.975  0.995
       0.86       13504/13259    97.4   0.865  0.810    97.3   0.976  0.996
       0.88       12427/12197    97.5   0.857  0.801    97.3   0.976  0.996
       0.90       11176/10967    97.8   0.865  0.811    97.6   0.978  0.997
       0.92       9812/9621      98.0   0.868  0.813    97.5   0.977  0.997
       0.94       8210/8060      98.1   0.861  0.797    97.9   0.980  0.998
       0.96       6277/6124      98.1   0.878  0.778    98.2   0.983  0.999
       0.98       3655/3570      98.1   0.891  0.687    98.6   0.987  0.999
       1.00       91/6           100.0  1.000  1.000    100.0  1.000  1.000
       Average                   97.5   0.877  0.821    97.4   0.978  0.995
Strat  0.70       5398/4584      95.3   0.946  0.960    91.4   0.908  0.920
       0.72       4900/4132      95.1   0.943  0.960    91.2   0.902  0.925
       0.74       4462/3728      95.6   0.949  0.963    91.3   0.908  0.918
       0.76       4018/3304      95.7   0.951  0.964    91.3   0.909  0.918
       0.78       3602/2926      95.3   0.946  0.960    91.0   0.903  0.919
       0.80       3262/2630      95.2   0.945  0.961    91.1   0.905  0.919
       0.82       2864/2292      95.6   0.948  0.964    91.0   0.900  0.921
       0.84       2558/2030      96.2   0.955  0.971    91.1   0.901  0.924
       0.86       2222/1738      96.4   0.959  0.970    89.8   0.895  0.902
       0.88       1904/1450      95.9   0.950  0.967    89.4   0.886  0.905
       0.90       1566/1154      95.9   0.957  0.962    89.2   0.879  0.908
       0.92       1274/894       95.5   0.956  0.954    87.8   0.867  0.893
       0.94       916/616        95.2   0.956  0.948    88.6   0.881  0.893
       0.96       722/416        96.1   0.961  0.961    89.4   0.890  0.899
       0.98       358/188        96.4   0.956  0.972    93.1   0.926  0.936
       1.00       4/4
       Average                   95.7   0.952  0.962    90.5   0.897  0.913
Table 6.12: Classification performance for the database sequences mapped by the unfiltered contig sets from MM0564; each classification is compared with PORTRAIT. The precision and recall are shown only for the non-coding class. We were not able to compare the classification accuracies for the actual contig sets themselves. Note the number of elements is lower for PORTRAIT due to the size restrictions for their input.
used in the cross validation experiments. In this section, we investigate the use of SSGC
applied to the full contig set. From the unannotated contig sequences, we attempt to use
the classifier predictions to find potential novel non-coding and protein coding transcripts
in the data.
We created an SVM model from 3124 annotated contig sequences that represent both
classes, in equal proportions, from the mouse library MM0564 using 0.8 as the mapping
threshold. The SVM model was applied to the entire contig set to obtain a class prediction
and a confidence value, the p-value (Figure 4.3).
Each contig is assigned a p-value in [0,1], where a contig with a value below 0.5 is classified as
protein coding and one with a value above 0.5 is classified as non-coding. Figure 6.6 shows the
distribution of contig predictions as well as the p-values. The p-values are skewed towards
very high (non-coding) values, suggesting that the vast majority of the
assembled contigs are strongly non-coding. Figure 6.7 shows the mapping threshold
scores and sizes of contigs that are at either extreme of the p-value distribution, and therefore
likely non-coding or protein coding. We looked closely at possible novel non-coding and
protein coding contigs by examining sequences with p-values above 0.95 or below 0.05
that do not map to any known mm9 mouse fRNAdb and Ensembl protein coding sequences
using a BLAT alignment.
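The selection of candidate novel transcripts described above can be sketched as a simple filter. This is an illustration, not the code used in this work: the `Contig` record and the function names are hypothetical, the cut-offs are treated as inclusive (following the caption of Figure 6.7), and a p-value of exactly 0.5, which the text leaves unspecified, is arbitrarily assigned to the protein coding class. Two of the example contigs use the identifiers and p-values reported in this section; the other two are made up.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    p_noncoding: float         # classifier output in [0, 1]
    has_known_alignment: bool  # True if BLAT matched a known sequence


def predicted_class(p, cutoff=0.5):
    """Above the cutoff is non-coding; ties go to protein coding (an assumption)."""
    return "non-coding" if p > cutoff else "protein coding"


def novel_candidates(contigs, low=0.05, high=0.95):
    """Names of unmapped contigs at either extreme of the p-value range."""
    coding, noncoding = [], []
    for contig in contigs:
        if contig.has_known_alignment:
            continue  # already annotated, so not a novel candidate
        if contig.p_noncoding <= low:
            coding.append(contig.name)
        elif contig.p_noncoding >= high:
            noncoding.append(contig.name)
    return coding, noncoding


contigs = [
    Contig("k50:177614", 1.0, False),     # discussed with Figure 6.8
    Contig("k29:3267973", 0.0, False),    # discussed with Figure 6.9
    Contig("hypothetical:1", 0.98, True),   # made-up, already annotated
    Contig("hypothetical:2", 0.60, False),  # made-up, not extreme enough
]

novel_coding, novel_noncoding = novel_candidates(contigs)
```

Keeping the class decision (`predicted_class`) separate from the candidate filter makes it easy to tighten or relax the 0.05 and 0.95 cut-offs without changing how contigs are labelled.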
Our analysis of potential non-coding contigs shows that many are found in intronic and
UTR regions of known genes. Figure 6.8 shows one such contig, k50:177614, in the
UCSC Genome Browser [62]; it has a p-value of 1.0 and no BLAT alignments with any
sequences in fRNAdb and Ensembl protein coding. It is likely that this sequence is located
within a novel polyadenylation tail of the gene Fstl4. Although there is no evidence of the
sequence being functional, its location in the 3′ tail suggests that the classifier was correct
in classifying the contig as non-coding.
Figure 6.9 represents the alignment of contig k29:3267973 to the mm9 mouse genome.
The aligned RNA-Seq reads show pileups that resemble a spliced gene. In addition, the
exonic regions are highly conserved across some species. Figure 6.10 shows the contig with
the mouse sequence coordinate lifted from the mouse mm9 genome to the human hg18
genome using the UCSC LiftOver tool [62]. From the viewer, it is evident that one of the
exons is aligned to the AceView Gene Model glertee.aApr07. This suggests that the classifier
was correct in classifying the contig as protein coding.
Our analysis shows that many potential novel protein coding contigs are aligned to
[Figure 6.6 plots: four panels. (a) "MM0564 contig predictions": number of contigs predicted as protein coding and as non-coding. (b) Histogram of non-coding RNA p-value (0.0 to 1.0) against number of contigs. (c) "Contigs with no alignments": non-coding RNA p-value against number of contigs. (d) "Contigs ≥ 500bp": non-coding RNA p-value against number of contigs.]
Figure 6.6: The full MM0564 contig set is classified by the SVM model, and each contig is assigned a probability. Contigs with p-values below 0.5 are classified as protein coding, while contigs with p-values above 0.5 are classified as non-coding. (a) is the class prediction for all contigs. (b) is the p-value distribution of all the contigs. (c) is the p-value distribution of contigs with no alignments to any known non-coding transcripts. (d) is the p-value distribution for all contigs 500bp and larger.
[Figure 6.7 plots: four panels. (a) "Contigs / p-value ≤ 0.05": protein coding mapping scores (0.0 to 2.0) against number of contigs (log). (b) "Contigs / p-value ≥ 0.95": non-coding RNA mapping scores (0.0 to 2.0) against number of contigs (log). (c) "p-value ≤ 0.05 and no mapping": contig size (bp) against number of contigs (log). (d) "p-value ≥ 0.95 and no mapping": contig size (bp) against number of contigs (log).]
Figure 6.7: Mapping scores and sizes of contigs strongly predicted as protein coding (p-value ≤ 0.05) and non-coding (p-value ≥ 0.95). a,b) Distribution of mapping scores with the best-aligned a) protein-coding Ensembl sequence, b) non-coding fRNAdb sequence. c,d) Distribution of contig sizes (white). In (c), the red regions represent strongly protein coding (p-value ≤ 0.05) which do not map to any known sequences in Ensembl or fRNAdb. In (d), the orange regions represent strongly non-coding (p-value ≥ 0.95) which do not map to any known sequences.
[Figure 6.8 image: UCSC Genome Browser view of mouse chr11 (axis ticks from 52995000 to 53015000). Tracks: MM0564 contig BLAT alignments (including k50:177614), RNA-Seq read pileups for the six libraries (MM0490, MM0564, MM0566, MM0570, MM0571, MM0581), gene and mRNA/EST annotation tracks (including Fstl4), mammalian conservation, SNPs and RepeatMasker.]
Figure 6.8: Contig k50:177614 aligned in the mouse mm9 genome. The top track represents the multiple contigs that are mapped to this location. The second set of tracks are the pileups for the RNA-Seq read alignments for the six mouse transcriptome libraries. Below the contig track is the gene track and the conservation track. This contig has a p-value of 1.0 and does not map to any known non-coding or protein coding sequences.
transcripts that are similar to previously known protein coding sequences, which are not yet
labelled as protein coding in the Ensembl database.
From these two simple examples, we demonstrate the ability of SSGC to detect potential
novel coding and non-coding contigs from the full contig set. From manual inspection,
sequences at either extreme of the p-value distribution do resemble real non-coding and
protein coding elements. However, SSGC’s ability as a gene finder, especially for novel
sequences, is potentially useful but currently limited. For practical use, it would be
desirable to be able to distinguish a real transcript from an assembly artifact, and to
distinguish functional from non-functional non-coding RNAs.
6.3 Feature ranking
We also investigate the effectiveness of the features used in the classification experiments
by ranking features for different conditions. Table 6.13 shows the top twenty ranked features
for the classification experiments between Ensembl protein coding and fRNAdb non-coding
sequences as in section 6.1.3.
The first two columns represent the ranked features used in the binary classification
between coding and non-coding. ORF-related features are prevalent in the list, which is
understandable as non-coding sequences are not expected to have ORF sequences. We also
see the importance of the trigrams TAG and TAA in the first four columns. These are two
of the three stop codons that terminate an ORF. We can also observe that a number of features
not available in the de novo set are important for this binary classification. This is again
understandable, as we would expect the number of exons to be important in identifying
non-coding RNAs. Conservation is also represented, further supporting the notion that protein
coding sequences are much better conserved than non-coding sequences in the genome.
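To make the ORF-related and trigram features more concrete, the sketch below shows one simple reading of them. It is an assumption for illustration only and is not the feature definition used by SSGC: an ORF is taken to run from an ATG to the first in-frame stop codon on the forward strand, ‘ORF proportion’ is taken as the longest such ORF divided by the sequence length, and a trigram feature is taken as the frequency of that trigram over all overlapping windows. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in bases of the longest ATG-to-stop ORF over the three forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def orf_proportion(seq):
    """Longest ORF length as a fraction of the sequence length."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0


def kmer_frequency(seq, kmer):
    """Fraction of overlapping windows of len(kmer) that equal kmer."""
    seq, kmer = seq.upper(), kmer.upper()
    windows = len(seq) - len(kmer) + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if seq[i:i + len(kmer)] == kmer)
    return hits / windows


# Toy sequence with one ORF (ATG AAA TTT TAG) starting at the third base.
toy = "CCATGAAATTTTAGGG"
toy_orf_proportion = orf_proportion(toy)
```

Under this reading, a sequence with frequent TAG or TAA trigrams tends to have short ORFs in every frame, which is consistent with those trigrams ranking alongside the ORF features.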
The multiclass experiments are shown in the middle and the last pair of columns. We
observe that once protein coding sequences are removed from the classifier (last two columns),
new features emerge in the list, notably for length and secondary structure. The length is
a key feature used to distinguish some of the smaller non-coding RNAs from the larger
ones. The secondary structure based feature ‘Total energy’ likely plays a larger role as
some RNA types are known to have very distinct conformations.
We also examine classification performance using subsets of the top-ranked features,
selected with the information gain ranking filter. Table 6.14 shows the performance for the
[Figure 6.9 image: UCSC Genome Browser view of mouse chr7 (axis ticks from 20045000 to 20070000). Tracks: MM0564 contig BLAT alignments (including k29:3267973), RNA-Seq read pileups for the six libraries (MM0490, MM0564, MM0566, MM0570, MM0571, MM0581), gene and mRNA/EST annotation tracks (including Mark4), mammalian conservation, SNPs and RepeatMasker.]
Figure 6.9: Contig k29:3267973 aligned in the mouse mm9 genome. Similar to Figure 6.8, the tracks represent the assembled contigs, RNA-Seq read pileups, the contig, known gene annotations, and conservation. This contig has a p-value of 0 and does not map to any known non-coding or protein coding sequences.
[Figure 6.10 image: UCSC Genome Browser view of human chr19 (axis ticks from 50425000 to 50450000). Tracks: mm9 LiftOver coordinate and BLAT alignment of k29:3267973, gene annotation tracks (including EXOC3L2 and MARK4), AceView gene models (including glertee.aApr07), ENCODE histone mark, DNaseI hypersensitivity and transcription factor ChIP-seq tracks, mammalian conservation, SNPs and RepeatMasker.]
Figure 6.10: Contig k29:3267973 (from Figure 6.9) represented in the human hg18 genome, using the LiftOver tool from the UCSC Genome Browser [62]. The tracks represent the contig coordinate (from the LiftOver), the contig BLAT alignment, known human gene models, histone modification tracks, and the conservation.
      Coding vs. non-coding                 Multiclass (Prot + RNA)                          Multiclass (RNA)
Rank  All features         de novo          All features                   de novo          All features            de novo
1     ORF proportion       ORF proportion   ORF proportion                 ORF proportion   conserv-Num-bases-cov   length-(bp)
2     ORF-size             ORF-size         length-(bp)                    length-(bp)      histones-Num-bases-cov  Total-energy
3     TAG                  TAG              Conserved areas with coverage  TAG              length-(bp)             length
4     ORF score            ORF score        TAG                            ORF-size         conserv-Num-bases       ORF proportion
5     Number of exons (h)  CG               Histones with coverage         TA               Total-energy            TG
6     Number of exons (c)  TA               ORF-size                       ORF score        length                  GA
7     CG                   CGA              Bases with conservation        T                ORF proportion          GT
8     Conserved exons      TTA              TA                             TT               TG                      GC-content
9     TA                   TAA              ORF score                      CG               GA                      G
10    Conservation score   aaD              T                              Total-energy     GT                      A
11    CGA                  TTT              TT                             GC-content       GC-content              T
12    TTA                  TT               CG                             GA               G                       AT
13    TAA                  CCG              Total-energy                   TAA              A                       AG
14    aaD                  T                GC-content                     TTA              T                       TGA
15    TTT                  CGG              GA                             TTT              AT                      TC
16    TT                   GTA              TAA                            GTT              AG                      C
17    CCG                  GGA              TTA                            GGA              TGA                     ORF end
18    T                    GTT              TTT                            GC               TC                      AC
19    CGG                  GAC              GTT                            G                C                       CT
20    GTA                  TCG              GGA                            GTA              ORF end                 CA
Table 6.13: The top twenty ranked features based on classification effectiveness from the Ensembl and fRNAdb datasets. The first pair of columns lists the most effective features from the binary class experiments, coding versus non-coding. The second pair of columns lists the features for the multiclass experiments considering RNA types and proteins. The last pair of columns is from the multiclass experiments using only RNA types. Both the complete feature set and the de novo feature set are considered in each of the three experiment types.
binary classification experiment between Ensembl protein coding sequences and the fRNAdb
non-coding RNAs. Starting with the top ranked feature, ‘ORF proportion’, we run the classifier,
then add features one at a time in order of their rank and classify at each step.
Performance rises sharply over the first few features and then levels off as further features
are added. The accuracy reaches 94.8% by the time the top 20 features are used. The complete
feature set achieved an accuracy of 96.3%.
Feats  Features added        Accuracy  Precision  Recall
1      ORF proportion        74.8      0.649      0.675
2      ORF-size              91.5      0.936      0.823
3      TAG                   92.3      0.912      0.872
4      ORF prediction score  92.8      0.925      0.872
5      Number of exons (h)   94.0      0.944      0.889
6      Number of exons (c)   94.0      0.944      0.887
7      CG                    94.6      0.949      0.900
8      Conserved exons       94.5      0.949      0.896
9      TA                    94.5      0.948      0.900
10     Conservation score    94.8      0.950      0.906
11     CGA                   95.0      0.948      0.914
12     TTA                   95.1      0.948      0.916
13     TAA                   95.2      0.948      0.919
14     aaD                   95.0      0.942      0.919
15     TTT                   94.9      0.941      0.918
16     TT                    94.9      0.941      0.918
17     CCG                   95.0      0.941      0.920
18     T                     94.8      0.939      0.916
19     CGG                   94.8      0.939      0.916
20     GTA                   94.8      0.938      0.918
Table 6.14: Classification performance for the binary classifier when the top twenty ranked features from the Ensembl and fRNAdb datasets are added incrementally. As more features are added, the accuracy, precision and recall generally rise. The full model containing all features has an accuracy of 96.3%, precision of 0.966, and recall of 0.976 as shown in Table 6.3.
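The ranking and incremental evaluation in this section can be sketched as follows. The sketch computes information gain for discrete feature values and then builds the nested feature subsets (top 1, top 2, and so on) that would each be passed to the classifier. It is an illustration under assumptions: real feature values are continuous and would first need discretising, the particular ranking filter implementation used in this work is not reproduced here, and all names and data below are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels):
    """Feature names sorted by decreasing information gain."""
    return sorted(
        table, key=lambda name: information_gain(table[name], labels), reverse=True
    )


# Hypothetical toy data: four sequences, two binary features.
labels = ["coding", "coding", "non-coding", "non-coding"]
table = {
    "gc_high": [1, 0, 1, 0],        # uninformative about the class
    "has_long_orf": [1, 1, 0, 0],   # perfectly separates the classes
}

ranked = rank_features(table, labels)
# Nested subsets to evaluate in turn: top 1 feature, top 2 features, ...
incremental_subsets = [ranked[:k] for k in range(1, len(ranked) + 1)]
```

Each subset in `incremental_subsets` corresponds to one row of a table like Table 6.14, where the classifier is retrained and scored with only those features available.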
Chapter 7
Conclusion and future work
Over a short period, our understanding of non-coding RNA has increased dramatically.
No longer regarded as just an intermediate for protein synthesis, RNA in its non-coding
forms has been shown to play numerous roles in cell biology. At the same time, advances
in transcriptome studies using RNA-Seq have continued to provide a platform for new
research. Our work explored the prediction of non-coding RNA using an RNA-Seq approach.
7.1 Summary
In this thesis, we present a method and software for classifying transcript sequences as
protein coding versus non-coding, and extend this to distinguish different non-coding RNA
families, an extension that has not been reported in the literature. We also propose a
method for classifying de novo transcriptome contigs from short read RNA-Seq data.
Our results show that the performance of our classifier is comparable to, or in most cases
surpasses, what is reported in the current literature, and suggest that machine learning
based methods can be used to discriminate between different families of non-coding RNA.
The software tools generated in this work are designed to be modular and easy to modify
to suit particular needs.
As the number of transcriptome studies continues to increase, especially de novo,
non-reference based studies, we expect to see more methods emerge to handle the
sometimes noisy sequences that these studies produce. Our investigation into assembled
contigs indicates that classifiers can be expected to contribute to such studies. With
improvements in our
understanding of non-coding RNAs, the quality of non-coding databases, quality of tran-
scriptome experiments and of different assembly algorithms, we expect machine learning
approaches to such problems will continue to improve.
7.2 Future work
Here, we outline a number of areas for improving the calculations described, and directions
that we have yet to explore.
• In our investigation of the full contig set, we found many elements that seem to
be neither functional protein coding nor non-coding, e.g. fragmented contigs and
transcript runoffs in intronic and UTR regions of genes. Depending on the assembly
used, we have seen many fragmented contigs that cannot be merged. These fragmented
contigs may have features that could be used to classify them into an alternative
class of non-functional non-coding RNAs.
• In a true de novo setting in which classification would be applied to a species that
does not have a well-annotated genome sequence, we cannot expect to have database
annotated coding and non-coding sequences for all species. To assess a strictly de novo
classifier we must also explore building models on one training species and testing
them on another.
• Using relative RNA-Seq read coverage as a classifier feature has been shown to be
effective [30, 59, 77]. While this could be done for both transcripts and de novo contigs,
our initial focus was on de novo methodologies, and we did not assess it. A quick
follow-up could add the RNA-Seq read coverage for each transcript or contig.
• In our collaboration with the Trans-ABySS group we also assessed the detection of
polyadenylation sites within both transcript and contig sequences [110]. This could
serve as a source of information when inferring the direction of a transcript, as well
as when searching for the polyadenylation signals found in certain 3′ UTRs. Currently,
some features in the feature extraction are not optimised for reverse complement inputs;
this is a topic for further study.
• We assessed only one contig assembly program, ABySS [120], in the de novo
setting. De novo assembly requires higher coverage than reference based methods for
reconstructing the transcriptome. It is possible that reference based methods [40, 126]
can increase the sensitivity of transcript detection, though they are also known to
increase the number of false positives. Evaluating the performance of our classifier
with reference based assembly may also be of interest.
• Our study, along with many others that utilise RNA-Seq, uses protocols that are
designed more specifically for sequencing protein coding transcripts. Alternative
sequencing protocols are available that allow the detection of many small non-coding
sequences such as miRNAs. As many non-coding RNAs are small, investigating these
protocols may provide a more informative framework for testing our classifier.
• This thesis investigated different non-coding RNA types and families, and for that task
we focussed mainly on the types found in fRNAdb. Rfam is another database annotated
by RNA family. However, in our experience it was difficult to work with, as many
families had very few entries and some entries belonged to many families. Given its
strong growth over the years, we do not want to abandon this resource because of these
factors, and feel that it should be investigated again.
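One possible way to approach the reverse complement issue raised in the list above is to make nucleotide k-mer features independent of strand. The following is a minimal sketch only, not the feature extraction used in this work: the function names are hypothetical, and counting each k-mer together with its reverse complement under one "canonical" key is an assumption made for illustration. Note that such features discard strand information, which may or may not be desirable for a given feature.

```python
from collections import Counter

# Translation table for complementing DNA bases (other symbols are left as-is).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Strand-specific relative k-mer frequencies of `seq`."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def canonical_kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Strand-independent frequencies: a k-mer and its reverse complement
    are counted under the lexicographically smaller of the two."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[min(kmer, reverse_complement(kmer))] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

With this definition, a contig and its reverse complement produce identical canonical frequencies, whereas the strand-specific frequencies generally differ.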
Bibliography
[1] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell. Garland Science, 270 Madison Avenue, New York, New York, 5th edition, 2008.
[2] Paulo P. Amaral, Michael B. Clark, Dennis K. Gascoigne, Marcel E. Dinger, and John S. Mattick. lncRNAdb: a reference database for long noncoding RNAs. Nucleic Acids Research, 39(suppl 1):D146–D151, 01 2011.
[3] Roberto Arrial, Roberto Togawa, and Marcelo Brigido. Screening non-coding RNAs in transcriptomes from neglected species using PORTRAIT: case study of the pathogenic fungus Paracoccidioides brasiliensis. BMC Bioinformatics, 10(1):239, 2009.
[4] Yan W. Asmann, Michael B. Wallace, and E. Aubrey Thompson. Transcriptome profiling using next-generation sequencing. Gastroenterology, 135(5):1466–1468, 11 2008.
[5] Courtney C. Babbitt, Olivier Fedrigo, Adam D. Pfefferle, Alan P. Boyle, Julie E. Horvath, Terrence S. Furey, and Gregory A. Wray. Both noncoding and protein-coding RNAs contribute to gene expression evolution in the primate brain. Genome Biology and Evolution, 2010(0):67–79, 2010.
[6] JH Badger and GJ Olsen. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol, 16(4):512–524, 1999.
[7] Asa Ben-Hur, Cheng Soon Ong, Soren Sonnenburg, Bernhard Scholkopf, and Gunnar Ratsch. Support vector machines and kernels for computational biology. PLoS Comput Biol, 4(10):e1000173, 10 2008.
[8] E. Birney, J.A. Stamatoyannopoulos, A. Dutta, R. Guigó, T.R. Gingeras, E.H. Margulies, Z. Weng, M. Snyder, and E.T. Dermitzakis. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 06 2007.
[9] Inanc Birol, Shaun D. Jackman, Cydney B. Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. De novo transcriptome assembly with ABySS. Bioinformatics, 25(21):2872–2877, 11 2009.
[10] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, Cambridge CB3 0FB, U.K., 2006.
[11] Brigitte Boeckmann, Amos Bairoch, Rolf Apweiler, Marie-Claude Blatter, Anne Estreicher, Elisabeth Gasteiger, Maria J. Martin, Karine Michoud, Claire O’Donovan, Isabelle Phan, Sandrine Pilbout, and Michel Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 1 2003.
[12] Dario Boffelli, Jon McAuliffe, Dmitriy Ovcharenko, Keith D. Lewis, Ivan Ovcharenko, Lior Pachter, and Edward M. Rubin. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science, 299(5611):1391–1394, 02 2003.
[13] George A. Calin, Chang-gong Liu, Manuela Ferracin, Terry Hyslop, Riccardo Spizzo, Cinzia Sevignani, Muller Fabbri, Amelia Cimmino, Eun Joo Lee, Sylwia E. Wojcik, Masayoshi Shimizu, Esmerina Tili, Simona Rossi, Cristian Taccioli, Flavia Pichiorri, Xiuping Liu, Simona Zupo, Vlad Herlea, Laura Gramantieri, Giovanni Lanza, Hansjuerg Alder, Laura Rassenti, Stefano Volinia, Thomas D. Schmittgen, Thomas J. Kipps, Massimo Negrini, and Carlo M. Croce. Ultraconserved regions encoding ncRNAs are altered in human leukemias and carcinomas. Cancer Cell, 12(3):215–229, Sep 2007.
[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a Library for Support Vector Machines. National Taiwan University, 2001.
[15] F. Chiaromonte, R. J. Weber, K. M. Roskin, M. Diekhans, W. J. Kent, and D. Haussler. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harbor Symposia on Quantitative Biology, 68:245–254, 01 2003.
[16] Liam Childs, Zoran Nikoloski, Patrick May, and Dirk Walther. Identification and classification of ncRNA molecules using graph properties. Nucleic Acids Research, 37(9):e66–e66, 05 2009.
[17] Rebecca Chodroff, Leo Goodstadt, Tamara Sirey, Peter Oliver, Kay Davies, Eric Green, Zoltan Molnar, and Chris Ponting. Long noncoding RNA genes: conservation of sequence and brain expression among diverse amniotes. Genome Biology, 11(7):R72, 2010.
[18] Michele Clamp, Ben Fry, Mike Kamal, Xiaohui Xie, James Cuff, Michael F. Lin, Manolis Kellis, Kerstin Lindblad-Toh, and Eric S. Lander. Distinguishing protein-coding and noncoding genes in the human genome. Proceedings of the National Academy of Sciences, 104(49):19428–19433, 12 2007.
[19] Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520–562, 12 2002.
[20] Rat Genome Sequencing Project Consortium. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428(6982):493–521, 04 2004.
[21] Gregory M. Cooper, Michael Brudno, Eric A. Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Research, 14(4):539–548, 04 2004.
[22] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995-09-01.
[23] Jennifer Couzin. Breakthrough of the year: Small RNAs Make Big Splash. Science, 298(5602):2296–2297, 2002.
[24] Teresa Creanza, David Horner, Annarita D’Addabbo, Rosalia Maglietta, Flavio Mignone, Nicola Ancona, and Graziano Pesole. Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements. BMC Bioinformatics, 10(Suppl 6):S2, 2009.
[25] Marcel E. Dinger, Ken C. Pang, Tim R. Mercer, and John S. Mattick. Differentiating protein-coding and noncoding RNA: Challenges and ambiguities. PLoS Comput Biol, 4(11):e1000176, 11 2008.
[26] I. Dondoshansky. Blastclust (NCBI software development toolkit), 6.1 edition, 2002.
[27] Sean R. Eddy. Non-coding RNA genes and the modern RNA world. Nat Rev Genet, 2(12):919–929, 12 2001.
[28] Sean R. Eddy and Richard Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079–2088, 06 1994.
[29] Yasser EL-Manzalawy and Vasant Honavar. WLSVM: Integrating LibSVM into Weka Environment, 2005.
[30] Florian Erhard and Ralf Zimmer. Classification of ncRNAs using position and size information in deep sequencing data. Bioinformatics, 26(18):i426–i432, 09 2010.
[31] N. Erho and K. Wiese. An exploration of individual RNA structural elements in RNA gene finding. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2010 IEEE Symposium on, pages 1–9, 2-5 May 2010.
[32] Noah Fahlgren, Miya D. Howell, Kristin D. Kasschau, Elisabeth J. Chapman, Christopher M. Sullivan, Jason S. Cumbie, Scott A. Givan, Theresa F. Law, Sarah R. Grant, Jeffery L. Dangl, and James C. Carrington. High-throughput sequencing of Arabidopsis microRNAs: Evidence for frequent birth and death of MIRNA genes. PLoS ONE, 2(2):e219, 2007.
[33] Alistair R. R. Forrest, Rehab F. Abdelhamid, and Piero Carninci. Annotating non-coding transcription using functional genomics strategies. Briefings in Functional Genomics & Proteomics, 8(6):437–443, 11 2009.
[34] Kelly A. Frazer, Lior Pachter, Alexander Poliakov, Edward M. Rubin, and Inna Dubchak. Vista: computational tools for comparative genomics. Nucleic Acids Research, 32(suppl 2):W273–W279, 07 2004.
[35] Masaaki Furuno, Ken C Pang, Noriko Ninomiya, Shiro Fukuda, Martin C Frith, Carol Bult, Chikatoshi Kai, Jun Kawai, Piero Carninci, Yoshihide Hayashizaki, John S Mattick, and Harukazu Suzuki. Clusters of Internally Primed Transcripts Reveal Novel Long Noncoding. PLoS Genet, 2(4):e37, 04 2006.
[36] Paul P. Gardner, Jennifer Daub, John G. Tate, Eric P. Nawrocki, Diana L. Kolbe, Stinus Lindgreen, Adam C. Wilkinson, Robert D. Finn, Sam Griffiths-Jones, Sean R. Eddy, and Alex Bateman. Rfam: updates to the RNA families database. Nucleic Acids Research, pages gkn766, 10 2008.
[37] G.B. Golding. Simple sequence is abundant in eukaryotic proteins. PRS, 8(06):1358–1361, 1999.
[38] Sam Griffiths-Jones, Russell J. Grocock, Stijn van Dongen, Alex Bateman, and Anton J. Enright. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research, 34(suppl 1):D140–144, 1 2006.
[39] Mitchell Guttman, Ido Amit, Manuel Garber, Courtney French, Michael F. Lin, David Feldser, Maite Huarte, Or Zuk, Bryce W. Carey, John P. Cassady, Moran N. Cabili, Rudolf Jaenisch, Tarjei S. Mikkelsen, Tyler Jacks, Nir Hacohen, Bradley E. Bernstein, Manolis Kellis, Aviv Regev, John L. Rinn, and Eric S. Lander. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 458(7235):223–227, 03 2009.
[40] Mitchell Guttman, Manuel Garber, Joshua Z Levin, Julie Donaghey, James Robinson, Xian Adiconis, Lin Fan, Magdalena J Koziol, Andreas Gnirke, Chad Nusbaum, John L Rinn, Eric S Lander, and Aviv Regev. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat Biotech, 28(5):503–510, 05 2010.
[41] Brian J Haas and Michael C Zody. Advancing RNA-Seq analysis. Nat Biotech, 28(5):421–423, 05 2010.
[42] Michael Hackenberg, Martin Sturm, David Langenberger, Juan Manuel Falcon-Perez, and Ana M. Aransay. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research, 37(suppl 2):W68–W76, 07 2009.
[43] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update; SIGKDD Explorations. SIGKDD Explorations Newsletter, 11(1), June 2009.
[44] Ross C. Hardison, John Oeltjen, and Webb Miller. Long human–mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Research, 7(10):959–966, 10 1997.
[45] Artemis G. Hatzigeorgiou, Petko Fiziev, and Martin Reczko. DIANA-EST: a statistical analysis. Bioinformatics, 17(10):913–919, 10 2001.
[46] Shunmin He, Changning Liu, Geir Skogerbo, Haitao Zhao, Jie Wang, Tao Liu, Baoyan Bai, Yi Zhao, and Runsheng Chen. NONCODE v2.0: decoding the non-coding. Nucl. Acids Res., page gkm1011, 2007.
[47] David Hendrix, Michael Levine, and Weiyang Shi. miRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data. Genome Biology, 11(4):R39, 2010.
[48] Michael Hiller, Sven Findeiß, Sandro Lein, Manja Marz, Claudia Nickel, Dominic Rose, Christine Schulz, Rolf Backofen, Sonja J. Prohaska, Gunter Reuter, and Peter F. Stadler. Conserved introns reveal novel transcripts in Drosophila melanogaster. Genome Research, 19(7):1289–1300, 07 2009.
[49] I. L. Hofacker, W. Fontana, P. F. Stadler, L. S. Bonhoeffer, M. Tacker, and P. Schuster. Fast folding and comparison of RNA secondary structures. Monatshefte fur Chemie / Chemical Monthly, 125(2):167–188, 02 1994.
[50] I. L. Hofacker, B. Priwitzer, and P. F. Stadler. Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20(2):186–190, 1 2004.
[51] Ivo L. Hofacker. Vienna RNA secondary structure server. Nucleic Acids Research, 31(13):3429–3431, 7 2003.
[52] Yair Horesh, Ydo Wexler, Ilana Lebenthal, Michal Ziv-Ukelson, and Ron Unger. RNAslider: a faster engine for consecutive windows folding and its application to the analysis of genomic folding asymmetry. BMC Bioinformatics, 10(1):76, 2009.
[53] Fan Hsu, W. James Kent, Hiram Clawson, Robert M. Kuhn, Mark Diekhans, and David Haussler. The UCSC Known Genes. Bioinformatics, 22(9):1036–1046, 05 2006.
[54] Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. J. Mach. Learn. Res., 7:85–115, December 2006.
[55] T. J. P. Hubbard, B. L. Aken, S. Ayling, B. Ballester, K. Beal, E. Bragin, S. Brent, Y. Chen, P. Clapham, L. Clarke, G. Coates, S. Fairley, S. Fitzgerald, J. Fernandez-Banet, L. Gordon, S. Graf, S. Haider, M. Hammond, R. Holland, K. Howe, A. Jenkinson, N. Johnson, A. Kahari, D. Keefe, S. Keenan, R. Kinsella, F. Kokocinski, E. Kulesha, D. Lawson, I. Longden, K. Megy, P. Meidl, B. Overduin, A. Parker, B. Pritchard, D. Rios, M. Schuster, G. Slater, D. Smedley, W. Spooner, G. Spudich, S. Trevanion, A. Vilella, J. Vogel, S. White, S. Wilder, A. Zadissa, E. Birney, F. Cunningham, V. Curwen, R. Durbin, X. M. Fernandez-Suarez, J. Herrero, A. Kasprzyk, G. Proctor, J. Smith, S. Searle, and P. Flicek. Ensembl 2009. Nucleic Acids Research, 37(suppl 1):D690–D697, 01 2009.
[56] A. M. Hughes. Oxford English Dictionary. Isis, 99(3):586, Sep 2008.
[57] D. E. Janes, C. Chapus, Y. Gondo, D. F. Clayton, S. Sinha, C. A. Blatti, C. L. Organ, M. K. Fujita, C. N. Balakrishnan, and S. V. Edwards. Reptiles and mammals have differentially retained long conserved noncoding sequences from the amniote ancestor. Genome Biology and Evolution, 3:102–113, 01 2011.
[58] Hui Jia, Maureen Osak, Gireesh K. Bogu, Lawrence W. Stanton, Rory Johnson, and Leonard Lipovich. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA, 16(8):1478–1487, 08 2010.
[59] Chol-Hee Jung, Martin Hansen, Igor Makunin, Darren Korbie, and John Mattick. Identification of novel non-coding RNAs using profiles of short sequence reads from next generation sequencing data. BMC Genomics, 11(1):77, 2010.
[60] Manolis Kellis, Nick Patterson, Matthew Endrizzi, Bruce Birren, and Eric S. Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937):241–254, 05 2003.
[61] W. James Kent. BLAT—the BLAST-like alignment tool. Genome Research, 12(4):656–664, 04 2002.
[62] W. James Kent, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom H. Pringle, Alan M. Zahler, and David Haussler. The human genome browser at UCSC. Genome Research, 12(6):996–1006, 06 2002.
[63] Taishin Kin, Kouichirou Yamada, Goro Terai, Hiroaki Okida, Yasuhiko Yoshinari, Yukiteru Ono, Aya Kojima, Yuki Kimura, Takashi Komori, and Kiyoshi Asai. fRNAdb: a platform for mining/annotating functional RNA candidates from non-coding RNA sequences. Nucleic Acids Research, 35(suppl 1):D145–148, 1 2007.
[64] Lei Kong, Yong Zhang, Zhi-Qiang Ye, Xiao-Qiao Liu, Shu-Qi Zhao, Liping Wei, and Ge Gao. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Research, 35(suppl 2):W345–349, 7 2007.
[65] Jack Kyte and Russell F. Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1):105–132, 1982.
[66] S. Sai Lakshmi and Shipra Agrawal. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Research, 36(suppl 1):D173–D177, 01 2008.
[67] David Langenberger, Clara Bermudez-Santana, Jana Hertel, Steve Hoffmann, Philipp Khaitovich, and Peter F. Stadler. Evidence for human microRNA-offset RNAs in small RNA sequencing data. Bioinformatics, 25(18):2298–2301, 2009.
[68] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, 11 2007.
[69] Rasko Leinonen, Ruth Akhtar, Ewan Birney, James Bonfield, Lawrence Bower, Matt Corbett, Ying Cheng, Fehmi Demiralp, Nadeem Faruque, Neil Goodgame, Richard Gibson, Gemma Hoad, Christopher Hunter, Mikyung Jang, Steven Leonard, Quan Lin, Rodrigo Lopez, Michael Maguire, Hamish McWilliam, Sheila Plaister, Rajesh Radhakrishnan, Siamak Sobhany, Guy Slater, Petra Ten Hoopen, Franck Valentin, Robert Vaughan, Vadim Zalunin, Daniel Zerbino, and Guy Cochrane. Improvements to services at the European Nucleotide Archive. Nucleic Acids Research, 38(suppl 1):D39–D45, 01 2010.
[70] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 07 2009.
[71] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 08 2009.
[72] Jiong-Tang Li, Yong Zhang, Lei Kong, Qing-Rong Liu, and Liping Wei. Trans-natural antisense transcripts including noncoding RNAs in 10 species: implications for expression regulation. Nucleic Acids Research, 36(15):4833–4844, 09 2008.
[73] Weizhong Li and Adam Godzik. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 7 2006.
[74] Jinfeng Liu, Julian Gough, and Burkhard Rost. Distinguishing protein-coding from non-coding RNAs through support vector machines. PLoS Genet, 2(4):e29, 04 2006.
[75] G. G. Loots, R. M. Locksley, C. M. Blankespoor, Z. E. Wang, W. Miller, E. M. Rubin, and K. A. Frazer. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288(5463):136–140, 04 2000.
[76] C. Lottaz, C. Iseli, C. V. Jongeneel, and P. Bucher. Modeling sequencing errors by combining Hidden Markov models. Bioinformatics, 19(suppl 2):ii103–112, 9 2003.
[77] Zhi John Lu, Kevin Y. Yip, Guilin Wang, Chong Shou, LaDeana W. Hillier, Ekta Khurana, Ashish Agarwal, Raymond Auerbach, Joel Rozowsky, Chao Cheng, Masaomi Kato, David M. Miller, Frank Slack, Michael Snyder, Robert H. Waterson, Valerie Reinke, and Mark Gerstein. Prediction and characterization of non-coding RNAs in C. elegans by integrating conservation, secondary structure and high throughput sequencing and array data. Genome Research, 10.1101/gr.110189.110, December 2010.
[78] R. B. Lyngsø and C. N. Pedersen. RNA pseudoknot prediction in energy-based models. J Comput Biol, 7(3-4):409–427, 2000.
[79] Ariane Machado-Lima, Hernando del Portillo, and Alan Durham. Computational methods in noncoding RNA research. Journal of Mathematical Biology, 56(1):15–49, 01 2008.
[80] J.R. Manak, S. Dike, V. Sementchenko, P. Kapranov, F. Biemar, J. Long, J. Cheng, I. Bell, S. Ghosh, A. Piccolboni, and T.R. Gingeras. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799–816, 06 2007.
[81] Samuel Marguerat, Brian T. Wilhelm, and Jurg Bahler. Next-generation sequencing: applications beyond genomes. Biochemical Society transactions, 36(Pt 5):1091–1096, October 2008.
[82] Elliott H. Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, and Eric D. Green. Identification and characterization of multi-species conserved sequences. Genome Research, 13(12):2507–2518, 12 2003.
[83] Anthony Mathelier and Alessandra Carbone. MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics, 26(18):2226–2234, 09 2010.
[84] Pedro P. Medina, Mona Nolde, and Frank J. Slack. OncomiR addiction in an in vivo model of microRNA-21-induced pre-B-cell lymphoma. Nature, 467(7311):86–90, 09 2010.
[85] Tim R. Mercer, Marcel E. Dinger, and John S. Mattick. Long non-coding RNAs: insights into functions. Nat Rev Genet, 10(3):155–159, 03 2009.
[86] Michael L. Metzker. Sequencing technologies – the next generation. Nat Rev Genet, 11(1):31–46, 01 2010.
[87] Flavio Mignone, Anna Anselmo, Giacinto Donvito, Giorgio Maggi, Giorgio Grillo, and Graziano Pesole. Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes. BMC Genomics, 9(1):277, 2008.
[88] Tarjei S. Mikkelsen, Manching Ku, David B. Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, William Brockman, Tae-Kyung Kim, Richard P. Koche, William Lee, Eric Mendenhall, Aisling O’Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf Jaenisch, Chad Nusbaum, Eric S. Lander, and Bradley E. Bernstein. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, 08 2007.
[89] The modENCODE Consortium, Sushmita Roy, Jason Ernst, Peter V. Kharchenko, Pouya Kheradpour, Nicolas Negre, Matthew L. Eaton, Jane M. Landolin, Christopher A. Bristow, Lijia Ma, Michael F. Lin, Stefan Washietl, Bradley I. Arshinoff, Ferhat Ay, Patrick E. Meyer, Nicolas Robine, Nicole L. Washington, Luisa Di Stefano, Eugene Berezikov, Christopher D. Brown, Rogerio Candeias, Joseph W. Carlson, Adrian Carr, Irwin Jungreis, Daniel Marbach, Rachel Sealfon, Michael Y. Tolstorukov, Sebastian Will, Artyom A. Alekseyenko, Carlo Artieri, Benjamin W. Booth, Angela N. Brooks, Qi Dai, Carrie A. Davis, Michael O. Duff, Xin Feng, Andrey A. Gorchakov, Tingting Gu, Jorja G. Henikoff, Philipp Kapranov, Renhua Li, Heather K. MacAlpine, John Malone, Aki Minoda, Jared Nordman, Katsutomo Okamura, Marc Perry, Sara K. Powell, Nicole C. Riddle, Akiko Sakai, Anastasia Samsonova, Jeremy E. Sandler, Yuri B. Schwartz, Noa Sher, Rebecca Spokony, David Sturgill, Marijke van Baren, Kenneth H. Wan, Li Yang, Charles Yu, Elise Feingold, Peter Good, Mark Guyer, Rebecca Lowdon, Kami Ahmad, Justen Andrews, Bonnie Berger, Steven E. Brenner, Michael R. Brent, Lucy Cherbas, Sarah C. R. Elgin, Thomas R. Gingeras, Robert Grossman, Roger A. Hoskins, Thomas C. Kaufman, William Kent, Mitzi I. Kuroda, Terry Orr-Weaver, Norbert Perrimon, Vincenzo Pirrotta, James W. Posakony, Bing Ren, Steven Russell, Peter Cherbas, Brenton R. Graveley, Suzanna Lewis, Gos Micklem, Brian Oliver, Peter J. Park, Susan E. Celniker, Steven Henikoff, Gary H. Karpen, Eric C. Lai, David M. MacAlpine, Lincoln D. Stein, Kevin P. White, and Manolis Kellis. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science, 330(6012):1787–1797, 12 2010.
[90] Ryan D. Morin, Matthew Bainbridge, Anthony Fejes, Martin Hirst, Martin Krzywinski, Trevor J. Pugh, Helen McDonald, Richard Varhol, Steven J.M. Jones, and Marco A. Marra. Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques, 45(1):81–94, July 2008.
[91] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth, 5(7):621–628, 07 2008.
[92] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the Yeast genome defined by RNA sequencing. Science, 320(5881):1344–1349, 06 2008.
[93] Marcelo A. Nobrega, Yiwen Zhu, Ingrid Plajzer-Frick, Veena Afzal, and Edward M. Rubin. Megabase deletions of gene deserts result in viable mice. Nature, 431(7011):988–993, 10 2004.
[94] Kirt Noel. Examining stem-loops as a sequence signal for identifying structural RNA genes. Master’s thesis, Simon Fraser University, April 2005.
[95] Karl J. V. Nordstrom, Majd A. I. Mirza, Markus Sallman Almen, David E. Gloriam, Robert Fredriksson, and Helgi B. Schloth. Critical evaluation of the FANTOM3 non-coding RNA transcripts. Genomics, 94(3):169–176, 9 2009.
[96] David L. Olson and Dursun Delen. Advanced Data Mining Techniques. Springer Publishing Company, Incorporated, 1st edition, 2008.
[97] Ulf Andersson Ørom, Thomas Derrien, Malte Beringer, Kiranmai Gumireddy, Alessandro Gardini, Giovanni Bussotti, Fan Lai, Matthias Zytnicki, Cedric Notredame, Qihong Huang, Roderic Guigo, and Ramin Shiekhattar. Long noncoding RNAs with enhancer-like function in human cells. Cell, 143(1):46–58, 10 2010.
[98] Ken C. Pang, Martin C. Frith, and John S. Mattick. Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends in Genetics, 22(1):1–5, 1 2006.
[99] Ken C. Pang, Stuart Stephen, Marcel E. Dinger, Par G. Engstrom, Boris Lenhard, and John S. Mattick. RNAdb 2.0–an expanded database of mammalian non-coding RNAs. Nucleic Acids Research, 35(suppl 1):D178–182, 1 2007.
[100] Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748–9753, 08 2001.
[101] Elisabetta Pizzi and Clara Frontali. Low-complexity regions in Plasmodium falciparum proteins. Genome Research, 11:218–229, 2001.
[102] Vasilis J. Promponas, Anton J. Enright, Sophia Tsoka, David P. Kreil, Christophe Leroy, Stavros Hamodrakas, Chris Sander, and Christos A. Ouzounis. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16(10):915–922, 10 2000.
[103] Kim D. Pruitt, Tatiana Tatusova, and Donna R. Maglott. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, pages gkl842, 11 2006.
[104] Matteo Re, Graziano Pesole, and David Horner. Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics. BMC Bioinformatics, 10(1):282, 2009.
[105] Brooke Rhead, Donna Karolchik, Robert M. Kuhn, Angie S. Hinrichs, Ann S. Zweig,Pauline A. Fujita, Mark Diekhans, Kayla E. Smith, Kate R. Rosenbloom, Brian J.Raney, Andy Pohl, Michael Pheasant, Laurence R. Meyer, Katrina Learned, Fan Hsu,Jennifer Hillman-Jackson, Rachel A. Harte, Belinda Giardine, Timothy R. Dreszer,Hiram Clawson, Galt P. Barber, David Haussler, and W. James Kent. The UCSCGenome Browser database: update 2010. Nucleic Acids Research, 38(suppl 1):D613–D619, 01 2010.
[106] Peter Rice, Ian Longden, and Alan Bleasby. EMBOSS: The European MolecularBiology Open Software Suite. Trends in Genetics, 16(6):276 – 277, 2000.
[107] E. Rivas and S. R. Eddy. Noncoding RNA gene detection using comparative sequenceanalysis. BMC bioinformatics, 2(1):8+, 2001.
[108] A. Gordon Robertson, Mikhail Bilenky, Angela Tam, Yongjun Zhao, Thomas Zeng,Nina Thiessen, Timothee Cezard, Anthony P. Fejes, Elizabeth D. Wederell, RebeccaCullum, Ghia Euskirchen, Martin Krzywinski, Inanc Birol, Michael Snyder, Pamela A.Hoodless, Martin Hirst, Marco A. Marra, and Steven J. M. Jones. Genome-widerelationship between histone H3 lysine 4 mono- and tri-methylation and transcriptionfactor binding. Genome Research, 18(12):1906–1917, 12 2008.
[109] Gordon Robertson, Martin Hirst, Matthew Bainbridge, Misha Bilenky, Yongjun Zhao,Thomas Zeng, Ghia Euskirchen, Bridget Bernier, Richard Varhol, Allen Delaney, NinaThiessen, Obi L Griffith, Ann He, Marco Marra, Michael Snyder, and Steven Jones.Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipita-tion and massively parallel sequencing. Nat Meth, 4(8):651–657, 08 2007.
[110] Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, MatthewField, Shaun D Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny QQian, Malachi Griffith, Anthony Raymond, Nina Thiessen, Timothee Cezard, Yaron SButterfield, Richard Newsome, Simon K Chan, Rong She, Richard Varhol, BaljitKamoh, Anna-Liisa Prabhu, Angela Tam, YongJun Zhao, Richard A Moore, MartinHirst, Marco A Marra, Steven J M Jones, Pamela A Hoodless, and Inanc Birol.De novo assembly and analysis of RNA-seq data. Nature Methods, advance onlinepublication, October 2010.
[111] Brid M. Ryan, Ana I. Robles, and Curtis C. Harris. Genetic variation in microRNA networks: the implications for cancer research. Nat Rev Cancer, 10(6):389–402, 06 2010.
[112] R. Salari, C. Aksay, E. Karakoc, P. J. Unrau, I. Hajirasouliha, S. C. Sahinalp, and S. Maas. smyRNA: A Novel Ab Initio ncRNA Gene Finder. PLoS ONE, 4:5433, May 2009.
[113] F. Sanger, S. Nicklen, and A. R. Coulson. DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467, 12 1977.
[114] Kengo Sato, Michiaki Hamada, Kiyoshi Asai, and Toutai Mituyama. CentroidFold: a web server for RNA secondary structure prediction. Nucleic Acids Research, 37(suppl 2):W277–W280, 07 2009.
[115] Bruce A Shapiro, Yaroslava G Yingling, Wojciech Kasprzak, and Eckart Bindewald. Bridging the gap in RNA structure prediction. Current Opinion in Structural Biology, 17(2):157–165, 2007. Theory and simulation / Macromolecular assemblages.
[116] Kana Shimizu, Jun Adachi, and Yoichi Muraoka. ANGLE: a sequencing errors resistant program for predicting protein coding regions in unfinished cDNA. Journal of Bioinformatics and Computational Biology, 4(3):649–64, June 2006.
[117] Christian Honer zu Siederdissen and Ivo L. Hofacker. Discriminatory power of RNA family models. Bioinformatics, 26(18):i453–i459, 09 2010.
[118] Adam Siepel, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs, Minmei Hou, Kate Rosenbloom, Hiram Clawson, John Spieth, LaDeana W. Hillier, Stephen Richards, George M. Weinstock, Richard K. Wilson, Richard A. Gibbs, W. James Kent, Webb Miller, and David Haussler. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15(8):1034–1050, 08 2005.
[119] Tulio C. Silva, Pedro A. Berger, Roberto T. Arrial, Roberto C. Togawa, Marcelo M. Brigido, and Maria Emilia M. T. Walter. SOM-PORTRAIT: Identifying Non-coding RNAs Using Self-Organizing Maps, volume 5676/2009 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2009.
[120] Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones, and Inanc Birol. ABySS: A parallel assembler for short read sequence data. Genome Research, 19:1117–1123, February 2009.
[121] G.S.C. Slater. Algorithms for the Analysis of Expressed Sequence Tags. PhD thesis, University of Cambridge, Cambridge, 2000.
[122] Tomasz Smolinski, Mariofanna Milanova, Aboul-Ella Hassanien, Kirt Noel, and Kay Wiese. Considering Stem-Loops as Sequence Signals for Finding Ribosomal RNA Genes, volume 151, pages 337–357. Springer Berlin / Heidelberg, 2008.
[123] MJ Solomon, PL Larsen, and A Varshavsky. Mapping protein-DNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell, 53(6):937–947, 06 1988.
[124] Jason E. Stajich, David Block, Kris Boulez, Steven E. Brenner, Stephen A. Chervitz, Chris Dagdigian, Georg Fuellen, James G.R. Gilbert, Ian Korf, Hilmar Lapp, Heikki Lehvaslaiho, Chad Matsalla, Chris J. Mungall, Brian I. Osborne, Matthew R. Pocock, Peter Schattner, Martin Senger, Lincoln D. Stein, Elia Stupka, Mark D. Wilkinson, and Ewan Birney. The Bioperl toolkit: Perl modules for the life sciences. Genome Research, 12:1611–1618, 2002.
[125] The FANTOM Consortium. The transcriptional landscape of the mammalian genome. Science, 309(5740):1559–1563, 9 2005.
[126] Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech, 28(5):511–515, 05 2010.
[127] Huei-Hun H. Tseng, Zasha Weinberg, Jeremy Gore, Ronald R. Breaker, and Walter L. Ruzzo. Finding non-coding RNAs through genome-scale clustering. Journal of Bioinformatics and Computational Biology, 7(2):373–388, April 2009.
[128] Andrew Uzilov, Joshua Keegan, and David Mathews. Detection of non-coding RNAs on the basis of predicted secondary structure formation free energy change. BMC Bioinformatics, 7(1):173, 2006.
[129] Bjorn Voß. Structural analysis of aligned RNAs. Nucleic Acids Research, 34(19):5471–5481, 2006.
[130] Bjorn Voß, Jens Georg, Verena Schon, Susanne Ude, and Wolfgang Hess. Biocomputational prediction of non-coding RNAs in model cyanobacteria. BMC Genomics, 10(1):123, 2009.
[131] Jiayi Wang, Xiangfan Liu, Huacheng Wu, Peihua Ni, Zhidong Gu, Yongxia Qiao, Ning Chen, Fenyong Sun, and Qishi Fan. CREB up-regulates long non-coding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Research, 38(16):5366–5383, 09 2010.
[132] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10(1):57–63, 01 2009.
[133] Stefan Washietl, Ivo L. Hofacker, and Peter F. Stadler. Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102(7):2454–2459, 2005.
[134] Zasha Weinberg, Jonathan Perreault, Michelle M. Meyer, and Ronald R. Breaker. Exceptional structured noncoding RNAs revealed by bacterial metagenome analysis. Nature, 462(7273):656–659, 12 2009.
[135] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[136] Adam Woolfe, Martin Goodson, Debbie K Goode, Phil Snell, Gayle K McEwen, Tanya Vavouri, Sarah F Smith, Phil North, Heather Callaway, Krys Kelly, Klaudia Walter, Irina Abnizova, Walter Gilks, Yvonne J. K Edwards, Julie E Cooke, and Greg Elgar. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol, 3(1):e7, 11 2004.
[137] Jing Wu. Testing the coding potential of conserved short genomic sequences. Advances in Bioinformatics, Article ID 287070, 8 pages, 2010.
[138] Jun Xie, Ming Zhang, Tao Zhou, Xia Hua, LiSha Tang, and Weilin Wu. scaRNAbase: a curated database for small nucleolar RNAs and cajal body-specific RNAs. Nucleic Acids Research, 35(suppl 1):D183–D187, 2006.
[139] Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC Bioinformatics, 6(1):310, 2005.
[140] Zizhen Yao, Zasha Weinberg, and Walter L. Ruzzo. CMfinder: a covariance model based RNA motif finding algorithm. Bioinformatics, 22(4):445–452, 2006.
[141] Ying Zhang, Dao-Gang Guan, Jian-Hua Yang, Peng Shao, Hui Zhou, and Liang-Hu Qu. ncRNAimprint: A comprehensive database of mammalian imprinted noncoding RNAs. RNA, pages –, 08 2010.
[142] Michael Zuker and David Sankoff. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46(4):591–621, 07 1984.
[143] Michael Zuker and Patrick Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 9(1):133–148, 1 1981.