Computational prediction of target genes of microRNAs

by

M. Hossein Radfar

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

Copyright © 2014 by M. Hossein Radfar

Abstract

Computational prediction of target genes of microRNAs

M. Hossein Radfar

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2014

MicroRNAs (miRNAs) are a class of short (21-25 nt) non-coding endogenous RNAs that mediate the expression of their direct target genes post-transcriptionally. The goal of this thesis is to identify the target genes of miRNAs using computational methods. The most popular computational target prediction methods rely on sequence-based determinants to predict targets. However, these determinants are neither sufficient nor necessary to identify functional target sites, and they commonly ignore the cellular conditions in which miRNAs interact with their targets in vivo. Since miRNA activity reduces the steady-state abundance of mRNA targets, the main goal of this thesis is to use large-scale gene expression profiles as a supplement to sequence-based computational miRNA target prediction techniques. We develop two computational miRNA target prediction methods, InMiR and BayMiR; in addition, we study the interaction between miRNAs and lncRNAs using long RNA expression data.

InMiR is a computational method that predicts the targets of intronic miRNAs based on the expression profiles of their host genes across a large number of datasets. InMiR can also predict which host genes have expression profiles that are good surrogates for those of their intronic miRNAs. Host genes that InMiR predicts to be bad surrogates contain significantly more miRNA target sites in their 3’ UTRs and are significantly more likely to have predicted Pol II and Pol III promoters in their introns.

We also develop BayMiR, which scores miRNA-mRNA pairs based on the endogenous footprint of miRNAs on gene expression on a genome-wide scale. BayMiR provides an “endogenous target repression” index that identifies the contribution of each miRNA in repressing a target gene in the presence of other targeting miRNAs.

This thesis also addresses the interactions between miRNAs and lncRNAs. Our analysis of the expression abundance of long RNA transcripts (mRNA and lncRNA) shows that the lncRNA target sets of some miRNAs have relatively low abundance in the tissues in which these miRNAs are highly active. We also found that lncRNAs and mRNAs that share many targeting miRNAs are significantly positively correlated, indicating that this set of highly expressed lncRNAs may act as miRNA sponges to promote mRNA regulation.

Acknowledgements

I would like to thank my supervisors and members of my committee Willy Wong, Quaid

Morris, and Zhaolei Zhang. I would like to thank Willy Wong for all his friendly guidance

and support throughout my PhD study. I also would like to thank Quaid Morris whose

curiosity, enthusiasm, support, and suggestions have greatly improved the quality of my

research. Additionally, I thank the members of Morris Lab and Sensory Communications,

both past and present, for valuable discussions and advice. Finally, I would like to thank

Nastaran, my wife, for her endless support, encouragement, and patience.

Contents

1 Introduction

2 Background and Literature Review
2.1 small RNAs
2.2 miRNAs biogenesis
2.2.1 Mechanisms of miRNAs-mediated gene regulation
2.3 Identification of miRNA targets
2.3.1 miRNA over-expression experiments
2.3.2 miRNA knockdown (antagonism) experiments
2.3.3 Prediction based on HITS/PAR-CLIP
2.3.4 Target prediction using luciferase reporters
2.3.5 Measuring the protein output after miRNA over expression or antagonism
2.4 Computational miRNAs target prediction methods
2.4.1 TargetScan
2.4.2 Pictar
2.4.3 miRSVR-miRanda
2.4.4 GenMiR
2.4.5 HOCTAR
2.4.6 COMETA
2.4.7 Sylamer
2.4.8 MIRZA
2.5 Conclusions

3 Intronic miRNAs and prediction of their targets
3.1 Intronic miRNAs
3.2 Method: InMIR
3.2.1 Computing weights for putative miRNA regulators on individual datasets
3.2.2 Mapping host gene weights to miRNA weights
3.2.3 Combining multiple datasets to predict functional targets
3.2.4 Determining a cutoff value for significant interactions
3.2.5 Predicting miRNA targets using inverse correlation (CORR method)
3.2.6 Processing hosts and targets data
3.3 Results
3.3.1 Data set
3.3.2 Detecting good host gene surrogate
3.3.3 Targeting of host genes by miRNAs partially explains their predicted surrogacy
3.3.4 Correlation measurements are not good indicators of surrogacy
3.4 Discussion

4 BayMiR: a computational miRNA target prediction method
4.1 Introduction
4.2 Results
4.2.1 BayMiR method
4.2.2 BayMiR identifies highly repressed targets on miRNA over-expression assays
4.2.3 miRNA activity and expression profiles are significantly correlated
4.2.4 mRNAs harboring miRNA target sites near the both ends of the 3’ UTR have higher endogenous down-regulation signals
4.3 Materials and Methods
4.3.1 BayMiR model
4.3.2 Processing mRNA expression Data
4.3.3 MiRNA-mRNA interaction analysis
4.3.4 Enrichment analysis
4.3.5 Availability of BayMiR and supporting data
4.4 Discussion

5 Impact of miRNAs on long non-coding RNAs
5.1 Background
5.2 Results
5.2.1 lncRNA targets of some miRNAs have relative low expression in the tissues in which the miRNAs are highly active
5.2.2 lncRNAs that significantly positively correlated with mRNAs may decoy their common targeting miRNAs
5.2.3 Highly expressed lncRNAs in the cytoplasm contain significantly less seed match sites than those in the nucleus
5.2.4 High relative number of lncRNA targets in allosomes and chromosomes 20-22
5.2.5 LncRNAs that contain seed match sites have significantly higher expression compared to those that lack seed match sites
5.2.6 Discussion
5.3 Materials
5.3.1 Microarray data
5.3.2 Measuring correlation between mRNAs and lncRNAs
5.3.3 Hyper-geometric test analysis
5.3.4 Identifying the complementary seed match sites in the lncRNA transcripts

6 Conclusions and Future Work
6.1 Conclusions
6.2 Future directions
6.2.1 A Bayesian approach to decipher the TF-miRNA-mRNA-lncRNA regulatory network
6.2.2 Identifying lncRNA binding sites complementary to mRNA sequences
6.2.3 Using sequence and expression evidence in parallel

Bibliography

List of Tables

3.1 The description of symbols used in this chapter
3.2 InMiR procedure
5.1 Comparison between the expression level of cytoplasmic and nucleic lncRNAs

List of Figures

2.1 MiRNAs Biogenesis
3.1 Interaction between hosts, targets, and intronic miRNAs using DAG
3.2 Simplified Interaction between hosts, targets, and intronic miRNAs
3.3 Plots a-d: the CDFs of the weights wigk
3.4 Receiver Operating Characteristic (ROC) curve analysis
3.5 boxplots of weights obtained from the procedure described in Table I
3.6 the interaction network of target and host genes of intronic miRNAs
3.7 Distribution of number of putative targets of intronic miRNAs
3.8 the scatter plot of good and bad surrogate host genes
3.9 Venn diagrams showing overlap between good and bad surrogates
3.10 the CDF of the number of miRNAs targeting host (blue) and non-host genes
3.11 Number of intergenic and intronic miRNAs
3.12 Host genes targeted by intronic miRNAs of other hosts
3.13 Correlation coefficients averaged over five correlation datasets
3.14 Scatter plots of five correlation datasets
3.15 Intronic miRNAs comprises a significant portion of miRNAs in species
3.16 Regulatory mechanisms of intronic miRNAs
3.17 The host genes targeted by their own intronic miRNAs.
3.18 host gene and intronic miRNA resembles a “biological switch”
4.1 BayMiR Method
4.2 BayMiR performance in the miRNA over-expression experiments.
4.3 Cumulative distribution of scores for the validated targets.
4.4 Comparing BayMiR and Cometa bar plots
4.5 Comparing BayMiR and Cometa score CDFs
4.6 Validated KEGG pathways
4.7 Enrichment Map
4.8 WNT signaling pathway.
4.9 KEGG “Pathways in cancer”
4.10 miRNA targeting.
4.11 mRNAs harboring miRNA target sites near the both end of the 3’ UTR
4.12 BayMiR and position contribution scores
4.13 BaymiR predicts down-regulated genes in samples not included in training data. Blue circled line: prediction error on training data and red circled line: prediction error on test data.
4.14 Estimated (red) and actual (blue) expression profiles of nine genes across 28 test samples.
4.15 The 3’ UTR of mRNAs harbor many conserved seed matches.
4.16 Example of combinatorial regulation masking inverse correlation.
4.17 Gene expression variability increases as the number of target sites increases
5.1 lncRNA targets have low expression in some tissues
5.2 lncRNA targets have low expression in some tissues
5.3 Highly expressed lncRNAs in the cytoplasm
5.4 Distribution of lncRNAs on the human chromosome.
5.5 Abundance of target and non-target lncRNAs in 26 different tissues.
6.1 Proposed approach

Chapter 1

Introduction

MicroRNAs (miRNAs) are short (21-25 nt) non-coding RNAs that repress the expression of their direct target mRNAs [1, 2]. miRNAs play critical roles in a wide range of normal and disease-related biological processes [3, 4]. miRNA biogenesis can be concisely described as follows (for details see Chapter 2). Primary miRNAs (pri-miRNAs) are transcribed from intra/intergenic genomic loci and cleaved by Drosha to form approximately 70-nt hairpin precursors (called pre-miRNAs) that are subsequently cleaved by the RNase III enzyme, Dicer, to generate miRNA duplexes. One strand of the duplex, the mature miRNA, is loaded into the RNA-induced silencing complex (RISC) [5] and guides it to recognize mRNA targets through partial base pairing with the 3’ UTRs of targets [6].

The presence of target sites with perfect complementarity to the seed region (positions 2-7 from the 5’ end) of a miRNA is a strong predictor of targeting, but it is neither sufficient nor necessary [6–10]. Over the last decade many other sequence determinants have been proposed to specify efficient mRNA-miRNA duplexes, including: AU composition flanking target sites [7], thermodynamic stability of binding sites [11], evolutionary conservation of the seed [12–14], secondary structure accessibility [5, 15–17], target-site abundance [18, 19], seed-pairing stability [18], 3’ pairing contribution [7], a loop in positions 9-12 of miRNA-mRNA hybrids [9], and the binding location in the 3’ UTR [7, 17]. Due to the limited number of validated miRNA targets, the exact specificity and sensitivity of current determinants are unclear [20–23]; however, estimates of the precision of these determinants, alone or together, are typically reported to be about 50% at a sensitivity of 6-12% [24, 25], suggesting that sequence-based prediction methods are not fully capturing miRNA target preferences.
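The seed-match criterion described above can be illustrated with a minimal sketch. It assumes a 6-mer seed at miRNA positions 2-7 and scans a 3’ UTR for the reverse complement of that seed. The function names and the toy UTR string are hypothetical and serve only as an illustration; they are not part of any prediction tool discussed in this thesis.

```python
# Minimal sketch: locate perfect 6-mer seed matches (miRNA positions 2-7)
# in a 3' UTR. Function names and the toy UTR are illustrative assumptions.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna, start=2, end=7):
    """Return the mRNA motif that pairs with miRNA positions start..end.

    Positions are 1-based and counted from the miRNA 5' end. The motif is
    the reverse complement of the seed, written 5'->3' on the mRNA.
    """
    seed = mirna[start - 1:end]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr):
    """Return 0-based start offsets of every perfect seed match in utr."""
    site = seed_site(mirna)
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)  # allow overlapping matches
    return hits


let7a = "UGAGGUAGUAGGUUGUAUAGUU"      # mature let-7a sequence (5'->3')
utr = "AAACUACCUCAGGGAAAUACCUCUU"     # toy 3' UTR, invented for this example

print(seed_site(let7a))               # motif searched for in the UTR
print(find_seed_matches(let7a, utr))  # offsets of candidate sites
```

A scan of this kind only lists candidate sites. As noted above, a seed match is neither sufficient nor necessary for functional targeting, which is why the remaining determinants and expression evidence are needed.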

Popular computational miRNA target prediction techniques use these sequence features to determine the functional miRNA target sites [2, 7, 8, 12–15, 26–30]. These techniques, however, ignore the cellular conditions in which miRNAs interact with their targets in vivo. Gene expression data are rich resources that can complement sequence features to take into account the context dependency of miRNAs. In mammals, it is estimated that miRNAs primarily and dominantly repress the steady-state expression level of their targets [31–39]. Therefore, down-regulation of an mRNA’s expression when the miRNA is active is evidence of a functional target site on the gene in vivo. Although many methods have been introduced to incorporate mRNA and miRNA expression data into miRNA target predictions, existing methods either require paired miRNA-mRNA data [40–53], have only been tested in miRNA transfection assays [33, 34, 54], or do not consider the combinatorial impact of multiple miRNAs on mRNA expression [55, 56].

The main objective of this thesis is to improve the prediction accuracy of sequence-based miRNA target prediction methods by incorporating large amounts of mRNA expression data into miRNA target prediction. We develop computational methods that do not require miRNA expression data because not all miRNAs are profiled, and the profiles that do exist are noisy, have insufficient replicates, and are measured inconsistently across labs and methods [57].

Recent studies have shown that miRNAs may regulate long non-coding RNAs (lncRNAs) and vice versa, and thereby indirectly impact mRNA regulation [58–62]. In this thesis, we also study the interactions between miRNAs, mRNAs, and lncRNAs using the expression abundance of mRNAs and lncRNAs that contain miRNA target sites in a wide range of tissues.

This thesis consists of three parts. The first part concerns predicting the mRNA targets of miRNAs located in the introns of protein-coding genes (the so-called intronic miRNAs). We develop InMiR, a computational method that not only predicts the targets of intronic miRNAs but also determines whether a host mRNA level is a good surrogate for the intronic miRNA level. Because some intronic miRNAs are co-expressed with their host genes (they share the same transcriptional machinery), their expression levels may be highly correlated [43, 45, 63–72]. Accordingly, an inverse correlation between the host and target mRNAs of an intronic miRNA may be an indication of targeting. InMiR builds this notion into a linear regression model in which the expression abundance of a target mRNA is expressed in terms of the expression levels of the host mRNAs whose intronic miRNAs have seed match sites in the target mRNA. InMiR identifies 1,935 mRNA targets for 22 intronic miRNAs and determines that at least 30% of miRNAs are co-expressed with their host mRNAs.
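To make the regression idea concrete, the sketch below fits a target expression profile as a linear combination of host gene profiles using ordinary least squares on a single toy dataset. This is an illustration under stated assumptions, not the InMiR procedure itself: the solver, the function names, and the data are hypothetical, and the weighting and multi-dataset combination that InMiR uses are described in Chapter 3.

```python
# Minimal sketch (not the InMiR implementation): ordinary least squares fit
# of one target profile on the profiles of candidate host genes.
# All names and numbers below are illustrative assumptions.


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_host_weights(hosts, target):
    """Return [intercept, w_host1, w_host2, ...] minimizing squared error.

    hosts  : list of expression profiles, one list of samples per host gene
    target : expression profile of the candidate target mRNA
    """
    cols = [[1.0] * len(target)] + hosts  # intercept column + host columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * t for u, t in zip(ci, target)) for ci in cols]
    return solve(xtx, xty)


host_a = [1.0, 2.0, 3.0, 4.0, 5.0]
host_b = [2.0, 1.0, 4.0, 3.0, 6.0]
# Toy target: repressed as host_a rises, unaffected by host_b.
target = [10.0 - 1.5 * a for a in host_a]

weights = fit_host_weights([host_a, host_b], target)
print(weights)  # intercept, weight of host_a, weight of host_b
```

In this toy setting a clearly negative weight for a host gene is the pattern one would read as evidence that the miRNA it hosts represses the target, while a weight near zero gives no such evidence.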

In the second part of the thesis, we develop BayMiR, a Bayesian method that scores miRNA-mRNA pairs based on the endogenous footprint of miRNAs on genome-wide gene expression. BayMiR provides an “endogenous target repression” score, which identifies the contribution of each miRNA in repressing a target gene in the presence of other targeting miRNAs. BayMiR relates the changes in the log-transformed expression level of mRNAs to the activity level of miRNAs. Since miRNA and target mRNA expression data are anti-correlated [73], for each miRNA, BayMiR uses the negative mean of target expression levels as an estimate of the activity level of the miRNA. The BayMiR analysis was conducted on 1,539 human miRNAs and the expression levels of 13,303 genes measured in 5,372 microarray experiments; it predicts that approximately 60% of miRNA-mRNA duplexes with matched conserved target sites have a detectable down-regulation signal on gene expression.
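The activity estimate mentioned above (the negative mean of target expression levels) can be sketched as follows. The data layout, gene names, and function name are assumptions made for illustration; any normalization that BayMiR applies before this step is not shown.

```python
# Minimal sketch: per-sample miRNA activity estimated as the negative mean
# of the log expression of its predicted targets. The dictionary layout and
# the gene names are illustrative assumptions.
from statistics import mean


def mirna_activity(expr, targets):
    """Return one activity value per sample.

    expr    : dict mapping gene name -> list of log expression values,
              one value per sample (all lists have the same length)
    targets : iterable of gene names predicted to be targets of the miRNA
    """
    profiles = [expr[g] for g in targets if g in expr]
    # zip(*profiles) groups the values of all targets sample by sample.
    return [-mean(sample) for sample in zip(*profiles)]


expr = {
    "geneA": [1.0, 3.0],
    "geneB": [3.0, 5.0],
    "geneC": [9.0, 9.0],   # not a predicted target, so it is ignored
}
activity = mirna_activity(expr, ["geneA", "geneB"])
print(activity)  # lower target expression gives higher estimated activity
```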

In the third part, we study the interactions between miRNAs and lncRNAs as well as the impact of these interactions on mRNAs. lncRNAs are suggested to act as miRNA sponges and consequently reduce miRNA functionality. In addition, some studies indicate that miRNAs can regulate lncRNAs post-transcriptionally in a manner similar to that of mRNAs. Our analysis of the expression abundance of 7,535 RNA transcripts (mRNA and lncRNA) across 27 tissues shows that the lncRNA target sets of some miRNAs have relatively low abundance in the tissues in which these miRNAs are highly active, suggesting that miRNAs may modulate the expression of these lncRNAs in some specific tissues. We also found that lncRNAs and mRNAs that share many targeting miRNAs are significantly positively correlated, indicating that this set of highly expressed lncRNAs may sponge the miRNAs to promote mRNA regulation. Our analysis also showed that the lncRNAs that are highly expressed in the cytoplasm are under selective pressure to have fewer target sites compared to those highly expressed in the nucleus, suggesting that miRNAs may regulate only cytoplasm-specific lncRNAs.
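The two quantities used in the sponge analysis above, the number of miRNAs that target both a lncRNA and an mRNA and the correlation of their expression profiles across tissues, can be sketched as below. The choice of the Pearson coefficient, the miRNA labels, and the expression values are assumptions for illustration; the measure actually used is described in Chapter 5.

```python
# Minimal sketch: shared targeting miRNAs and expression correlation for one
# lncRNA-mRNA pair. Pearson correlation is an assumption here, and all
# labels and values are invented for illustration.
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical sets of miRNAs with seed matches in each transcript.
lncrna_mirnas = {"miR-A", "miR-B", "miR-C"}
mrna_mirnas = {"miR-B", "miR-C", "miR-D"}
shared = lncrna_mirnas & mrna_mirnas

# Hypothetical expression across four tissues.
lncrna_expr = [1.0, 2.0, 3.0, 4.0]
mrna_expr = [2.0, 4.0, 6.0, 8.0]
r = pearson(lncrna_expr, mrna_expr)

print(sorted(shared), r)
```

A pair with many shared miRNAs and a strongly positive correlation is the pattern interpreted above as consistent with sponge activity; the correlation alone does not establish the mechanism.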

This thesis is organized into six chapters as follows:

• Chapter 2 provides background on miRNA biogenesis and describes popular experimental and computational miRNA target prediction methods. We also discuss the pros and cons of these methods.

• Chapter 3 describes InMiR. We verify InMiR performance and compare it with HocTar, the only available intronic miRNA prediction method. Using InMiR, we analyze 140 Affymetrix datasets from Gene Expression Omnibus and build a network of 19,926 interactions among 57 intronic miRNAs and 3,864 targets. InMiR also predicts which host genes have expression profiles that are good surrogates for those of their intronic miRNAs. We show that host genes that InMiR predicts to be bad surrogates contain significantly more miRNA target sites in their 3’ UTRs and are significantly more likely to have predicted Pol II and Pol III promoters in their introns. By combining our results with previous reports, we distinguish three classes of intronic miRNAs: those that are tightly regulated with their host gene; those that are likely to be expressed from the same promoter but whose host gene is highly regulated by miRNAs; and those likely to have independent promoters.

• In Chapter 4 we introduce BayMiR, a computational Bayesian method that scores an miRNA-mRNA pair based on the endogenous repression of the mRNA induced by the miRNA in the presence of all other miRNAs that have conserved seed match sites in the 3’ UTR of the mRNA. We show that BayMiR assigns higher scores to predicted miRNA targets that are more down-regulated in miRNA over-expression assays, enriched for independently validated targets, and more consistently annotated with GO and KEGG terms when compared with high-scoring TargetScan targets and Cometa. In this chapter we also show that validated miRNA targets exhibit high expression variability and suggest that gene expression variation can also be used as a score for predicting miRNA targets.

• Chapter 5 addresses the possible interaction between lncRNAs and miRNAs by analyzing lncRNA expression data measured across 26 tissues. We investigate whether miRNAs can repress the expression of lncRNAs and whether lncRNAs can sponge miRNAs to mediate their functions.

• Finally, Chapter 6 summarizes the achievements of the thesis and the biological importance of the research, and gives some directions for future research in this field.

Chapter 2

Background and Literature Review

2.1 small RNAs

The human genome encodes a wide variety of functional elements with either defined

products (e.g., mRNAs) or reproducible biochemical signatures (e.g., transcription fac-

tor binding sites). Non-coding RNAs are a class of RNAs that are distinguished from

messenger RNAs in that they are not translated into proteins. So far approximately

19,000 non-coding RNAs have been annotated in the human genome [74]. Non-coding

RNAs are divided into two sub-categories: long non-coding RNAs (>200 nt) and small

non-coding RNAs. Small ncRNAs are short (18-200 nt) RNAs with roles in almost every

aspect of biology of animals, plants, and fungi. The main small RNAs are:

• Transfer RNAs (tRNAs): typically 73 to 93 nucleotides in length, they carry amino acids to the translation machinery.

• Small nucleolar RNAs (snoRNAs): they guide the modification of other RNAs, such as methylation, and are highly involved in ribosomal RNA nucleotide modification.

• Small interfering RNAs (siRNAs): also known as silencing RNAs, they are 20-25 nt double-stranded exogenous RNAs that interfere with the expression of genes with a complementary nucleotide sequence. There is a subtle difference between miRNAs and siRNAs in that siRNAs are exogenous: they are either taken up by cells or enter via vectors such as viruses. In addition, siRNAs typically bind perfectly to their targets. siRNAs are used to validate gene function through transfection experiments.

• Small nuclear ribonucleic acids (snRNAs): they are involved in transcription, splicing, and formation of precursor mRNAs, and are associated with small nuclear ribonucleoproteins (snRNPs).

• Piwi-interacting RNAs (piRNAs): they interact with Piwi proteins and silence genes. Their functionality remains largely unknown, but recent studies suggest a role in protecting the genome from invasive transposable elements in the germline; they are expressed primarily in the testes [75].

• MicroRNAs (miRNAs): they impact many biological processes through post-transcriptional modulation of gene expression. In the following, we describe the biogenesis of miRNAs in detail.

2.2 miRNAs biogenesis

MicroRNAs (miRNAs) are a class of short (21-25 nt) non-coding RNAs that play important roles in post-transcriptional modulation of gene expression in animals and plants. miRNAs were first discovered in the Ambros lab in 1993 as regulators of developmental timing in C. elegans [76], by screening for mutant phenotypes, but they were not recognized as a distinct class of regulatory genes until 2000. miRNAs are involved in various aspects of animal development and function as tumor suppressors and oncogenes, and their phenotypic signatures have been found in various studies during the past 13 years [4, 65, 77–83].

Herein, we describe the biogenesis of miRNAs in eukaryotic organisms, especially in humans. miRNAs are encoded at different loci in the human genome, in both genic and intergenic regions [84]. The biogenesis of miRNAs proceeds in four or five steps and takes place in both the nucleus and the cytoplasm (Fig. 2.1). Apart from some miRNAs residing in the introns of protein-coding genes, the process is as follows. Long primary miRNAs (pri-miRNAs) are transcribed from intra/intergenic genomic loci by RNA polymerase II (Pol II) in the nucleus. Primary miRNAs are capped and polyadenylated to maintain stability and then cleaved by an enzyme called Drosha and its co-factor Pasha to form approximately 70-nt hairpin precursors (pre-miRNAs). Some miRNA precursors are modified by enzymes such as TUTases [85]. The pre-miRNAs are transported into the cytoplasm by exportin-5 and subsequently cleaved by the RNase III enzyme, Dicer, to generate a 19-25 nt double-stranded duplex. This duplex is loaded into the RNA-induced silencing complex (RISC) [5]. The entire composition of RISC is not yet known, but the Argonaute 1-4 proteins (AGO1-4), along with the mature miRNA, have been shown to be the main contributors to gene silencing [86]. The mature miRNA guides RISC to recognize mRNA targets through partial base pairing with the 3’ UTRs of targets [6].

Finally, the miRNA is released and takes part in another round of regulation. The process for some miRNAs residing in introns (i.e., intronic miRNAs) is slightly different. This group of intronic miRNAs is processed from the spliced introns of their host genes. In this case, the introns fold into either long or short hairpin structures; in the latter case, they directly form the precursor miRNAs and bypass Drosha processing. This latter group is called mirtrons [87]. We discuss intronic miRNAs in Chapter 3.


[Figure: miRNA biogenesis. Caption embedded in the reproduced figure: An miRNA gene is transcribed, generally by RNA polymerase II (Pol II), generating the primary miRNA (pri-miRNA). In the nucleus, the RNase III endonuclease Drosha and the double-stranded RNA-binding domain (dsRBD) protein DGCR8/Pasha cleave the pri-miRNA to produce a 2-nt 3′ overhang containing the ∼70-nt precursor miRNA (pre-miRNA). Exportin-5 transports the pre-miRNA into the cytoplasm, where it is cleaved by another RNase III endonuclease, Dicer, together with the dsRBD protein TRBP/Loquacious, releasing the 2-nt 3′ overhang containing a ∼21-nt miRNA:miRNA∗ duplex. The miRNA strand is loaded into an Argonaute-containing RNA-induced silencing complex (RISC), whereas the miRNA∗ strand is typically degraded.]

Figure 2.1: (the figure copied from [88]) Primary miRNAs (pri-miRNAs) are transcribed from intra/intergenic genomic loci and cleaved by Drosha to form approximately 70-nt hairpin precursors (pre-miRNAs) that are subsequently cleaved by the RNase III enzyme, Dicer, to generate miRNA duplexes. One strand of the duplex, the mature miRNA, is loaded into the RNA-induced silencing complex (RISC) and guides it to recognize mRNA targets through partial base pairing with the 3’ UTRs of targets.

2.2.1 Mechanisms of miRNAs-mediated gene regulation

miRNA target recognition differs slightly between animals and plants. In plants, miRNAs cleave and degrade the mRNA target through nearly perfect Watson-Crick base pairing to the 3’ UTR region of the mRNA target. In animals, by contrast, miRNAs pair imperfectly with their targets, and the mechanism by which they interfere with gene expression is not well understood. Overall, two mechanisms for miRNA-mediated regulation of genes have been suggested: translation inhibition and mRNA degradation [31, 33, 35–37]. miRNAs can inhibit the translation of an mRNA into a protein at the initiation, elongation, or termination stage. Initially, it was thought that translational inhibition was the primary mode of miRNA regulation in animals and that miRNAs destabilize the mRNA target only if perfect complementarity occurs. Later, however, several independent studies showed that a significant portion of the reduction in the protein product (> 84%) is due to miRNA-induced changes at the transcript (mRNA) level [31, 36, 37]. Accordingly, in mammals, miRNAs primarily and dominantly repress the steady-state expression level of their targets. Several mechanisms for mRNA destabilization by miRNAs have been suggested. mRNA degradation is initiated by a shortening of the mRNA poly(A) tail, which eventually leads to mRNA deadenylation followed by decapping and subsequent exonucleolytic digestion [35]. mRNA degradation often takes place in P-bodies, which are enriched in enzymes involved in mRNA turnover [89]. One of the most abundant elements in P-bodies is a protein called GW182 [90], which recruits the deadenylase and decapping complexes and promotes mRNA destabilization. Deadenylation is the primary effect in post-transcriptional regulation of mRNAs; miRNAs have been known to promote mRNA destabilization [91]. In this process, GW182 interacts with Argonaute proteins, which together promote the recruitment of the CAF1-CCR4-NOT1 deadenylase complex, followed by decapping and exonucleolytic digestion [92]. miRNAs may also up-regulate the expression of a gene indirectly by targeting genes that down-regulate this gene.

2.3 Identification of miRNA targets

Perhaps the most challenging and important issue in the study of miRNAs is identi-

fying the bona fide mRNA targets in animals. The function of a miRNA is specified

by its targets. During the past decade, numerous efforts have been made to improve

miRNA target identification but relatively few mRNA targets have been experimentally

validated. There are several reasons for this incomplete specification of miRNA targets.


First, sequence complementarity between miRNAs and their targets is imperfect; the short

sequence of miRNAs (19-25 nt) as well as imperfect base pairing with their targets make

hundreds of genes candidate targets for each miRNA, many of which are false positives:

approximately 70% of known genes have predicted putative target sites. Also, it is unclear

how RISC elements are recruited and interact to silence the targets. Moreover, miRNA regulation

is often situation-, time-, or tissue-specific. As such, a gene might only be a functional

target of a miRNA at a specific time and in a specific tissue, even when there is sequence

complementarity between the target and the miRNA. Finally, many of the short RNA sequences reported

as miRNAs are not actually miRNAs.

There are a wide variety of experimental and bioinformatic methods to determine the

miRNA targets. In the following, we briefly describe these techniques.

2.3.1 miRNA over-expression experiments

In this method, miRNAs are transfected into cells and the changes in the expression levels of

transcripts are measured using mRNA expression profiling. The transcripts whose expression

levels significantly decrease after miRNA transfection are declared targets. One notable

breakthrough was the experiment conducted by Lim et al. [33], who showed that transfecting

a tissue-specific miRNA into HeLa cells shifts the expression profile towards that of the tissue

where the miRNA is preferentially expressed. miRNA transfection experiments have

subsequently been used extensively to evaluate the sequence features proposed for target

identification and to validate the functional targets predicted by computational methods

[7, 37, 93–95]. Exogenous miRNA transfection, however, perturbs the expression levels of

the targets of miRNAs endogenously expressed in the cell [39]. Thus, miRNA transfection

may cause up-regulation of endogenous miRNA targets, probably due to competition

for the limited number of RISCs, many of which are taken up by the transfected miRNA in the

cell under experiment [39].


2.3.2 miRNA knockdown (antagonism) experiments

In miRNA knockdown experiments, the expression of miRNAs is inhibited using different

strategies, and the transcripts that are subsequently significantly up-regulated are treated as

targets of the inhibited miRNAs [96]. One approach to inhibiting a miRNA is to use synthetic

miRNA targets, the so-called antimirs [97–99]. Antimirs are chemically modified,

single-stranded nucleic acids designed to specifically bind to and inhibit miRNAs. These

ready-to-use inhibitors can be introduced into cells using tr

Another approach is to interfere with RISC formation, a vital part of the miRNA regulation

machinery, and measure the change in the expression level of transcripts in a tissue in which

some miRNAs are highly expressed; the up-regulated transcripts are possibly functional

targets. The up-regulation signal in the target set in knockdown experiments is weak compared

to the down-regulation signal in over-expression experiments, making the latter a better

choice [98].

2.3.3 Prediction based on HITS/PAR-CLIP

High-throughput sequencing of RNA isolated by crosslinking immunoprecipitation

(HITS-CLIP) [100] and photoactivatable-ribonucleoside-enhanced crosslinking and

immunoprecipitation (PAR-CLIP) [101] have been applied to determine the binding sites of RISC

proteins, mainly AGO/EIF2C1-4. In HITS-CLIP, Argonaute-bound RNAs are isolated,

purified, and sequenced to identify sequence regions complementary to nucleotides 2-7

from the 5’ end of miRNAs (seed regions). Alternatively, the PAR-CLIP method provides

high-resolution crosslinking by incorporating photoreactive ribonucleoside analogs into in

vivo RNA transcripts. PAR-CLIP improves the separation of UV-crosslinked target RNA

segments from background when compared with solely CLIP-based methods. In contrast

to previous findings, PAR-CLIP data suggest that CDS regions are also highly enriched

for RBP sites associated with RISCs.

There are however technical difficulties associated with implementing these methods;


furthermore, limited data are available (target sites for 25 miRNAs), and surprisingly,

targets predicted using CLIP-based methods overlap poorly with the target sets predicted

by popular target prediction methods. In addition, CLIP-based experiments can only

identify target regions (about 100 nt) and not the precise position of miRNA binding

sites; moreover, CLIP assays can only be used in cell lines.

2.3.4 Target prediction using luciferase reporters

Luciferase-based vectors are commonly used to monitor the expression change of miRNA

targets [102, 103]. Luciferase vectors contain luciferase genes from Renilla or firefly, whose

products emit light. In this method, the 3’ UTR segment of a gene that includes miRNA target

sites is inserted between the luciferase coding sequence and the poly(A) signal. Next, the

luciferase vectors are transfected into a cell line, and luciferase activity is measured and

compared to that of an analogous reporter with a mutant 3’ UTR sequence. Luciferase

vectors have been used extensively to validate functional targets [7, 33]. Luciferase assays

are, however, costly and labour intensive and lack reproducibility between samples, making this

approach unlikely to be scalable to genome-wide determination of miRNA target sites

[104].

2.3.5 Measuring the protein output after miRNA over-expression or antagonism

Since miRNAs inhibit the translation of mRNA to protein, over-expression or loss of a

miRNA in a cell should decrease or increase the protein levels of its target mRNAs. To

quantify this impact, two concurrent studies measured the effect of miRNAs on

protein output [36, 37]. In these studies, miRNAs are over-expressed or knocked down

in cultured cells, and then stable isotope labeling by amino acids in cell culture (SILAC)

followed by mass spectrometry is used to measure protein levels. Together, these

works found that: (i) the genes of the most down-regulated proteins are enriched in

hepta-nucleotide motifs associated with the seed regions of the transfected miRNAs; (ii) the

opposite effect is observed when miRNAs are knocked down, but to a much lesser extent

than with over-expression; and (iii) the change in the expression level of mRNAs correlates well

with that of their proteins, suggesting that mRNA degradation may be the primary effect

of miRNAs on gene regulation. Proteomic experiments are, however, very expensive and

time consuming and only handle a small fraction of proteins at a time. Moreover, they

cannot be used to study the impact of miRNAs on non-coding transcripts.

2.4 Computational miRNA target prediction methods

In metazoa, miRNAs pair imperfectly with almost all known transcripts; partial base pairing

makes the identification of bona fide targets very difficult. The computational methods

mostly exploit the attributes identified using experimental methods to provide a genome-wide

prediction of the targets of all known miRNAs. Many of the computational methods

have applied the following determinants: perfect Watson-Crick base pairing with the

miRNA seed region (nucleotides 2-7 from the 5’ end of a miRNA) [6], AU composition

of the surrounding sequence [7], thermodynamic stability of binding sites [11], evolutionary

conservation of the seed [12, 13], accessibility of binding sites [5, 7, 15–17], target-site

abundance [18], seed-pairing stability [18], 3’ pairing contribution [7], and binding position

in the 3’ UTR [7]. In the following, we describe some of the most popular computational

methods that use the above determinants to predict miRNA targets.

2.4.1 TargetScan

The first version of TargetScan was introduced in a collaboration between the Bartel and Burge

labs at MIT in 2003 [6]. TargetScan has been updated frequently. Release 6.2, launched

in June 2012, predicts targets in nine mammalian species, including human [7, 12, 18].

In addition to target prediction, TargetScan provides a wide range of information and

options about miRNA and transcript sequences; here we focus on the target prediction

aspect of TargetScan. TargetScan predicts a protein-coding gene to be a target of a

miRNA if the 3’ UTR of the target harbors a conserved 7 mer or 8 mer motif that can

base pair with the seed region (nucleotides 2-7 from the 5’ end of a miRNA). Two types of 7

mer motifs are defined: those with an exact match to the seed plus position 8 of the miRNA,

and those with an exact match to the seed followed by an A. 8 mer motifs are those that

match the seed plus position 8 of the miRNA, followed by an A. The 7 mer and 8 mer motifs are

commonly called target sites. A target site is declared to be conserved if its conserved

branch length score is above a threshold as defined in [12]. In TargetScanS, a refinement

of TargetScan, the efficacy of a target site is specified by eight determinants. The weights

assigned to these scores are obtained from miRNA over-expression experiments, where each

score reflects the correlation between the log-fold change of down-regulated transcripts

after miRNA transfection and the presence or absence of the given determinant in the

down-regulated transcripts. The scores are as follows:

• Site type contribution, which determines the score for 7 mer and 8 mer motifs; 8 mer

motifs are assigned a higher score since transcripts containing an 8 mer are more

down-regulated than those with a 7 mer.

• The 3’ contribution: complementarity of the target sequence to a region outside

of the seed (especially nucleotides 13-17, in the 3’ part of the miRNA) improves the

down-regulation.

• Local A+U rich content: the sequences flanking the seed match in down-regulated

targets are enriched for A and U nucleotides.

• Target site position contribution: the farther a target site is from the centre of

the 3’ UTR, the more down-regulation is achieved.

• Target abundance: miRNAs whose target sites occur in many transcripts are

weak regulators because their effect on each target transcript is diluted.

• Seed pairing stability (SPS): the stability of a miRNA-target duplex determines the

efficacy of targeting; a weaker SPS decreases miRNA targeting.

• Conserved branch length for each site.

TargetScanS adds all individual scores except the conservation score and denotes the

aggregate as the context+ score. These determinants represent the sequence and location

characteristics of miRNA target sites. TargetScan scans approximately 18,000 genes

(30,000 transcripts) for conserved and non-conserved target sites matching the seeds of

about 1,200 miRNA families (1,500 individual miRNAs annotated by miRBase [105]).

TargetScan identifies half a million conserved target sites for all 1,500 miRNAs. Exploring

the interactions identified by TargetScan shows that each transcript harbors on average 25 target

sites and each miRNA targets on average 324 transcripts; at the extreme, miRNA hsa-miR-3163

has seed match target sites in 2575 transcripts and the gene TNRC6B contains

507 seed match sites.
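The 7 mer and 8 mer site definitions given in this subsection can be illustrated with a short sketch. It is not TargetScan's code: the function names, the site labels (8mer, 7mer-m8, 7mer-A1, as commonly used for these site types) and the example sequences are assumptions made for illustration, and conservation filtering and context+ scoring are omitted.

```python
# Minimal sketch (assumed names): classify seed-match sites in a 3' UTR.
# Sequences are RNA, written 5'->3'; miRNA position 1 is mirna[0].
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA string (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_sites(utr, mirna):
    """Return (start, site_type) for every seed match in the UTR.

    The seed is miRNA nucleotides 2-7. A match extended by pairing to
    position 8 is a 7mer-m8, a match followed by an A is a 7mer-A1, and
    a match with both features is an 8mer.
    """
    utr = utr.upper().replace("T", "U")
    mirna = mirna.upper().replace("T", "U")
    seed_match = reverse_complement(mirna[1:7])  # pairs with nt 2-7
    m8 = COMPLEMENT[mirna[7]]                    # pairs with nt 8
    sites = []
    start = utr.find(seed_match)
    while start != -1:
        has_m8 = start > 0 and utr[start - 1] == m8
        has_a1 = start + 6 < len(utr) and utr[start + 6] == "A"
        if has_m8 and has_a1:
            sites.append((start - 1, "8mer"))
        elif has_m8:
            sites.append((start - 1, "7mer-m8"))
        elif has_a1:
            sites.append((start, "7mer-A1"))
        # A bare 6-nt seed match is not reported as a target site here.
        start = utr.find(seed_match, start + 1)
    return sites


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"          # example miRNA sequence
    toy_utr = "GGGCUACCUCAGGGUACCUCAGGG"      # hypothetical UTR fragment
    print(find_sites(toy_utr, let7a))
```

Reporting each site once, with the 8 mer taking precedence over the two 7 mer types, mirrors the ordering of the site type contribution described above.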

2.4.2 Pictar

Pictar was developed by Rajewsky’s group in 2005 [106]. Analogous to TargetScan,

Pictar uses the 5’ end of the miRNA to identify targets but with minor differences.

Pictar defines the seed as a sequence of length 7 nt starting at position 1 or 2 of the

5’ end of the miRNA. In addition, imperfect base pairings are allowed between the miRNA

and regions in the 3’ UTR of the target (one deletion or one insertion). Pictar applies two

filters to all perfect and imperfect predicted target sites. The first filter retains all target

sites that are conserved across human, chimpanzee, mouse, rodent, chicken, and fish.

The second one filters out the target sites for which the free energy of the entire miRNA:mRNA

duplex is above a threshold. The perfect and imperfect target sites that pass these two


filters are assigned probability $p = 0.8$ and $\frac{1-p}{\#\ \text{of imperfect sites}}$, respectively. The target site

probabilities are then used in a hidden Markov model (HMM) to compute the posterior

probability that a 3’ UTR is generated by motifs complementary to the seeds of a set

of miRNAs. The states of the HMM are associated with miRNA binding sites and

background. In this way, Pictar scores reflect the combinatorial targeting of miRNAs. In

this regard, Pictar is, in essence, similar to Ahab, a method developed by the same group

to determine transcription factor binding sites [107]. Pictar has not been updated

since it was first introduced; it predicts 42,073 interactions among 6,108 mRNAs and

132 miRNAs, far fewer than TargetScan. Despite its limited coverage of the genome, Pictar

predicts approximately 27 percent of 1,129 experimentally validated targets, which places

it second after TargetScan based on our analysis. Pictar has many tuning parameters,

the sensitivity of its predictions to these parameters is unclear, and the software is not

publicly available.

2.4.3 miRSVR-miRanda

As explained earlier, TargetScan scores reflect the correlation between the presence of a miRNA

target site with particular attributes in the target sequence and the down-regulation

level of the target after over-expressing the miRNA. Inspired by this, miRSVR [8], a

prediction method developed by Leslie’s group in 2010, uses miRanda sequence determinants

(as input) and mRNA expression data after miRNA over-expression (as output) to

train a support vector regression model for prediction [108]. Given a set of determinants,

miRSVR then predicts the expected down-regulation of the targets. miRSVR uses

predicted targets identified by miRanda, an algorithm that uses dynamic programming

to score a 3’ UTR-miRNA duplex based on maximum complementarity alignment.

MiRanda applies the following sequence alignment scores: +5 for G:C and A:U pairs, +2 for

G:U wobble pairs, and -3 for mismatch pairs; the gap-open and gap-elongation parameters

are set to -8.0 and -2.0. Moreover, miRanda scales the scores of subsequences complementary to

the 5’ end of the miRNA by a factor of 2 to account for the importance of the seed match. Target

sites selected by dynamic programming that pass the free energy of duplex formation and

conservation filters are declared targets. The miRSVR analysis shows that some targets with

non-conserved, imperfectly complementary seed matches are significantly down-regulated in

the transfection assays; moreover, the authors showed that the set of experimentally validated

targets is assigned high scores by miRSVR. Although miRSVR claims that it borrows

its strength from the SVR model, Garcia et al. did not gain any performance improvement

when they replaced their simple regression model with an SVM-type model [18].

miRSVR scores 680,066 interactions among 17,467 mRNAs and 248 miRNAs; only 710 of

these interactions are experimentally validated, and the false positive rate is unclear.
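To make the miRanda pair scores quoted above concrete, the following sketch scores an ungapped miRNA:site duplex. It is a deliberate simplification and not the miRanda algorithm: the dynamic programming alignment and the gap penalties (-8.0 and -2.0) are left out, and the function name, the choice of miRNA positions 2-8 as the scaled 5’ region, and the decision to scale mismatches as well as matches are all assumptions made for illustration.

```python
# Minimal sketch (assumed names): ungapped duplex score with miRanda-like
# pair scores. Gaps and the full dynamic programming alignment are omitted.
PAIR_SCORES = {
    ("G", "C"): 5, ("C", "G"): 5,
    ("A", "U"): 5, ("U", "A"): 5,
    ("G", "U"): 2, ("U", "G"): 2,   # wobble pairs
}
MISMATCH_SCORE = -3


def ungapped_duplex_score(mirna, site_3to5, scaled_region=(1, 8), scale=2):
    """Score a duplex position by position.

    mirna is written 5'->3'. site_3to5 is the target segment written
    3'->5', so that site_3to5[i] faces mirna[i]. Positions with index in
    [scaled_region[0], scaled_region[1]) are multiplied by `scale`; this
    region (miRNA nt 2-8) is an assumption, not taken from the text.
    """
    if len(mirna) != len(site_3to5):
        raise ValueError("miRNA and site must have the same length")
    total = 0
    for i, pair in enumerate(zip(mirna, site_3to5)):
        score = PAIR_SCORES.get(pair, MISMATCH_SCORE)
        if scaled_region[0] <= i < scaled_region[1]:
            score *= scale
        total += score
    return total


if __name__ == "__main__":
    # Four perfectly paired positions; indices 1-3 fall in the scaled region.
    print(ungapped_duplex_score("GAUC", "CUAG"))
```

In a full implementation these per-position scores would feed a local alignment that also allows gaps, followed by the free energy and conservation filters described in the text.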

2.4.4 GenMiR

GenMiR was developed in the Morris and Frey labs in 2005 [51, 52, 109]. GenMiR integrates

matched mRNA-miRNA expression data into sequence-based prediction methods using

a probabilistic model. GenMiR computes the posterior probability that a target is bona

fide using the product of prior probability and likelihood. The prior probability is ini-

tially obtained from TargetScan predicted targets and learnt when fitting the model. The

likelihood is computed from the expression data using a linear model that relates the

change in the expression level of the target to those of targeting miRNAs predicted by

TargetScan. GenMiR uses the expectation-maximization algorithm to learn the parame-

ters of the model and to infer the posterior probabilities. Because computing the posterior

probability is intractable, GenMiR applies a variational Bayesian method to replace the

posterior probability with a simpler distribution. GenMiR++ is the latest version of

GenMiR, which includes more model parameters to account for the tissue specificity of

miRNAs. GenMiR is the first prediction method that takes into account the multiple tar-

geting effect of miRNAs when incorporating paired miRNA-mRNA expression data into

sequence prediction algorithms. This development had a distinct advantage over meth-


ods that use pairwise correlation between miRNA and mRNA expression vectors since

miRNAs have been shown to co-operatively regulate gene expression. GenMiR however

needs a large number of matched miRNA-mRNA expression data sets to attain an ac-

curate prediction. Moreover since GenMiR applies variational inference, the posterior

probability is simplified to a uni-modal probability which may not capture the variation

in the actual posterior. GenMiR provides high confidence scores for approximately 1,500

mRNAs when applied to 104 miRNAs and 88 paired expression profiles. GenMiR successfully

predicts experimentally validated targets of the let-7 family, one of the best-studied

miRNA families.

2.4.5 HOCTAR

About half of the discovered miRNAs reside in the introns of protein-coding genes;

these are the so-called intronic miRNAs [110]. A number of mRNA-miRNA expression profiling

experiments have shown that the expression levels of some intronic miRNAs are positively

correlated with those of their host genes, suggesting that they may share the same

transcriptional elements and hence are co-expressed [43, 45, 64, 67]. HOCTAR, developed in

Banfi’s lab in 2009, scores the target genes of intronic miRNAs based on the anti-correlation

between the expression vectors of the host and target genes of intronic miRNAs. HOCTAR

works as follows. HOCTAR chooses 160 mRNA expression microarray data sets from the

Affymetrix HG-U133A platform. For each intronic miRNA, in each data set, HOCTAR measures

the correlation coefficients between the expression vector of the host gene and those of the

putative targets of the intronic miRNA. The putative targets of the intronic miRNA are

obtained by taking the union of the targets predicted by TargetScan, miRanda, and PicTar.

HOCTAR then selects the top 3 percent of the most negatively correlated targets in each

data set. This process is repeated for all 160 data sets, and each target is assigned a score

based on its cumulative occurrence in the selected targets (top 3 percent) across data

sets. As a validation test, the authors showed that the top half of the ranked list of predicted

targets is highly enriched for experimentally validated targets. Although

using the host gene expression levels as surrogates for the expression levels of intronic

miRNAs is a remarkable novelty of HOCTAR, HOCTAR has several shortcomings. HOCTAR

assumes that all intronic miRNAs are co-expressed with their host genes, whereas recent

studies have shown that only 20-40% of intronic miRNAs are [111–118]. In addition, HOCTAR

computes the correlation coefficients between the individual probes of a host and a target

gene; as such, a host-target pair may have several different correlation coefficients; HOCTAR then

selects the probe with the most negative coefficient and ignores the others. Using the mean

or median of the probe expression vectors of a gene, which is common practice, may better

reflect the actual correlation between the expression levels of host and target genes.

Furthermore, the threshold used in HOCTAR (top 3 percent) is not statistically defined.

Lastly, HOCTAR pools together the target sets predicted by three different methods that

have been shown to overlap little, which can potentially increase the false positive rate [20].
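The scoring procedure described above can be sketched in a few lines. This is an illustration of the idea rather than the HOCTAR implementation: the function name, the data layout (one expression vector per gene and data set, for example after averaging probes) and the rule of rounding the 3 percent cut-off up to at least one gene are assumptions.

```python
# Minimal sketch (assumed names and data layout) of HOCTAR-like scoring.
import math

import numpy as np


def hoctar_like_scores(datasets, host, targets, top_fraction=0.03):
    """Count how often each putative target is among the most
    anti-correlated targets of the host gene.

    datasets: list of dicts mapping gene name -> 1-D expression vector
              (all vectors within one data set share the same samples).
    host:     name of the host gene of the intronic miRNA.
    targets:  putative targets of the intronic miRNA.
    """
    counts = {t: 0 for t in targets}
    for data in datasets:
        if host not in data:
            continue
        present = [t for t in targets if t in data]
        if not present:
            continue
        # Pearson correlation between the host gene and each target.
        r = {t: np.corrcoef(data[host], data[t])[0, 1] for t in present}
        # Keep the most negatively correlated fraction (at least one gene).
        n_top = max(1, math.ceil(top_fraction * len(present)))
        for t in sorted(present, key=lambda g: r[g])[:n_top]:
            counts[t] += 1
    return counts


if __name__ == "__main__":
    toy = {
        "HOST": [1.0, 2.0, 3.0, 4.0, 5.0],
        "A": [5.0, 4.0, 3.0, 2.0, 1.0],   # anti-correlated with HOST
        "B": [1.0, 2.0, 3.0, 4.0, 6.0],   # positively correlated
    }
    print(hoctar_like_scores([toy, toy, toy], "HOST", ["A", "B"]))
```

A target that is repeatedly among the most anti-correlated genes accumulates a high count, which is the cumulative occurrence score referred to in the text.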

2.4.6 COMETA

COMETA, developed by Banfi’s group in 2012, is a prediction method that applies a procedure

similar to HOCTAR but is not limited to intronic miRNAs [55]. COMETA works

on the assumption that the functional targets of a miRNA tend to be co-expressed,

and identifies co-expressed target sets as follows. Similar to HOCTAR, COMETA pools

together the targets predicted by TargetScan, miRanda, and Pictar and uses 217 microarray

gene expression data sets to determine the co-expressed targets. For each target gene

of a given miRNA and in each data set, COMETA computes Pearson correlation coeffi-

cients between the target and all other genes on the assay and generates a rank list based

on these coefficients. This process is repeated across all data sets and each gene is as-

signed a score based on its cumulative occurrence on the top third percentile of the ranked

lists of the targets. Carrying out this procedure for all targets of a selected miRNA we

obtain co-expression lists consisting of co-expression scores of all genes with significant


positive correlation with the targets. Finally, COMETA averages the co-expression scores

of a gene across co-expression lists to obtain a co-rank list of targets for each miRNA.

The authors showed that experimentally validated targets rank significantly above the

median in the ranked lists and concluded that co-expressed targets are possibly functional.

They also observed that, when the target sets are grouped into two clusters using hierarchical

clustering of co-expression scores, the co-rank list of one cluster is significantly different from

a co-rank list obtained from a same-size random subset of targets. They showed that genes in

this cluster are more down-regulated than those in the other cluster after over-expressing

miRNA-26 and miRNA-98. Using this finding, they built target gene networks for 755

miRNAs and showed that some of these networks are enriched in certain biological processes.

Analogous to HOCTAR, when making the co-expression rank list, COMETA considers

the probe with the highest correlation among multiple probe sets representing the same

gene. Moreover, when using the union of targets predicted by three prediction methods, a

miRNA target list contains a large number of genes (of the order of 300-1000); within such a set,

it is highly likely to find co-expressed genes that participate in some biological process.

For instance, within a set of 1000 co-expressed genes, 500 may be targets of a miRNA;

this does not imply that these 500 genes are co-expressed because they are miRNA targets.

Finally, when performing hierarchical clustering, it is unclear why the targets are always

clustered into two groups and why one group has a more consistent co-rank list than the

other.

2.4.7 Sylamer

Sylamer is a prediction method that uses the hypergeometric test to identify if a miRNA

seed match is overrepresented in the top/bottom of a gene list ranked based on their

expression levels after over-expressing/knocking down the miRNA [54]. In other words,

Sylamer is a systematic approach to identify whether the down/up-regulated genes are

enriched for target sites matching the seed region of a given miRNA after transfecting

the miRNA into a cell line and profiling the mRNAs. Sylamer works as follows. Let $N$

denote the number of genes ranked based on their expression levels in a miRNA

over-expression experiment. Let $M_i$ denote the number of genes whose expression level is less

than an incremental cutoff $i \times T$, where $i = 1, 2, \ldots$ is incremented until $M_i$ reaches $N$. Let

$S_N$ and $S_{M_i}$ denote the number of miRNA seed matches (i.e. motifs in the 3’ UTRs of genes

complementary to the seed of a given miRNA) in all ($N$) and selected ($M_i$) genes,

respectively. For each $i$, Sylamer computes a P-value $P_i$ using a hypergeometric test with

input parameters $N$, $M_i$, $S_N$, and $S_{M_i}$ to identify whether the $S_{M_i}$ seed matches are significantly

over-represented in the set of $M_i$ genes compared to the $S_N$ seed matches present in the $N$ genes.

Finally, Sylamer generates a curve from the computed $P_i$ values and searches for a peak on the

curve. The occurrence of a peak at the top of the ranked gene list implies that the most

down-regulated target sequences are significantly enriched for the seed match and that

they may therefore be functional targets of the over-expressed miRNA. The procedure for miRNA

knockdown experiments is the same, but the genes are ranked in descending order based

on their expression levels. Sylamer enrichment plots confirmed the overrepresentation

of motifs complementary to the seeds of miR-155 and miR-430 in the down-regulated genes

after miR-155 and miR-430 over-expression, respectively. Sylamer usage is limited to

miRNA transfection experiments, in which the common method for identifying targets

is to compare the cumulative distribution of expression levels of genes harboring the seed

match with that of all other genes [7]. Whether Sylamer outperforms methods based on

cumulative distribution comparison is unclear. In addition, Sylamer does not propose how to

recognize a significant peak other than by visually inspecting the enrichment plot.
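The enrichment curve described above can be sketched as follows. This is not the Sylamer software: for simplicity the sketch counts genes that contain at least one seed match rather than individual seed matches, and it steps through the ranked list in fixed rank increments instead of expression cutoffs $i \times T$. The function names and the toy input are assumptions.

```python
# Minimal sketch (assumed names) of a hypergeometric enrichment curve
# over a ranked gene list, using only the standard library.
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` genes are taken without replacement from a
    population containing `successes` genes with a seed match."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(k, upper + 1)
    )
    return favourable / total


def enrichment_curve(has_seed_match, step):
    """has_seed_match: list of booleans, one per gene, ordered from the
    most down-regulated gene to the least. Returns (M_i, P_i) pairs."""
    n_genes = len(has_seed_match)           # N
    s_n = sum(has_seed_match)               # S_N (genes with a match)
    curve = []
    for m_i in range(step, n_genes + 1, step):
        s_mi = sum(has_seed_match[:m_i])    # S_{M_i}
        curve.append((m_i, hypergeom_upper_tail(s_mi, n_genes, s_n, m_i)))
    return curve


if __name__ == "__main__":
    # Toy ranking: all five seed-match genes are the most down-regulated.
    ranked = [True] * 5 + [False] * 15
    for m_i, p_i in enrichment_curve(ranked, step=5):
        print(m_i, p_i)
```

In this toy ranking the smallest P-value occurs at the first cutoff, which corresponds to a peak at the top of the ranked list when the curve is plotted on a negative logarithmic scale.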

2.4.8 MIRZA

Given a set of mRNA fragments cross-linked in Ago-CLIP experiments and a set of miRNA

sequences, MIRZA models the mRNA-miRNA hybrid structures and estimates the model

parameters. MIRZA infers the model parameters by maximizing the binding probability


of mRNA fragments in Ago-CLIP data [9]. For each miRNA $\mu$ and mRNA fragment

$m$, MIRZA defines the target quality $R(m|\mu)$ as the ratio of the probability that $m$ is bound to $\mu$

among other target sites to the abundance of $m$. MIRZA links $R(m|\mu)$ to $E(\mu, m, \sigma)$,

defined as the free binding energy of RISC-loaded $\mu$ bound to $m$ where the hybrid has the

configuration $\sigma$. $E(\mu, m, \sigma)$ consists of two parts: (i) $E_{\mathrm{str}}(\sigma)$, which depends on the

hybrid structure, such as the energy of a symmetric loop, and (ii) the sum of the energies of each

hybridized pair. To infer the energy parameters, MIRZA maximizes the likelihood with

respect to these parameters; the likelihood is defined as $\prod_i \sum_\mu R(m_i|\mu)\,\pi_\mu$, where $\pi_\mu$ is the

probability that a bound RISC loads $\mu$. The model fit by MIRZA shows the highest

tendency for hybridization at positions 2-7 from the 5’ end of the miRNA (seed match),

followed by positions 13-16 and 18-19; position 9 is not supported for hybridization by

MIRZA since it opens a symmetric loop of length 3 nucleotides followed by a bulge.

The authors used 2,988 51-nucleotide-long mRNA fragments that were cross-linked in Ago-CLIP

data. Using the estimated parameters, they predicted the miRNA $\mu$ that binds to $m$ by

maximizing the energy term $E(\mu, m, \sigma)$. They showed that MIRZA predicts targets as well as

other popular methods when evaluated on miRNA-induced log-fold changes in transfection data.

MIRZA predicts that many non-canonical target sites might be effective and that the efficacy of

miRNA targeting depends on miRNA expression levels, i.e. low expression needs a perfect

seed match whereas high expression does not. MIRZA is in fact a special case of the pair HMM

designed for sequence alignment [119]. The hybrid structure predicted by MIRZA was

previously addressed by other groups, and the finding that non-canonical sites might be

effective was observed in miRSVR [7, 8]. Nonetheless, MIRZA is the first method that

estimates the energy parameters of the miRNA-mRNA hybrid using a probabilistic approach.


2.5 Conclusions

Experimental miRNA target prediction approaches are unable to provide genome-wide

prediction of miRNA targeting. As the number of identified miRNAs grows, using experimental

approaches becomes more limiting, since these methods are costly, time consuming,

and not comprehensive. Bioinformatic methods, on the other hand, can provide a genome-wide

prediction of miRNA targets. During the past decade many miRNA target prediction

methods have been developed. The vast majority of these methods use sequence deter-

minants to predict the target genes of miRNAs. Many performance evaluation studies

have shown that current sequence features alone cannot provide accurate prediction of

miRNA targeting. Using mRNA and miRNA expression data can supplement the se-

quence features to obtain more accurate prediction. Unfortunately, not all miRNAs are

profiled, and even with the advent of high-throughput sequencing techniques, accurately measuring

the abundance of mature miRNAs remains a challenge. On the other hand, mRNA

expression data are abundant, less noisy and available for a wide range of biological

samples. Therefore, augmenting sequence based determinants with mRNA expression

data is a promising notion that can improve prediction of miRNA targets. In this the-

sis, we devise computational miRNA target prediction methods that incorporate mRNA

expression data into sequence prediction methods. We show that our proposed methods

provide better predictive estimates than those reported by the state-of-the-art target

prediction methods. Almost all popular miRNA target prediction methods score the

strength of a miRNA-mRNA pair using sequence evidence. Although these methods

show that these scores correlate with the down-regulation of targets in miRNA transfection

experiments, this does not necessarily imply the down-regulation of targets in vivo

estimated from mRNA expression data. We, on the other hand, devise computational

methods that score miRNA-mRNA pairs based on the down-regulation impact of miRNAs

in vivo. In addition, our scoring strategies include a large number of samples, in contrast

to miRNA transfection scoring methods, which are limited to a small number of miRNAs

and biological conditions. One of the areas that has not been explored in miRNA

target prediction studies is devising computational methods that measure the impact of

miRNAs on non-protein-coding genes, especially long non-coding RNAs. In this thesis,

we also try to predict the lncRNAs that are targeted by miRNAs using mRNA and

lncRNA expression data sets.

Chapter 3

Intronic miRNAs and prediction of their targets

3.1 Intronic miRNAs

Approximately half of mammalian miRNAs are hosted within the introns of protein-

coding genes, so it may be possible to predict the targets of some of these intronic miR-

NAs without having to measure their expression level. Indeed, many intronic miRNAs

appear to lack their own promoters and are processed out of introns [43, 45, 63–72]. Estimates

for the proportion of intronic miRNAs whose expression profiles are significantly

correlated with their host gene vary between 34% (25/74 [43]) and 71% (22/31 [67]). If

these co-expression relationships can be detected without having to measure the miRNA

expression, then host gene expression levels can be used as a surrogate for the miRNA

levels when doing target prediction [56]. There are substantial advantages to doing this.

First, host gene expression levels are measured at the same time and on the same plat-

form as the target gene expression levels, thus removing the need to model platform and

laboratory-based effects. Also, there are hundreds of suitable Gene Expression Omnibus

datasets for well-studied model organisms that can be used for target prediction, thus


adding considerable statistical power to any target predictions.

However, not all host gene expression profiles are useful for predicting the targets of

their intronic miRNAs. Some of these intronic miRNAs show evidence of having their own

promoter [111–118]. For example, two independent studies found putative promoters for

one-third of intronic miRNAs [111, 112]. Furthermore, host gene mRNAs may themselves

be under post-transcriptional regulation by other miRNAs. As such, it is important to

distinguish host genes with expression profiles that are good surrogates for those of their

intronic miRNAs from those that are not.

In this chapter, we propose a new method that both identifies intronic miRNAs

whose host gene’s expression provide good surrogates for their expression level as well

as predicting the mRNA targets of these miRNAs. Our method takes as input a set of

potential miRNA target sites based on sequence comparisons and then among these sites

it identifies those likely to be functional sites based on the degree to which host gene’s

expression is predictive of down-regulation of the mRNA. When predicting regulators

of a particular mRNA, we consider the combined effect of all of its potential regulators

because most mRNAs are regulated by multiple miRNAs [20, 51, 106, 109, 120]. Our

method can use any mRNA expression profiles; here, however, we use 140 gene expression

data series chosen for their size and their use of the same microarray platform. We

distinguish between good and bad host gene surrogates based on the proportion of their

hosted miRNA’s potential targets that we predict to be functional. Host genes that we

deem to be bad surrogates based on this test have more predicted Pol II/III promoters

in their introns as well as more predicted miRNA binding sites in their 3’ UTRs.

3.2 Method: InMiR

We modeled the change of an mRNA’s expression level in a sample by a linear combina-

tion of the host gene expression levels of a subset of the miRNAs with potential target


sites in the 3’ UTR of the mRNA. We distinguished the functional and non-functional

target sites by fitting this linear model to expression profiling data from a large number of

studies and then examining the distributions of weights assigned to each potential miRNA

regulator.

This linear modeling approach differs from previous ones [51, 106, 109] in a number

of important aspects. First, we use host gene expression levels as surrogates for miRNA

expression levels. Also, we predict functional and non-functional sites by integrating

evidence from multiple profiling studies rather than a single study. This change allows us

to employ a much simpler linear model for each individual dataset because we need not

rely upon prior assumptions to detect statistical signals of regulation. The parameters

of our model can be easily estimated using ordinary least squares linear regression. In

the following, we describe our methodology and obtained results in detail.

3.2.1 Computing weights for putative miRNA regulators on in-

dividual datasets

Table 3.1: Description of the symbols used in this chapter

Symbol                    Description
$g$                       gene index
$k$                       miRNA index
$i$                       dataset index
$G$                       # of target genes
$K_g$                     # of putative targeting miRNAs for gene $g$
$T$                       # of samples
$\mathbf{n}^i$            noise vector corresponding to dataset $i$
$\mathbf{x}^i_g$          expression of gene $g$ in dataset $i$
$\mathbf{H}^i_g$          a matrix containing the expressions of host genes in dataset $i$
$\mathbf{h}^i_{kg}$       expression of the gene hosting miRNA $k$ that targets gene $g$ in dataset $i$
$\Delta\mathbf{x}^i_g$    change in expression level of gene $g$ in dataset $i$
$\mathbf{w}^i_g$          regulatory weights of miRNAs targeting gene $g$ in dataset $i$

Our linear model is as follows. Given $N$ gene expression datasets $D_i$, $i = 1, \ldots, N$ (see materials and Table S1), let $\Delta\mathbf{x}^i_g = \{\Delta x^i_{tg}\}_{t=1}^{T}$ denote a $T$-element vector whose elements correspond to the decrease in the expression level of the $g$th target gene over $T$ samples in the $i$th dataset. We model this vector as a linear function of $K_g$ intronic miRNAs whose host gene expression levels are denoted by $\mathbf{h}^i_{kg} = \{h^i_{tkg}\}_{t=1}^{T}$, $k = 1, \ldots, K_g$. These intronic miRNAs represent putative regulators of the mRNA identified by a sequence-based miRNA target prediction algorithm, such as TargetScan. Based on the above assumptions and definitions, we build the following model:

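$$
\underbrace{\begin{bmatrix} \Delta x^i_{1g} \\ \Delta x^i_{2g} \\ \vdots \\ \Delta x^i_{Tg} \end{bmatrix}}_{\text{target gene}}
=
\underbrace{
w^i_{1g}\begin{bmatrix} h^i_{11g} \\ h^i_{21g} \\ \vdots \\ h^i_{T1g} \end{bmatrix}
+ w^i_{2g}\begin{bmatrix} h^i_{12g} \\ h^i_{22g} \\ \vdots \\ h^i_{T2g} \end{bmatrix}
+ \cdots
+ w^i_{K_g g}\begin{bmatrix} h^i_{1K_g g} \\ h^i_{2K_g g} \\ \vdots \\ h^i_{TK_g g} \end{bmatrix}
}_{\text{the contribution of the intronic miRNAs}}
+
\underbrace{\begin{bmatrix} n^i_1 \\ n^i_2 \\ \vdots \\ n^i_T \end{bmatrix}}_{\text{noise}}
\tag{3.1}
$$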

where $w^i_{kg}$, $k = 1, \ldots, K_g$, is a weight that represents the contribution of the $k$th intronic miRNA in regulating the target gene $g$, and $\mathbf{n}^i = \{n^i_t\}_{t=1}^{T}$ represents modeling error or noise. Typically, we cannot measure $\Delta x^i_{tg}$ directly, so we approximate it by the difference between the mean mRNA expression level in the sample and the measured level $x^i_{tg}$, i.e., $\Delta x^i_{tg} = -\left(x^i_{tg} - \frac{1}{G}\sum_{g'=1}^{G} x^i_{tg'}\right)$, where $G$ denotes the number of genes in the dataset.

We also assume that the noise vector is sampled from a multivariate Gaussian distribution

whose covariance matrix is proportional to the identity matrix, i.e., is spherical. Equation

(3.1) can be written in matrix-vector notation as

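$$
\Delta\mathbf{x}^i_g = \mathbf{H}^i_g\mathbf{w}^i_g + \mathbf{n}^i, \qquad i = 1, \ldots, N \tag{3.2}
$$

in which $\mathbf{H}^i_g = [\mathbf{h}^i_{1g}\ \mathbf{h}^i_{2g}\ \ldots\ \mathbf{h}^i_{K_g g}]$ denotes the expression data of $K_g$ host genes over $T$ samples.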

In the model, a positive weight $w^i_{kg}$ indicates that host gene $k$ contributes to decreasing the expression level of the target gene $g$, i.e., it makes a positive contribution to $\Delta\mathbf{x}^i_g$. Analogously, a negative weight $w^i_{kg}$ indicates that host gene $k$ contributes to increasing the expression level of the target gene $g$. We call this the unconstrained linear model (ULM) to distinguish it from previous models [51, 109] that constrain the weights $\mathbf{w}^i_g$ to be positive, thereby insisting that miRNAs act only to down-regulate the expression of their target genes. We relax this constraint for convenience because doing so simplifies the fitting procedure without impacting the predictions of the model. In this chapter, we focus on the down-regulation role of miRNAs because only a few miRNAs have been reported to up-regulate target gene expression [121, 122].

Under these assumptions, we can estimate $\mathbf{w}^i_g$ using ordinary least squares linear regression; that is, we minimize the sum of squared errors between the reconstruction of the mRNA down-regulation profile based on the miRNA estimates and the observed one:

$$
\mathbf{w}^i_g = \arg\min_{\mathbf{w}^i_g}\ (\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g)^\top(\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g) \tag{3.3}
$$

where $\top$ denotes the matrix transpose operation. Note that the solution to equation (3.3) corresponds to the maximum likelihood estimate of $\mathbf{w}^i_g$, which is given by

$$
\mathbf{w}^i_g = \arg\max_{\mathbf{w}^i_g}\ p(\Delta\mathbf{x}^i_g \mid \mathbf{w}^i_g, \mathbf{H}^i_g). \tag{3.4}
$$

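The noise vector $\mathbf{n}^i$ is modeled as zero-mean white Gaussian noise of the form

$$
p_n(\mathbf{n}^i) = \mathcal{N}(\mathbf{n}^i; \mathbf{0}, \Sigma_n) = \frac{1}{|2\pi\Sigma_n|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{n}^i)^\top \Sigma_n^{-1}\mathbf{n}^i\right). \tag{3.5}
$$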

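If we assume that the noise process has a diagonal covariance matrix of the form $\Sigma_n = \sigma^2\mathbf{I}$, where $\mathbf{I}$ denotes the identity matrix, then the likelihood function is given by

$$
p(\Delta\mathbf{x}^i_g \mid \mathbf{w}^i_g, \mathbf{H}^i_g) = \frac{1}{(2\pi\sigma^2)^{T/2}} \exp\left(-\frac{1}{2\sigma^2}(\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g)^\top(\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g)\right). \tag{3.6}
$$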

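Thus, maximizing the log of $p(\Delta\mathbf{x}^i_g \mid \mathbf{w}^i_g, \mathbf{H}^i_g)$ is equivalent to solving

$$
\mathbf{w}^i_g = \arg\min_{\mathbf{w}^i_g}\ (\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g)^\top(\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g) \tag{3.7}
$$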
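For reference, the least squares problem above has a standard closed-form solution. The short derivation below is a sketch that assumes the host gene matrix has full column rank (which requires $T \ge K_g$); the symbols $J$ and $\hat{\mathbf{w}}^i_g$ are introduced here only for illustration and are not part of the original notation.

```latex
\begin{align*}
J(\mathbf{w}^i_g) &= (\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g)^\top(\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g), \\
\nabla_{\mathbf{w}^i_g} J &= -2\,(\mathbf{H}^i_g)^\top(\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g) = \mathbf{0}, \\
\hat{\mathbf{w}}^i_g &= \left((\mathbf{H}^i_g)^\top\mathbf{H}^i_g\right)^{-1}(\mathbf{H}^i_g)^\top\Delta\mathbf{x}^i_g .
\end{align*}
```

Because the noise variance $\sigma^2$ only scales the exponent of the Gaussian likelihood, it does not affect the location of the optimum, so the same estimate maximizes the likelihood.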

We solved (3.3) individually in each dataset to obtain $N$ weight vectors $\mathbf{w}^i_g$ for the target gene $g$. In order to be able to compare weights across datasets, we rescaled the weights for each mRNA within each dataset by dividing each element in $\mathbf{w}^i_g$ by the sum of the absolute values of its elements, i.e., $\sum_{k=1}^{K_g} |w^i_{kg}|$, thus ensuring that $-1 \le w^i_{kg} \le 1$, $\forall i, k$. In the next section we describe how we combine weights from multiple datasets to make a single prediction for each putative miRNA and mRNA interaction.
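The per-dataset fit and rescaling described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the thesis (which relied on MATLAB); the function name `fit_rescaled_weights`, the array layout, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def fit_rescaled_weights(x_target, x_all_genes, host_expr):
    """Fit one target gene in one dataset and rescale the weights.

    x_target    : (T,) expression of the target gene over T samples
    x_all_genes : (G, T) expression of all genes, used for the per-sample mean
    host_expr   : (T, K) host gene expression, one column per putative regulator
    """
    # Decrease in expression: per-sample mean over genes minus the target level.
    delta_x = x_all_genes.mean(axis=0) - x_target
    # Ordinary least squares: minimise ||delta_x - host_expr @ w||^2.
    w, *_ = np.linalg.lstsq(host_expr, delta_x, rcond=None)
    # Divide by the sum of absolute weights so that every entry lies in [-1, 1].
    total = np.abs(w).sum()
    return w / total if total > 0 else w


# Synthetic check (all values are made up for illustration).
rng = np.random.default_rng(0)
T, K, G = 40, 3, 50
host_expr = rng.normal(size=(T, K))
x_all_genes = rng.normal(size=(G, T))
true_w = np.array([2.0, -1.0, 0.5])
# Build a noise-free target whose decrease is exactly host_expr @ true_w.
x_target = x_all_genes.mean(axis=0) - host_expr @ true_w
weights = fit_rescaled_weights(x_target, x_all_genes, host_expr)
```

In this noise-free setting the recovered weights equal `true_w` divided by the sum of its absolute values, which makes the effect of the rescaling step easy to inspect.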

3.2.2 Mapping host gene weights to miRNA weights

Our model uses host gene expression as a surrogate for the expression level(s) of its

intronic miRNAs. This requires us to resolve some of the host gene / intronic miRNA

relationships that are not one-to-one, because some host genes contain multiple intronic

miRNAs and some intronic miRNAs are duplicated in more than one host gene. Fig. 3.1

shows a directed acyclic graph (DAG) representing these relationships for eight intronic

miRNAs that are possible regulators for the expression of gene LSM12 whose protein

product accumulates in stress granules [123]. This DAG can be interpreted as a graphical

model in which the expression patterns of intronic miRNAs are hidden. Because our goal

is not only to predict miRNA targets but also to determine which host genes are good

surrogates for their intronic miRNAs, we assign weights directly to host genes rather

than miRNAs. So, the host genes of duplicated miRNAs get separate weights. Also, when a host gene contains more than one intronic miRNA with putative targets in a given mRNA, we assign this host gene weight to each of these miRNAs. The host gene / target mRNA model that we fit for LSM12 after making these adjustments is shown in Fig. 3.2.

Figure 3.1: Interaction between hosts, targets, and intronic miRNAs using a DAG. A directed acyclic graph (DAG) that represents interactions between host genes, intronic miRNAs, and the target. The top nodes represent the host genes. The middle layer represents the intronic miRNAs located in the introns of the host genes in the first layer. The bottom layer denotes the target gene. In this DAG, the gene LSM12 is targeted by the intronic miRNAs miR-19a, miR-19b, miR-26a, miR-26b, miR-27b, miR-214, miR-340, and miR-874, which are located in the introns of CTDSP2, CTDSPL, MIRHG1, CTDSP1, C9orf3, RNF130, DNM3, and KLHL3.

Figure 3.2: The simplified DAG of Fig. 3.1, in which host genes have a direct interaction with the target.
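The bookkeeping described in Section 3.2.2 (one weight per host gene, shared by every intronic miRNA of that host, with duplicated miRNAs kept separate per host) can be illustrated with plain dictionaries. The weights below are invented for the example, and the pairing of MIRHG1 with miR-19a and miR-19b is an assumption inferred from the miRNA and host lists in Fig. 3.1.

```python
# Hypothetical fitted weights for one target gene in one dataset.
host_weights = {"CTDSP1": 0.31, "CTDSP2": 0.22, "CTDSPL": 0.18, "MIRHG1": -0.05}

# Host gene -> intronic miRNAs with putative sites in the target's 3' UTR.
# miR-26a is duplicated: one copy in CTDSP2 and one in CTDSPL.
host_to_mirnas = {
    "CTDSP1": ["miR-26b"],
    "CTDSP2": ["miR-26a"],
    "CTDSPL": ["miR-26a"],
    "MIRHG1": ["miR-19a", "miR-19b"],  # assumed pairing, for illustration
}


def weights_per_mirna_copy(host_weights, host_to_mirnas):
    """List (host, miRNA, weight) triples; a host's weight is shared by all
    of its intronic miRNAs, and duplicated miRNAs keep one entry per host."""
    return [
        (host, mirna, weight)
        for host, weight in host_weights.items()
        for mirna in host_to_mirnas.get(host, [])
    ]


triples = weights_per_mirna_copy(host_weights, host_to_mirnas)
```

Keeping the host gene in each triple preserves the information needed later to judge whether a particular host is a good surrogate.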

3.2.3 Combining multiple datasets to predict functional targets

We make our predictions of functional targets by comparing the distribution of weights

assigned to a host gene / mRNA pair across the datasets to a distribution in which the


association between host genes and their expression profiles is randomized. Specifically,

we generate a null distribution of weights by permuting the labels of the host genes

and re-calculating the weights for all putative pairs in every dataset. All of the weights

calculated during this process comprise the empirical null distribution. Then for each

host gene / mRNA pair, we compare the distribution of weights for this pair against this

null distribution by calculating the two-sided Wilcoxon-Mann-Whitney (WMW) rank-sum

P-value; we call this value Pkg for the k-th host gene and the g-th mRNA. We also record

whether the mean of the distribution of real weights for a given pair is larger or smaller

than the mean of the null distribution. The means of the weight distributions that are

larger than random reflect a prediction by our model that a miRNA associated with the

host gene is down-regulating the target mRNA. As we will describe later, we use host

gene / mRNA pairs whose weights are smaller than random when distinguishing good

and bad host gene surrogates.
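A minimal sketch of how real and permuted weights might be generated is given below. It assumes that the permutation shuffles the sample order (rows) of the host gene matrix, as stated in the procedure of Table 3.2; shuffling host gene labels across the whole dataset would be an alternative reading of the text. The function name and the simulated data are hypothetical.

```python
import numpy as np


def real_and_permuted_weights(delta_x, host_expr, rng):
    """Return (w, q): OLS weights from the real and the row-permuted host matrix."""
    w, *_ = np.linalg.lstsq(host_expr, delta_x, rcond=None)
    # Shuffling the rows breaks the sample-wise association between the host
    # profiles and the target while preserving each column's values.
    shuffled = host_expr[rng.permutation(host_expr.shape[0]), :]
    q, *_ = np.linalg.lstsq(shuffled, delta_x, rcond=None)
    return w, q


# Simulated example: regulator 0 truly drives the decrease in the target.
rng = np.random.default_rng(1)
n_datasets, T, K = 20, 30, 3
real, null = [], []
for _ in range(n_datasets):
    host_expr = rng.normal(size=(T, K))
    delta_x = host_expr[:, 0] + 0.5 * rng.normal(size=T)
    w, q = real_and_permuted_weights(delta_x, host_expr, rng)
    real.append(w)
    null.extend(q)  # permuted weights from all regulators are pooled
real = np.array(real)   # shape (n_datasets, K)
null = np.array(null)   # shape (n_datasets * K,)
```

In this simulation the weights of the true regulator sit well above the pooled null, which is the kind of shift the rank-sum comparison is meant to detect.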

We interpret Pkg as an enrichment measure and determine a cutoff value, for both

positive and negative enrichment, by comparing it to P-values calculated for host gene /

mRNA pairs that are unlikely to interact. We generated P-values for these likely negative

examples by calculating a two-tailed WMW P-value, Qkg, for each putative host gene

/ mRNA pair as described above except that we replace the actual weight distribution

with the one that we computed after permuting the host gene labels. Formally, we define Pkg

and Qkg as follows:

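$$
P_{kg} = \mathrm{WMW}\left(\{w^i_{kg}\}_{i=1}^{N},\ \left\{\{q^i_{kg}\}_{k=1}^{K}\right\}_{i=1}^{N}\right) \tag{3.8}
$$

$$
Q_{kg} = \mathrm{WMW}\left(\{q^i_{kg}\}_{i=1}^{N},\ \left\{\{q^i_{kg}\}_{k=1}^{K}\right\}_{i=1}^{N}\right) \tag{3.9}
$$

where $\mathrm{WMW}(S, S')$ is a function that calculates a two-tailed WMW P-value for sets $S$ and $S'$, and $\{q^i_{kg}\}$ is the set of weights fit to the permuted data.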
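To make the WMW function concrete, the sketch below computes a two-sided rank-sum P-value with the large-sample normal approximation, without tie or continuity corrections. This is a simplified stand-in chosen for self-containment; statistical packages may use exact or corrected variants, so values can differ slightly. The helper names are hypothetical.

```python
import math
import random


def midranks(values):
    """Return 1-based ranks, averaging the ranks within groups of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def wmw_two_sided(sample, null):
    """Two-sided rank-sum P-value via the normal approximation."""
    n1, n2 = len(sample), len(null)
    ranks = midranks(list(sample) + list(null))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2      # Mann-Whitney U of `sample`
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))      # equals 2 * (1 - Phi(|z|))


# Illustrative data: 140 real weights shifted upward, compared with a pooled null.
rnd = random.Random(0)
pooled_null = [rnd.gauss(0.0, 0.2) for _ in range(2000)]
real_weights = [rnd.gauss(0.3, 0.2) for _ in range(140)]
p_real = wmw_two_sided(real_weights, pooled_null)
```

Applying the same function to a set of permuted weights in place of `real_weights` gives the corresponding value of the negative-example statistic.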

Fig. 3.3.a-d show the CDFs of weights (i.e. $w^i_{kg}$ and $q^i_{kg}$, $\forall k$) for all host genes whose

intronic miRNAs have potential target sites in LSM12. The CDF of the pooled weights


obtained from the permuted data (the thick gray line) is also shown. These weights

were obtained from two methods: ULM (Fig. 3.3.a-b) and a method that sets weights

by correlation (Fig. 3.3.c-d) (the CORR method, see materials for details). Recently,

the HOCTAR method was introduced that uses inverse correlation with host genes to

detect intronic miRNA targets [56]; here we use the CORR method to demonstrate how

well inverse correlation performed within our framework. From Fig. 3.3.c-d, we see that

the distributions obtained from CORR from the actual and permuted data are almost

indistinguishable, suggesting that CORR is underpowered and/or prone to misclassification

compared to ULM. Moreover, these observations also confirm the cooperative impact

of miRNAs on target genes. By contrast, the distributions of three host genes, namely

CTDSP1, CTDSP2, and CTDSPL, obtained from ULM (and also from the constrained linear

model (CLM); Fig. S4), are significantly different from their permuted counterparts and

the pooled distribution. The table at the bottom of Fig. 3.3 lists Pkg and Qkg for each

interaction. In the next subsection we specify a cutoff point in order to determine the

significant interactions that we will be using to make predictions about targets.

3.2.4 Determining a cutoff value for significant interactions

We apply ROC analysis to determine a cutoff point for specifying significant Pkg. Fig. 3.4

shows the ROC curves for the ULM and CORR methods when we use − logPkg as the

discriminant values for the positive examples and − logQkg for the negative examples.

By using a cutoff of 0.01 for the ULM Pkg values, we are able to achieve a sensitivity

of 32% at 100% predicted specificity. In other words, 32% of interactions predicted by

TargetScan are assigned weights whose distributions are more distinguishable from a

random distribution than any of those assigned the permuted host gene / mRNA pairs.

If we insist on 100% specificity, CORR only recovers 17% of the TargetScan predicted

host gene / mRNA interactions; achieving 32% sensitivity with CORR requires lowering

the specificity to 94%. The corresponding cumulative distribution of these log P-values


[Figure 3.3: four CDF panels (x-axis: weights; y-axis: CDF): (a) ULM, (b) permuted ULM, (c) Corr, (d) permuted Corr.]

Target gene: LSM12; Pcutoff = 10^-2

Host gene   miRNA       ULM           Perm-ULM      Corr          Perm-Corr
C9orf3      miR-27b     1.2 × 10^-2   2.5 × 10^-2   6.6 × 10^-1   5.3 × 10^-1
CTDSP1      miR-26b     3.1 × 10^-8   2.9 × 10^-1   6.5 × 10^-1   3.5 × 10^-2
CTDSP2      miR-26a-1   1.7 × 10^-4   9.6 × 10^-1   9.8 × 10^-1   2.0 × 10^-1
CTDSPL      miR-26a-2   2.1 × 10^-5   1.2 × 10^-1   2.3 × 10^-1   1.7 × 10^-1
DNM3        miR-214     3.4 × 10^-2   3.1 × 10^-1   3.5 × 10^-1   8.4 × 10^-1
KLHL3       miR-847     3.8 × 10^-1   7.3 × 10^-1   2.5 × 10^-2   3.3 × 10^-1
RNF130      miR-340     3.1 × 10^-1   2.3 × 10^-1   5.0 × 10^-2   5.9 × 10^-1

Figure 3.3: Plots a-d: the CDFs of the weights $w^i_{gk}$ (a-b) and the correlations $\rho^i_{gk}$ (c-d), $\forall i, g$, for seven host genes, obtained from ULM (a and b) and CORR (c and d) with the actual (a and c) and permutation (b and d) setups. The thick gray line in each plot is the CDF obtained from the pooled permutation data for each method. The table lists the p-values (Wilcoxon rank-sum test) showing the probability that the weight or correlation data are drawn from the pooled permuted data (see (3.8) and (3.9) for details). P-values marked in red are predicted to be significant (P < 0.01). It should be noted that the host gene MIRHG1 was excluded from the analysis because expression data for this host gene did not exist in the retrieved dataset.


[Figure 3.4: ROC curve (x-axis: false positive rate, 1 - specificity; y-axis: true positive rate, sensitivity) with curves for ULM, CORR, and random; the cutoff of 0.01 is marked.]

Figure 3.4: Receiver Operating Characteristic (ROC) curve analysis to determine the cutoff point. We set the cutoff point to 0.01 ($-\log_{10} 0.01 = 2$) to identify significant host-target interactions. The blue, red, and black curves show the ROC associated with ULM, CORR, and random, respectively.

is shown in Fig. S1-2. In the example in Fig. 3.3, we detect significant interactions between CTDSP1 and LSM12 (P-value = $3.1 \times 10^{-8}$ (ULM)), between CTDSP2 and LSM12 (P-value = $1.7 \times 10^{-4}$ (ULM)), and between CTDSPL and LSM12 (P-value = $2.1 \times 10^{-5}$ (ULM)). Fig. 3.5 shows the boxplots of weights of 7 host genes whose intronic miRNAs putatively target LSM12.
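The cutoff logic of this subsection can be illustrated on the LSM12 example, using the ULM and Perm-ULM P-values listed in the table of Fig. 3.3 as positive and negative examples, respectively. The function names are hypothetical, and seven pairs are far too few to reproduce the genome-wide sensitivity figures quoted above; the sketch only shows how sensitivity and specificity follow from a chosen cutoff.

```python
import math

# ULM P-values (real data) and Perm-ULM P-values (permuted data) for the seven
# host genes of the LSM12 example, as listed in the table of Fig. 3.3.
p_real = [1.2e-2, 3.1e-8, 1.7e-4, 2.1e-5, 3.4e-2, 3.8e-1, 3.1e-1]
p_perm = [2.5e-2, 2.9e-1, 9.6e-1, 1.2e-1, 3.1e-1, 7.3e-1, 2.3e-1]


def sensitivity_specificity(p_real, p_perm, cutoff):
    """Sensitivity and specificity when pairs with P < cutoff are called significant."""
    sens = sum(p < cutoff for p in p_real) / len(p_real)
    spec = sum(q >= cutoff for q in p_perm) / len(p_perm)
    return sens, spec


def sensitivity_at_full_specificity(p_real, p_perm):
    """Fraction of real pairs scoring above every permuted pair on -log10(P)."""
    strictest = max(-math.log10(q) for q in p_perm)
    hits = sum(1 for p in p_real if -math.log10(p) > strictest)
    return hits / len(p_real)


sens, spec = sensitivity_specificity(p_real, p_perm, cutoff=0.01)
best = sensitivity_at_full_specificity(p_real, p_perm)
```

With the cutoff of 0.01, three of the seven real pairs (CTDSP1, CTDSP2, CTDSPL) are called significant and none of the permuted pairs are, matching the interactions reported for this example.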

3.2.5 Predicting miRNA targets using inverse correlation (CORR

method)

Gennarino and colleagues [56] recently described an algorithm, HOCTAR, that predicts

intronic microRNA targets based on inverse correlation of their host genes with other


Table 3.2: InMiR procedure

for g = 1 : G (number of target genes)
    - find all intronic miRNAs which putatively target g using TargetScan
    - map intronic miRNAs to their host genes, $k = 1, \ldots, K_g$
    for i = 1 : N (number of gene expression datasets)
        - extract the expression data of the host genes, $\mathbf{H}^i_g$
        - extract the expression data of the target gene, $\mathbf{x}^i_g$
        - solve $\mathbf{w}^i_g = \arg\min_{\mathbf{w}^i_g} \|\Delta\mathbf{x}^i_g - \mathbf{H}^i_g\mathbf{w}^i_g\|$
        - permute the rows using a permutation matrix, $\mathbf{M}$, to get $\mathbf{M}\mathbf{H}^i_g$
        - solve $\mathbf{q}^i_g = \arg\min_{\mathbf{q}^i_g} \|\Delta\mathbf{x}^i_g - \mathbf{M}\mathbf{H}^i_g\mathbf{q}^i_g\|$
    end
    for k = 1 : $K_g$
        - compute the P-values:
            $P_{kg} = \mathrm{WMW}\left(\{w^i_{kg}\}_{i=1}^{N},\ \{\{q^i_{kg}\}_{k=1}^{K}\}_{i=1}^{N}\right)$
            $Q_{kg} = \mathrm{WMW}\left(\{q^i_{kg}\}_{i=1}^{N},\ \{\{q^i_{kg}\}_{k=1}^{K}\}_{i=1}^{N}\right)$
    end
end
- set two classes of data, I: $\{P_{kg} \mid \forall i, g, k\}$ and II: $\{Q_{kg} \mid \forall i, g, k\}$
- plot the ROC curve and determine a cutoff point ($P_{\text{cutoff}}$) that gives almost zero false positives
- declare the interaction between host gene k and target gene g significant if $P_{kg} < P_{\text{cutoff}}$


[Figure 3.5: boxplots of weight values (y-axis) for each miRNA / host gene pair (x-axis); targeted gene: LSM12; a dashed line marks the median of the permuted data.]

Figure 3.5: Boxplots of the weights obtained from the procedure described in Table 3.2. The significant negative interactions, i.e. those with P < Pcutoff and mean_gk > random, are marked with asterisks. The horizontal dashed line indicates the median of the weights obtained from the permutation test.

mRNAs across a large number of datasets. As we have previously demonstrated [52],

linear models that consider the impact of multiple potential miRNA regulators generate

more accurate target predictions than simple correlations, consistent with recent obser-

vations of miRNA-target interactions [20, 120]. To assess whether these observations

hold for target predictions based on host gene expression, we also assessed a version of

our method in which we replace the weights with correlations. The resulting algorithm

is very similar to HOCTAR.

In particular, we denote the correlation coefficient by $\rho^i_{gk} = \mathrm{corr}(\mathbf{x}^i_g, \mathbf{h}^i_k)$, $\forall i, k, g$, where $\mathrm{corr}(\cdot, \cdot)$ represents the Pearson correlation coefficient. We then use these correlations $\rho^i_{gk}$ for real and permuted datasets in the place of weights to calculate the P-value based enrichment measures as described in Section 3.2.3. We call this method CORR.
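For completeness, the quantity that CORR substitutes for the regression weight is the ordinary Pearson coefficient between a target profile and a host profile. The sketch below uses made-up five-sample profiles; the function name `pearson` is illustrative. Note that, under the definition above, the correlation is computed on the expression level itself rather than on its decrease, so repression by the hosted miRNA would appear as a negative coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Made-up profiles: the target falls as the host rises.
host_profile = [1.0, 2.0, 3.0, 4.0, 5.0]
target_profile = [5.0, 4.0, 3.0, 2.0, 1.0]
rho = pearson(target_profile, host_profile)
```

Unlike the regression weights, each coefficient is computed for one host gene at a time and therefore ignores the other putative regulators of the same target.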


3.2.6 Processing hosts and targets data

We retrieved the miRBase V.16 gene context repository and extracted all human intronic

miRNA-host gene associations. We also downloaded 140 gene expression datasets (GDS

V.2011) from Gene Expression Omnibus (GEO) which were built on the Affymetrix

HG-U133 microarray platform [56] using MATLAB function getgeodata.m (Table S1 and

materials). Only those probe IDs that could be mapped to gene symbols (according to

HGNC) were considered for analysis. We averaged the expression levels of all transcripts

per gene. We used the list of putatively predicted target genes (9448) and their intronic

miRNAs (134) from the TargetScan (release 5.1) repository.

3.3 Results

3.3.1 Data set

140 curated gene expression data sets, called GDS, were downloaded from Gene Expres-

sion Omnibus (GEO) using the MATLAB Bioinformatics toolbox function getgeodata.m.

The list of these GDSs is given in Table S1. Each dataset is then processed as follows. First, we excluded those genes for which we have missing values. Then we filtered out genes with absolute values less than the 10th percentile using the MATLAB function genelowvalfilter.m. The expression profiles of the host genes are normalized so that all have length one; mathematically, this means $\mathbf{h}^i_{kg} \leftarrow \mathbf{h}^i_{kg} / \|\mathbf{h}^i_{kg}\|$, $\forall i, k, g$. For the target genes, we obtain the decrease in expression level as $\Delta x_g = \bar{x}_g - x_g$, where $\bar{x}_g = \frac{1}{K_g}\sum_{k=1}^{K_g} x_{gk}$, $\forall g$.
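The unit-length normalization of host gene profiles mentioned above amounts to dividing each profile by its Euclidean norm. A minimal sketch follows; the function name and the handling of all-zero profiles are assumptions made for the example.

```python
import math


def unit_length(profile):
    """Scale an expression profile so that its Euclidean norm equals one."""
    norm = math.sqrt(sum(v * v for v in profile))
    if norm == 0:
        # An all-zero profile has no direction; rejecting it is an assumption.
        raise ValueError("cannot normalize an all-zero profile")
    return [v / norm for v in profile]


normalized = unit_length([3.0, 4.0])
```

After this step, differences in the overall magnitude of host gene profiles no longer influence the fitted weights.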

3.3.2 Detecting good host gene surrogate

Using the method described in the last section, we defined for each host gene a set of sig-

nificant interactions between the host gene’s expression level and those of the predicted

targets of its associated intronic miRNAs (i.e. those for which Pkg < Pcutoff). Furthermore, we classify an interaction as a "negative" one when the mean of the weights over all datasets (i.e. $\mathrm{mean}(w_{kg}) = \frac{1}{N}\sum_{i=1}^{N} w^i_{kg}$) is larger than random expectation, or as a "non-negative" one when the mean is smaller than random expectation. When we examine all the significant interactions between a host (or equivalently its miRNA) and its predicted targets, we find that, for a given host, these interactions are almost exclusively negative or almost exclusively non-negative.

We retrieved and processed the expression profiles of 75 host genes and 3864 target

genes (see materials and Table S3 ) over 140 datasets. For all target genes (G = 3864),

we carried out the procedure given in Materials subsection 5 for obtaining p-values for

ULM, CLM, and CORR methods. All of these p-values are available in Table S3. We

report the results for ULM; the significant interactions from CLM are similar and, as

we described in the last section, using CORR reduces our sensitivity or specificity or

both. After applying the cutoff at P = 0.01, we find that 22 (29%) host genes have more

negative interactions than positive ones. Those host genes and their 1935 target genes

are shown in Fig. 3.6.
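The classification of host genes can be sketched as a simple tally over significant interactions. The majority rule used below (more negative than non-negative significant interactions) is an assumption based on the description in this subsection, and the input tuples are invented for illustration.

```python
def classify_hosts(interactions, p_cutoff=0.01):
    """Label each host 'good' or 'bad' from its significant interactions.

    interactions: iterable of (host, p_value, mean_weight, null_mean) tuples.
    """
    counts = {}
    for host, p_value, mean_weight, null_mean in interactions:
        if p_value >= p_cutoff:
            continue  # not significant, ignored
        negative, non_negative = counts.get(host, (0, 0))
        if mean_weight > null_mean:
            counts[host] = (negative + 1, non_negative)   # "negative" interaction
        else:
            counts[host] = (negative, non_negative + 1)   # "non-negative" interaction
    return {
        host: "good" if negative > non_negative else "bad"
        for host, (negative, non_negative) in counts.items()
    }


# Hypothetical interactions for two hosts, HOST_A and HOST_B.
example = [
    ("HOST_A", 1e-5, 0.20, 0.01),
    ("HOST_A", 2e-3, 0.15, 0.01),
    ("HOST_A", 0.30, -0.10, 0.01),   # not significant, ignored
    ("HOST_B", 1e-4, -0.12, 0.01),
    ("HOST_B", 5e-3, -0.08, 0.01),
    ("HOST_B", 8e-3, 0.09, 0.01),
]
labels = classify_hosts(example)
```

Hosts with no significant interactions do not appear in the result, reflecting that the test says nothing about them.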

Fig. 3.7 shows the number of TargetScan-predicted targets for each of these 22 host

genes, along with the number of significant interactions for these predicted targets and

the number of these significant interactions that are negative. As shown, for 21 out of 22

host genes, almost all interactions are negative (equal light green and yellow bars). We

take this as evidence that the host gene expression level is a good surrogate for that of

its intronic miRNAs. Indeed when we consider all of the host genes with any significant

interactions, we find that they fall into two main classes: those whose interactions are

almost exclusively negative and those that are non-negative (Fig. 3.8). Furthermore,

those that are non-negative are highly enriched for those with possible promoters, as

predicted by sequence analysis in [111], for their intronic miRNAs (Fig. 3.8 and Fig. 3.9).

We also observe that significantly negatively enriched host genes have, on average, high mean −log10 p-values (blue circles). For instance, 7 out of 8 host genes, namely HNRNPK, COPZ1, HUWE1, PANK2, ACADVL, LARP7, and IARS2, appear at the top of the ranked mean p-value list. Thus, significantly negative interactions and high mean −log10 p-values are two determinants which may provide strong evidence for detecting co-expressed host-intronic miRNA pairs.

Figure 3.6: A gene-gene interaction network of target and host genes of intronic miRNAs with significant negative interactions. Each green and red node shows a host and target gene, respectively. An edge indicates that there is a significant negative interaction between two nodes, i.e. mean_gk > random and Pkg < Pcutoff. The size of each host node is proportional to the number of the edges connected to it. Host–intronic miRNA pairs are: MCM7–miR-106b/93/25, LARP7–miR-367/302a/302b, LARP7–miR-302c/d, RNF130–miR-340, PPIL2–miR-130b/301b, HUWE1–miR-98/let-7f, CTDSP2–miR-26a, CTDSP1–miR-26b, RCL1–miR-101, COPZ1–miR-148b, PANK2–miR-103, TRPM3–miR-204, DNM2–miR-199a/638, IARS2–miR-215/194, HNRNPK–miR-7, SREBF2–miR-33a, WWP2–miR-140, DALRD3–miR-425/191, EVL–miR-342, LPP–miR-28, ACADVL–miR-324, KIAA1797–miR-491, C3orf60–miR-191.


[Figure 3.7: bar chart (x-axis: host genes and their intronic miRNAs; y-axis: number of target genes) with three bars per host gene.]

Figure 3.7: Each dark green bar shows the number of putative targets (obtained from TargetScan) of the intronic miRNAs of the corresponding host gene labeled on the x-axis. Light green bars indicate the number of putative targets that satisfy the condition Pgk < Pcutoff (significantly regulated). The number of putative targets that meet both conditions, Pgk < Pcutoff and mean_gk > random (significantly negatively regulated), is shown by yellow bars.

3.3.3 Targeting of host genes by miRNAs partially explains

their predicted surrogacy

Even if a host gene and intronic miRNA are expressed from the same promoter, they

could have different expression levels due to different post-transcriptional regulation. To

investigate this, we examined the predicted miRNA targets within the 3’ UTRs of host

genes. We found host genes are targeted by miRNAs much more than non-host genes


[Figure 3.8: scatter plot (x-axis: percentage of negatively enriched targets; y-axis: mean −log10 p-values) with markers for good surrogate hosts, bad surrogate hosts, and hosts whose intronic miRNAs have predicted promoters.]

Figure 3.8: Each circle, associated with a host, shows the mean of the −log10 p-values of the enriched genes versus the percentage of negatively enriched genes targeted by the intronic miRNAs of the host gene. The blue and red circles are associated with good and bad surrogate host genes, respectively. The circles corresponding to the hosts whose intronic miRNAs have predicted promoters are marked by yellow triangles.

($P < 10^{-22}$, Wilcoxon rank-sum test), though we were unable to detect a preference for

targeting by intronic versus intergenic miRNAs (Fig. S5b). However, we found that

negatively enriched host genes have significantly fewer (P < 0.02, Wilcoxon ranksum

test) miRNA targets than non-negatively enriched hosts (Fig. 3.11). So, down-regulation

of the host gene by other miRNAs could provide another possible explanation for why

some host expression levels are bad surrogates for those of their intronic miRNAs. The

pattern of interactions among host genes and their intronic miRNAs suggests that there

may be some hierarchical structure in intronic miRNA-based regulation (Fig. 3.12).


[Figure 3.9: Venn diagram of three sets: good surrogate hosts, bad surrogate hosts, and hosts whose intronic miRNAs have independent promoters.]

Figure 3.9: Venn diagrams showing the overlap between good and bad surrogate host genes and hosts whose intronic miRNAs have predicted promoters.

[Figure 3.10: CDF plot (x-axis: # of putative targeting miRNAs per base pair; y-axis: CDF) for host genes and non-host genes; P < 7.2723e-022.]

Figure 3.10: The CDF of the number of miRNAs targeting host (blue) and non-host genes (red) per base; that is, number of targets / 3’ UTR length. The CDFs are obtained from analyzing 367 host genes and 17000 non-host genes.


[Figure 3.11: bar chart (x-axis: host genes; y-axis: number of putative targeting miRNAs) with separate bars for intronic and intergenic miRNAs. Host genes on the x-axis: CLCN5, CTDSPL, WDR82, KLHL3, C9orf5, LPP, CALCR, PANK3, CTDSP2, ANK1, HOXA9, CHRM2, TLN2, AATK, DNM2, HTR2C, MEST, ASTN1, DNM3, GABRE, NR6A1, WWP2, APOLD1, CTDSP1, HNRNPK, GPC1, HOXC5, PTPRN2, TRPM3, ARRB1, C17orf91, CHM, COPZ1, MEGF6, ACADVL, EML2, FGF13, PDE2A, RCL1, VPS13B, C9orf3, DALRD3, DNM1, HUWE1, SLIT3, SMC4, SREBF1, AC106864.1, C3orf60, C6orf155, DLEU2, EVL, HOXC4, IARS2, KIAA1797, LARP7, MCM7, MIRHG1, MYH7B, PANK2, PPIL2, PTPRN, RNF130, SLIT2, SREBF2, TOP3B, TRPM1.]

Figure 3.11: Number of intergenic and intronic miRNAs that putatively target our set of host genes. Bars marked by red circles are associated with the genes predicted to be good surrogates.

Figure 3.12: Host genes targeted by intronic miRNAs of other hosts. The nodes corresponding to the hosts predicted to be good surrogates are shown in red.


3.3.4 Correlation measurements are not good indicators of sur-

rogacy

Correlation between the expression patterns of the host genes and their intronic miRNAs

in a single dataset is not a good indicator of surrogacy. We observed that correlation

measurements reported by five different groups are largely non-overlapping and somewhat

inconsistent. Only 11 host-miRNA pairs show high positive correlation (ρ > 0.4) in at

least two of these five datasets (Fig. 3.13). Out of these 11 host genes, 4 host genes

are predicted to be good surrogates by our model. While the intronic miRNAs of none

of these 4 hosts have promoters, 6 out of 7 hosts predicted to be bad surrogates have

intronic miRNAs with promoters (Fig. 3.13). Thus, 7 highly correlated host-intronic

miRNA pairs pass neither our criteria nor the promoterless condition.

We collected the correlation results reported by Wang et al. [43], Liang et al. [67],

Baskerville et al. [64], and Ruike et al. [45]. Wang’s data, reported in terms of p-values,

are transformed to Pearson correlation coefficients to be consistent with other data. The

transformation is done based on the significance of a correlation coefficient test [124].

In addition, we applied the data given in [125] and [126] and computed the correlation

between the matched host genes and intronic miRNAs; we refer to this method as Rad. In

order to compute the correlation between the expression profiles of miRNAs and mRNAs

in the Rad data, we analyzed the data collected in [20]. Ritchie et al. analyzed the miRNA

expression data cloned by Landgraf et al. [126]. After downloading their data and

processing them we obtain the expressions of 117 human miRNAs and the expression of

22283 genes over 117 samples. We then computed the correlation between all miRNA-

mRNA pairs.
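The conversion from a reported p-value to a correlation coefficient mentioned above can be written out explicitly. The sketch below assumes that the test in [124] is the standard two-sided t-test for a Pearson correlation with $n - 2$ degrees of freedom, where $n$ is the number of samples; $F^{-1}_{n-2}$ denotes the quantile function of Student's t distribution, a symbol introduced here for illustration.

```latex
\begin{align*}
t &= \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  && \text{(test statistic for a correlation } r\text{)}, \\
t &= F^{-1}_{n-2}\!\left(1 - \tfrac{p}{2}\right)
  && \text{(recovered from a two-sided p-value } p\text{)}, \\
|r| &= \frac{t}{\sqrt{t^{2} + n - 2}} .
\end{align*}
```

A two-sided p-value determines only the magnitude of the coefficient, so the sign has to come from additional information in the source data.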

In this way, we obtain correlation coefficients for 84 host-intronic miRNA pairs from

five different datasets. We expect that co-expressed host-miRNA pairs show strong corre-

lation in at least two of these five datasets. The scatter plots (Fig. 3.14) of the correlation


[Figure 3.13: bar chart of the Pearson correlation coefficient (y-axis) for the host–miRNA pairs EVL–miR-342, LPP–miR-28, CTDSP2–miR-26a, CTDSP1–miR-26b, MEST–miR-335, C9orf3–miR-24, GABRE–miR-452, PDE2A–miR-139, GPC1–miR-149, SLIT3–miR-218, and SLIT2–miR-218, grouped into good and bad surrogates.]

Figure 3.13: Pearson correlation coefficients averaged over five correlation datasets. Only those host-intronic miRNA pairs which are significant (P < 0.05) in at least two datasets and overlap with our host gene list are considered. The hosts marked with a yellow triangle contain intronic miRNAs with predicted independent promoters.

data, however, show that the measurements are largely non-overlapping and somewhat inconsistent,

suggesting that solely relying on correlation data may not be sufficient to declare a host-

intronic miRNA pair co-transcribed.

3.4 Discussion

InMiR models the combinatorial effect of miRNAs using a simple and biologically plausi-

ble linear model. Because we use ordinary linear regression for target prediction, InMiR

is fast and easy to update to incorporate new mRNA expression data. We used data


[Figure 3.14: four scatter-plot panels (a-d), each plotting the correlations from one dataset (Rad, Liang, Wang, Ruike) against those from the remaining datasets (Liang, Wang, Ruike, Baskerville).]

Figure 3.14: Scatter plots of five correlation datasets (Table S4). (a) The scatter plot of Rad’s data versus Liang’s, Wang’s, Ruike’s, and Baskerville’s data. (b) The scatter plot of Liang’s data versus Wang’s, Ruike’s, and Baskerville’s data. (c) The scatter plot of Wang’s data versus Ruike’s and Baskerville’s data. (d) The scatter plot of Ruike’s data versus Baskerville’s data.

from ∼1,500 gene expression arrays to predict interactions in humans between 57 intronic miRNAs and 3,864 potential targets. InMiR can also be readily applied to species other than human because intronic miRNAs constitute a large portion of the miRNA complement of a variety of species (Fig. 3.15).
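
The ordinary least-squares fit mentioned above can be sketched as follows. This is a minimal illustration, assuming that host-gene expression profiles stand in for their intronic miRNAs; the function name `fit_host_surrogates`, the toy data, and the reading of a negative coefficient as candidate repression are illustrative assumptions, not InMiR's actual implementation.

```python
import numpy as np

def fit_host_surrogates(target_expr, host_expr):
    """Ordinary least squares of one target's expression on host-gene profiles.

    target_expr: shape (n_arrays,); host_expr: shape (n_arrays, n_hosts).
    Returns one coefficient per host gene (intercept removed by centering).
    """
    y = target_expr - target_expr.mean()
    X = host_expr - host_expr.mean(axis=0)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Toy data (hypothetical): 6 arrays, 2 host genes; the target falls as host 0 rises.
hosts = np.array([[1.0, 0.2], [2.0, 0.1], [3.0, 0.4],
                  [4.0, 0.3], [5.0, 0.6], [6.0, 0.5]])
target = 10.0 - 0.5 * hosts[:, 0]
coef = fit_host_surrogates(target, hosts)
```

Because the fit is a single least-squares solve, adding new arrays only means appending rows and refitting, which is in line with the claim that the model is easy to update.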

Unlike previously described methods, InMiR does not assume that all host genes have

expression levels that are equally good surrogates. The set of host genes predicted by

InMiR to be bad surrogates is enriched for those with predicted intronic promoters as


[Figure 3.15 plot: stacked bar chart of the number of miRNAs (y-axis, 0 to 1,000) per species (x-axis): Homo sapiens (hsa), Bos taurus (bta), Pan troglodytes (ptr), Mus musculus (mmu), Pongo pygmaeus (ppy), Gallus gallus (gga), Macaca mulatta (mml), Danio rerio (dre), Equus caballus (eca), Ornithorhynchus anatinus (oan), Ciona intestinalis (cin), Rattus norvegicus (rno), Canis familiaris (cfa), Taeniopygia guttata (tgu), Xenopus tropicalis (xtr), Sus scrofa (ssc), Caenorhabditis elegans (cel), Drosophila melanogaster (dme), Monodelphis domestica (mdo), Tetraodon nigroviridis (tni). Legend: intergenic, intron, 3’ UTR, exon. Inset for Homo sapiens with shares of 38%, 55%, 2%, and 5%.]

Figure 3.15: Intronic miRNAs comprise a significant portion of identified miRNAs in other species. Stacked bars showing the number of miRNAs located in exons (brown), 3’ UTRs (yellow), introns (cyan), and intergenic regions (blue) in 20 species for which more than 100 microRNAs have been detected. Data are retrieved from miRBase (v.15).

well as having a larger number of microRNA target sites in their 3’ UTRs.

As shown in Fig. 3.16, our observations suggest at least three types of regulatory relationships between host genes and their intronic microRNAs: (a) an intronic miRNA and its host gene are transcribed from the same promoter; the mature miRNA is then processed from the intron before or after splicing, either by Drosha or independently of it (mirtrons), and the subsequent steady-state expression levels of the host and intronic miRNA are highly correlated (Fig. 3.16a); (b) an intronic miRNA has its own promoter and is transcribed independently from the host gene at least some of the time (Fig. 3.16b); (c) the intronic


[Figure 3.16 diagram: three panels. A: co-expressed host and intronic miRNA. B: independently transcribed and expressed intronic miRNA. C: co-transcribed host and intronic miRNA that are not co-expressed (the host mRNA is targeted by a miRNA). Diagram labels: transcriptional start site, exon, intron, UTR, miRNA processing, miRNA, mRNA, RISC, repressed mRNA.]

Figure 3.16: Regulatory mechanisms. Three possible scenarios for the transcription and expression of a host and its intronic miRNA.

miRNA and host are transcribed from the same promoter, but the post-transcriptional regulation of the host gene's expression level differs from that of the miRNA (Fig. 3.16c). For example, a host gene could be down-regulated by its own intronic miRNA (we found three self-regulated hosts, all of which were predicted to be bad surrogates by InMiR; Fig. 3.17), or host genes could be down-regulated by other co-expressed miRNAs.

The host gene / intronic miRNA interactions that we observe suggest a variety of new


Figure 3.17: The host genes targeted by their own intronic miRNAs. The host genes in our dataset which are targeted by their own intronic miRNAs. All of these hosts are predicted to be bad surrogates.

regulatory mechanisms. For example, tightly coupled host gene and intronic miRNA expression could support a rapid “biological switch” in cellular state, in which expression of the host gene is accompanied by expression of an intronic miRNA that immediately down-regulates genes expressed in the competing state (Fig. 3.18). Our observations raise a number of interesting questions. Are intronic miRNAs with their own promoter ever expressed from the host gene’s promoter? How is this decision regulated? How does the independent transcription of an intronic miRNA affect host gene transcription? Does the processing of intronic miRNA interfere with splicing? This may depend on whether Drosha cleaves the pre-miRNA before or after splicing. Kim and Kim [71] speculated that both mechanisms may occur, but no conclusive results can be drawn yet. Answers to these questions would provide a clearer picture of intronic miRNA biogenesis.


[Figure 3.18 diagram: a host and its intronic miRNA cooperatively resemble a biological switch. Along a time axis: gene 1 is already expressed; gene 2 is being expressed with its intronic miRNA; after miRNA processing, the miRNA targets an mRNA, leaving a repressed mRNA; gene 1 is repressed and gene 2 is expressed.]

Figure 3.18: Tightly coupled host gene and intronic miRNA expression could support a rapid “biological switch” in cellular state in which host gene expression also expresses an intronic miRNA that immediately down-regulates genes expressed in the competing state.

Chapter 4

BayMiR: inferring evidence for endogenous miRNA-induced gene repression from mRNA expression profiles

4.1 Introduction

In the previous chapter, we introduced InMiR, a computational method for predicting the target genes of intronic miRNAs. Although we showed that InMiR can successfully predict the targets of intronic miRNAs, many miRNAs are not intronic and many intronic miRNAs are not co-expressed with their host genes, a prerequisite for using host genes as surrogates of intronic miRNAs in the InMiR model. Therefore, we need a prediction method that works for all miRNAs and is independent of the host genes. In this chapter, we introduce BayMiR, a new computational method that predicts the functionality of potential miRNA target sites using the activity level of the miRNAs inferred from genome-wide mRNA expression profiles [127]. For each mRNA-miRNA pair, BayMiR


computes an “endogenous target repression” score that identifies the contribution of each miRNA in repressing the target mRNA expression in the presence of other targeting miRNAs that are active in the same cellular contexts. We also found that validated miRNA targets exhibit high expression variability, suggesting that an index of mRNA expression variation can also be used as another score for predicting miRNA targets. We benchmarked BayMiR, the expression variation index, Cometa, and the TargetScan “context scores” on two tasks: predicting independently validated miRNA targets and predicting the decrease in mRNA abundance in miRNA over-expression assays. BayMiR performed better than all other methods in both benchmarks and, surprisingly, the variation index performed better than Cometa and some individual determinants of the TargetScan context scores. Furthermore, BayMiR-predicted miRNA target sets are more consistently annotated with GO and KEGG terms than similar-sized random subsets of genes with conserved miRNA seed regions. We have thus refined the functional classification of miRNAs by assigning them functions based on the enrichment of their BayMiR-predicted targets in KEGG pathways. Our work suggests that modeling multiplicative interactions among miRNAs is important to predict endogenous, miRNA-induced decreases in steady-state mRNA abundance.

BayMiR infers the activity level of each miRNA from the expression profiles of its putative targets (predicted on the basis of conserved seed matches) and then refines these target predictions using the regression model. We also found that expression variability is significantly higher among mRNAs with more miRNA target sites and, furthermore, that it can be used to identify more likely targets. Accordingly, we used the variance of gene expression levels across a wide range of samples, including different cell types, cell lines, and diseased/healthy tissues, as another mRNA-miRNA scoring scheme. We call this score the “gene variation” index.

BayMiR analysis was conducted on 1,539 human miRNAs and the expression levels of 13,303 genes measured in 5,372 microarray experiments. It predicts that approximately 60% of miRNA-mRNA duplexes with matched conserved target sites have a detectable down-regulation signal in gene expression. We evaluated and compared the efficacy of the proposed scores with eight TargetScan scores (a collection of the most important sequence-based features) as well as Cometa scores (an mRNA expression-based miRNA target prediction method) using miRNA over-expression experiments, validated targets, and GO and KEGG enrichment analysis. Using these benchmarks, we found that the BayMiR scores consistently outperform both the sequence and expression scores and identify to what extent down-regulated genes on a global set of microarrays are under the control of miRNAs.

4.2 Results

4.2.1 BayMiR method

BayMiR (Fig. 4.1) calculates the degree to which mRNA down-regulation inferred from a

large set of microarrays can be explained by inferred miRNA activity. BayMiR makes this

prediction by integrating sequence and expression evidence. Because many targets are

under the control of multiple miRNAs [20, 51, 106, 120], BayMiR applies a linear model

that relates the target expression vector (measured variable) to a weighted combination

of the miRNA activity vectors (regressor variables). BayMiR infers the activity vector of

a given miRNA by averaging the normalized expression vectors of its predicted mRNA

targets based on sequence-based prediction methods. These miRNA activity vectors are

then used as regressors in a Bayesian linear regression model of the “down-regulation”

expression vector of each mRNA. The resulting regression coefficients of each miRNA are

interpreted as the strength of miRNA-mediated repression of the target mRNA.
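
The activity-inference step described above can be sketched as follows. This is a minimal illustration, assuming that "normalized" means per-gene z-scoring across samples; the function name `mirna_activity` and the `target_rows` input are hypothetical. Elsewhere the text describes activity as the mean of the inverse expression of targets, so a sign flip of the result may be appropriate.

```python
import numpy as np

def mirna_activity(expr, target_rows):
    """Average the z-scored expression vectors of a miRNA's predicted targets.

    expr: array of shape (n_mrnas, n_samples).
    target_rows: row indices of the sequence-predicted targets (hypothetical input).
    Assumes no target has constant expression (its standard deviation would be zero).
    """
    sub = expr[target_rows]
    z = (sub - sub.mean(axis=1, keepdims=True)) / sub.std(axis=1, keepdims=True)
    return z.mean(axis=0)

# Toy expression matrix: 3 mRNAs measured in 3 samples.
expr = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 6.0],
                 [5.0, 5.0, 9.0]])
activity = mirna_activity(expr, [0, 1])
```

Stacking one such vector per miRNA as columns would give the regressor matrix used in the linear model.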

We also considered the variability in gene expression of a target mRNA as a determinant to distinguish between functional and non-functional targets of a given miRNA. The gene variation index for each mRNA is computed as the variance of gene expression levels across all samples.
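
A minimal sketch of the gene variation index as defined above. The use of the population variance (rather than the sample variance) and the toy gene names are assumptions made for illustration.

```python
from statistics import pvariance

def gene_variation_index(expression_by_gene):
    """Map each gene to the variance of its expression levels across samples."""
    return {gene: pvariance(levels) for gene, levels in expression_by_gene.items()}

# Toy data: geneA is flat across samples, geneB swings between two levels.
toy = {"geneA": [5.0, 5.0, 5.0, 5.0], "geneB": [2.0, 8.0, 2.0, 8.0]}
index = gene_variation_index(toy)
```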

Each expression vector consists of the transcript abundance of the target in each of 392 biological samples collected from 5,372 microarray experiments. We determine the coefficients of the regression model using a penalized likelihood approach called elastic net regression [128] (see 4.3.1), modified to assign only non-negative coefficients. By using this regression model, each sequence-predicted miRNA-mRNA interaction is assigned one coefficient; this coefficient represents how much the inferred activity profile of that miRNA contributes to predicting that mRNA’s “down-regulation” profile (see 4.3.1) when considering the activity profiles of all other miRNAs predicted to target the mRNA. We call these coefficients “BayMiR scores” and interpret a zero BayMiR score as representing a lack of evidence in the expression data for regulation of the mRNA by that miRNA.
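
One way to obtain non-negative elastic net coefficients is cyclic coordinate descent with clipping at zero. The sketch below is an illustration under stated assumptions, not the thesis implementation: the objective scaling, the penalty names `l1` and `l2`, the iteration count, and the synthetic data are all hypothetical choices.

```python
import numpy as np

def nonneg_elastic_net(W, y, l1=0.1, l2=0.1, n_iter=200):
    """Minimise 0.5*||y - W h||^2 + l1*sum(h) + 0.5*l2*||h||^2 subject to h >= 0."""
    n_samples, n_mirnas = W.shape
    h = np.zeros(n_mirnas)
    col_sq = (W ** 2).sum(axis=0)
    residual = y - W @ h
    for _ in range(n_iter):
        for k in range(n_mirnas):
            # Remove miRNA k's current contribution from the fit.
            residual += W[:, k] * h[k]
            rho = W[:, k] @ residual
            # Closed-form coordinate update, clipped at zero.
            h[k] = max(0.0, (rho - l1) / (col_sq[k] + l2))
            residual -= W[:, k] * h[k]
    return h

# Synthetic example: 50 samples, 3 miRNA activity vectors, the second one inactive.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 3))
y = W @ np.array([0.8, 0.0, 0.5])
scores = nonneg_elastic_net(W, y, l1=0.01, l2=0.01)
```

With non-negative coefficients the L1 norm reduces to a plain sum, which is why the update needs only a single threshold rather than two-sided soft-thresholding.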

4.2.2 BayMiR identifies highly repressed targets on miRNA over-expression assays

To evaluate whether the BayMiR scores reflect the strength of miRNA-mediated repression of mRNA targets, we measured the consistency between the BayMiR scores and the relative down-regulation of targets in a set of miRNA over-expression experiments. One expects high-scoring targets to be down-regulated more in miRNA over-expression experiments. We note that a similar metric has previously been used to evaluate the efficiency of TargetScan scores [7, 18], and that this set of miRNA over-expression assays was not used in BayMiR to obtain the scores; thus, we are not influencing the results of our evaluation either by selecting biased metrics or by evaluating our model on the training data. We downloaded the data collected by Khan et al. [39], in which 23 miRNAs were transfected into seven different cell types and the log-fold change of the expression levels of mRNAs was measured. To examine the degree to which our scores can predict the log-fold change of mRNAs in the miRNA over-expression arrays, for each score, we binned mRNAs into five bins based on their scores and computed the mean of mRNA


[Figure 4.1 flowchart: mRNA expression data set (13,303 mRNAs by 5,372 samples) -> identifying the target set of miRNA1 using sequence determinants -> averaging the expression vectors of the target set of miRNA1 to obtain the miRNA activity vectors w1, w2, ..., wK (miRNA1 to miRNAK) -> Bayesian linear regression of the target expression vector yg, yg = h1w1 + h2w2 + ... + hKwK + eg -> miRNAs-mRNAg scores hg = [h1, h2, ..., hK].]

Figure 4.1: BayMiR Method. Flowchart of the BayMiR algorithm. For each miRNA, BayMiR first identifies the set of targets based on the presence of conserved complementary sites to the seed region of the miRNA in the 3’ UTR of the target. Next, for each miRNA, BayMiR extracts the mRNA expression vectors associated with the selected targets from the mRNA gene expression data set, and averages them to obtain the miRNA activity vector. These miRNA activity vectors are used as regressors in a Bayesian linear regression model to explain the down-regulation in the expression level of the target. Finally, BayMiR infers scores (the regression coefficients) using a penalized likelihood method called elastic net regression. Each score indicates the strength of miRNA-mediated repression on the target genes.

log-fold changes in each bin. We observed that negative log-fold repression levels decrease consistently as scores decrease for both determinants (Fig. 4.2, top). In total, 3,867 out of 10,125 mRNAs are down-regulated in the miRNA over-expression experiments. We then asked if our scoring schemes can detect repressed targets better than the individual components of the TargetScan context score [7]. When comparing negative mean log-fold changes for messages whose scores were greater than the median score for the corresponding miRNA, BayMiR scores outperform all TargetScan scores, even the context+ score, which is a combination of all individual TargetScan scores (Fig. 4.2, middle). In addition, when we combined BayMiR scores and the TargetScan context+ score, the performance further improved (Wilcoxon-Mann-Whitney test: P < 0.001), indicating that BayMiR can augment the TargetScan scoring system. Target site conservation is another scoring scheme used by TargetScan, so we also compared BayMiR scores with conservation scores for all conserved target sites of all conserved miRNA families and found similar improvements (Fig. 4.2, bottom). Our analysis also shows that the gene variation score was a better predictor of log-fold change than seed pairing stability, relative location of the seed match in the 3’ UTR, and target abundance; however, it is worse than the other components of the context score on this assay (Fig. 4.2, middle).
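
The binning evaluation described above can be sketched as follows. This is a minimal illustration, assuming equal-sized percentile bins (0-20%, ..., 80-100% of the ranked scores); the function name and the toy scores are hypothetical.

```python
from statistics import mean

def mean_change_by_score_bin(scores, log_fold_changes, n_bins=5):
    """Rank mRNAs by score, split into equal-sized bins, average log-fold change per bin."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) / n_bins
    bins = []
    for b in range(n_bins):
        members = order[round(b * size):round((b + 1) * size)]
        bins.append(mean(log_fold_changes[i] for i in members))
    return bins  # bins[0] holds the lowest-scoring mRNAs

# Toy data in which higher scores go with stronger repression (more negative change).
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
lfc = [-0.9, -0.1, -0.5, -0.3, -0.7, -0.2, -0.8, -0.4, -0.6, -1.0]
bin_means = mean_change_by_score_bin(scores, lfc)
```

A score that tracks repression should give bin means that fall steadily from the lowest-scoring bin to the highest-scoring one.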

High-scoring BayMiR targets are enriched for validated targets

To test whether the set of experimentally validated targets is enriched among high-scoring BayMiR targets, we measured the significance of the overlap between the targets with scores greater than the median and the experimentally validated targets retrieved from TarBase [129]. Enrichment analysis using the hypergeometric test showed that the validated targets are enriched in the sets of high-scoring genes for both BayMiR and gene variation predicted targets (P < 10^-5 and P < 10^-4, respectively). A cumulative distribution analysis is also shown in Fig. 4.3. Together, these observations support the hypothesis that targets repressed under endogenous conditions are more likely to be functional targets.
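
A hypergeometric enrichment p-value of the kind used above can be computed directly from binomial coefficients. This is a minimal sketch with hypothetical counts; the function and argument names are illustrative only.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_validated, n_high, n_overlap):
    """P(overlap >= n_overlap) when n_high targets are drawn without replacement."""
    upper = min(n_validated, n_high)
    tail = sum(comb(n_validated, k) * comb(n_total - n_validated, n_high - k)
               for k in range(n_overlap, upper + 1))
    return tail / comb(n_total, n_high)

# Toy counts: 20 putative targets, 5 validated, 10 high-scoring, 4 in the overlap.
p = hypergeom_enrichment_p(n_total=20, n_validated=5, n_high=10, n_overlap=4)
```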

BayMiR predicts miRNA-induced repression better than Cometa

Next, we used the same evaluation strategy to compare BayMiR scores with an mRNA-miRNA scoring method that also uses large-scale gene expression data. Recently, Gennarino et al. [55] showed that the targets of a miRNA tend to be co-expressed and, based on this property, proposed Cometa, a computational method that scores each

sequence-based miRNA target prediction based on how correlated it is with other pre-


[Figure 4.2 plots: three bar-chart panels with y-axis "Avg fold decrease in abundance (log2)". Top: BayMiR and gene variation, by score percentage bin (0-20, 20-40, 40-60, 60-80, 80-100). Middle: bars for context+ score + BayMiR, BayMiR score, context+ score, site type, target abundance, seed pairing stability, local AU, position contribution, gene variation, and 3' UTR contribution. Bottom: BayMiR and conservation. Legend: mean log-fold change for mRNAs whose scores > median of all mRNA scores; mean log-fold change for targets in the transfection experiments.]

Figure 4.2: (top) mRNAs in the miRNA over-expression assays are grouped into five bins based on their BayMiR and gene variation scores; the mean log-fold change of the mRNAs in each bin is plotted as a bar. There are two groups of bars; the left- and right-hand groups correspond to BayMiR and gene variation, respectively. (middle) Comparing BayMiR and gene variation scores with seven sequence scores from TargetScan. Each bar represents the negative mean log-fold change for mRNAs whose scores are greater than the median of all mRNA scores for the selected determinant in the miRNA over-expression assays. The left-most bar is obtained by combining the context+ scores with BayMiR scores. The dashed line shows the mean log-fold change for all targets in the miRNA over-expression assays. (bottom) Comparing BayMiR scores with the conservation scores as measured by TargetScan. The conservation scores are given only for the targets with conserved target sites complementary to the seed regions of the conserved miRNA families. Error bars indicate 95% confidence intervals for the estimated means.


[Figure 4.3 plots: two CDF panels (y-axis, 0 to 1). Left: BayMiR scores (x-axis, 0 to 1), annotated P < 10^-8. Right: gene variation (x-axis, 0 to 10), annotated P < 10^-4. Legend: Validated, All.]

Figure 4.3: Cumulative distribution of scores for the validated targets. Validated targets are assigned higher BayMiR scores and gene variation scores compared to the other putative targets. Shown are the cumulative distributions of BayMiR scores (left plot) and gene variation scores (right plot) for validated targets (blue) and all putative targets (red).

dicted targets of the miRNA. Examining the down-regulated targets on the miRNA

over-expression assays shows that negative mean log-fold expression changes for targets selected by our scoring schemes are significantly higher than those selected by Cometa scores (P < 10^-40, Fig. 4.5). Moreover, our methods’ high-scoring targets are significantly more down-regulated than Cometa’s high-scoring targets (P < 10^-60, Fig. 4.4) on the over-expression assays. Although Cometa targets are also enriched for validated targets, this enrichment is weaker than that of BayMiR high-scoring targets (P < 0.01 vs. P < 10^-5).

BayMiR target sets have more consistent GO-BP and KEGG annotations

Many miRNAs participate in the coordinated regulation of biological processes [130]; as such, we should expect that, in general, better target prediction methods would generate miRNA target sets that have higher enrichment [109]. To test whether BayMiR-predicted targets are more consistently annotated with GO (release 2012-2-19) and KEGG (release


[Figure 4.4 plot: bar chart of "Avg fold decrease in abundance (log2)" (y-axis, 0 to 1.0) for BayMiR, gene variation, and Cometa. Legend: mean log-fold change for mRNAs whose scores > median of all scores; mean log-fold change for targets in the transfection experiments.]

Figure 4.4: BayMiR high-scoring targets are more down-regulated in miRNA over-expression assays than Cometa high-scoring targets. Each bar represents the mean of negative log-fold change after miRNA over-expression for genes with scores greater than the median.

2012-02-14) terms than TargetScan targets, we used Fisher’s exact test with an FDR multiple-testing correction (see Materials and Methods) to score the enrichment of 1,233 GO-BP terms and 259 KEGG pathways within the target sets of each of 1,264 miRNA families. We found a nearly three-fold increase in enriched terms and pathways (FDR < 0.1) within BayMiR-predicted target sets compared to equally sized random subsets of TargetScan targets (31,976 vs. 11,890, P < 10^-200). Examination of the enriched GO-BP terms and KEGG pathways revealed a wide diversity of biological processes regulated by miRNAs (Table S1, FDR < 0.1 and Table S2, FDR < 0.1). We found that 35% of the miRNAs that have BayMiR target sets are enriched for the GO term “regulation of expression”, suggesting that miRNAs have substantial influence on gene regulation through their control of other gene regulators.
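
The text does not name the FDR procedure used. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up correction, applied to a hypothetical list of enrichment p-values and thresholded at the FDR < 0.1 level quoted above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four term/miRNA-family tests.
q = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
enriched = [i for i, value in enumerate(q) if value < 0.1]
```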


[Figure 4.5 plot: cumulative fraction (y-axis, 0 to 1) against fold change (log2) (x-axis, -2 to 1.5). Legend: BayMiR, gene variation, Cometa.]

Figure 4.5: Comparing BayMiR and Cometa. BayMiR high-scoring targets are more down-regulated in miRNA over-expression assays than Cometa high-scoring targets. The cumulative distribution of log-fold change for high-scoring mRNAs; blue, red, and black represent graphs associated with BayMiR, gene variation, and Cometa.

We also searched for miRNAs with known functions among the miRNAs enriched in our pathway analysis. A list of miRNAs with experimentally supported functions among their enriched pathways is given in Table S3. Notably, the miR-17 family is frequently seen in the list. This family has been extensively studied and shown to play an important role in many cancer-related processes and pathways ([80, 81] and references in Fig. 4.17). An enrichment map of the 30 most frequently enriched GO-BP terms/pathways is depicted in Fig. 4.7. When we examined the mRNAs in KEGG pathways targeted by miRNAs, we found that although there is extensive co-regulation of mRNAs by multiple miRNAs, a handful of miRNAs appeared to be responsible for most of the regulation. For example, in the WNT signaling pathway, five miRNAs target 32 out of 46 genes predicted to be targeted by any of the 45 miRNAs with targets in this pathway (Fig. 4.8). Similarly, the 106 genes in “Pathways in cancer” are targeted by 83 miRNAs but only 10 of these


miRNAs | Pathways | PMID
miR-17/20ab/93/106ab/519d | Pathways in cancer | 16461460; 18485879
miR-124/124ab/506 | Axon guidance | 18619591
miR-138/138ab | Pathways in cancer | 18201269; 20332227
miR-155 | T-cell-receptor signaling | 17463289; 19877012
miR-15abc16abc/195/424/497 | p53 signaling pathway | 19626115
miR-17/20ab/93/106ab/519d | MAPK signaling pathway | 18700987
miR-17/20ab/93/106ab/519d | p53 signaling pathway | 19696742
miR-200bc/429/548a | Pathways in cancer | 19671845; 18829540
miR-200bc/429/548a | Pathways in cancer | 17804704; 18376396
miR-1ab/206/613 | Pathways in cancer | 18593897; 19684618
miR-25/32/92abc/363/367 | Phosphoinositide signali | 20388916
miR-302abcde/372/373/520 | Pathways in cancer | 17695719; 18193036
miR-29abcd | Pathways in cancer | 19247375; 19818597
miR-29abcd | Focal adhesion | 19956414
let-7/98/4458/4500 | bladder cancer | 21993544
miR-29abcd | Small cell lung cancer | 17890317
miR-302abcde/372/373/520 | Cell cycle | 18328430
miR-133abc | Apoptosis | 17715156
miR-17/20ab/93/106ab/519d | Cell cycle | 18700987

miRNA Family | GO-BP terms | PMID
miR-17/20ab/93/106ab/519d | G1-S transition of mitotic cell cycle | 19153141; 20404090
miR-17/20ab/93/106ab/519d | Cell-cycle phase | 18212054
miR-155 | TGFb receptor signaling pathway | 19701459
miR-155 | Immune system development | 17463290; 18291670
miR-17/20ab/93/106ab/519d | G1/S transition of mitotic cell cycle | 18836483; 18700987
miR-17/20ab/93/106ab/519d | Apoptosis | 19696742; 17384677
miR-181abcd/4262 | Regulation of apoptosis | 20145152
miR-17/20ab/93/106ab/519d | Apoptosis | 19696742; 17881434
miR-17/20ab/93/106ab/519d | Cell-cycle process | 17881434; 18836483
miR-17/20ab/93/106ab/519d | G1-S transition of mitotic cell cycle | 18836483
miR-221/222/222ab/1928 | Regulation of cell-matrix adhesion | 20110463
miR-221/222/222ab/1928 | Induction of apoptosis | 17616664; 19730150
miR-224 | Regulation of apoptosis | 18319255
miR-27abc/27a-3p | Muscle cell differentiation | 20388916
miR-29abcd | Extracellular matrix organisation | 18390668
miR-34abc/449ab | cell cycle arrest | 17554337
miR-302abcde/372/373/520 | regulation of cell cycle | 18328430
miR-124/124ab/506 | neuron differentiation | 17679093; 17344415
miR-125ab/4319 | apoptosis | 19293287
miR-125ab/4319 | neurogenesis | 16227573
miR-133abc | regulation of apoptosis | 17715156
miR-146ac/146b-5p | immune response | 16885212
miR-15abc16abc/195/424/497 | regulation of cell cycle | 17242205; 18701644
miR-17/20ab/93/106ab/519d | regulation of cell cycle | 18700987
miR-221/222/222ab/1928 | hemopoiesis | 16330772

Figure 4.6: Validated KEGG pathways. List of miRNAs with proposed functions found in our enriched KEGG list; the third column gives the PubMed IDs of the references.

miRNAs collectively target more than 75% of these genes (Fig S.3). Although some of this consolidation of targeting can be explained by the large variability in the number of mRNA targets per miRNA, there is significantly more consolidation than we would expect by chance (Fig. 4.10, P < 10^-19).

These observations suggest that important miRNA regulators of specific biological

processes can be identified in silico through gene set enrichment analysis of BayMiR

target sets.
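
The consolidation of targeting discussed above can be quantified by asking what share of all targeted mRNAs is covered by the miRNAs with the largest target sets. The sketch below is illustrative; the function name, the rounding rule for the number of top miRNAs, and the toy target sets are assumptions.

```python
def coverage_by_top_mirnas(target_sets, top_fraction):
    """Fraction of all targeted mRNAs covered by the largest target sets.

    target_sets: dict mapping a miRNA name to a set of target mRNA names.
    top_fraction: share of miRNAs to keep, ranked by target-set size.
    """
    ranked = sorted(target_sets.values(), key=len, reverse=True)
    n_top = max(1, round(top_fraction * len(ranked)))
    covered = set().union(*ranked[:n_top])
    all_targets = set().union(*ranked)
    return len(covered) / len(all_targets)

# Hypothetical target sets for four miRNAs.
toy_sets = {
    "miR-a": {"g1", "g2", "g3", "g4"},
    "miR-b": {"g3", "g5"},
    "miR-c": {"g6"},
    "miR-d": {"g1"},
}
share = coverage_by_top_mirnas(toy_sets, top_fraction=0.25)
```

Evaluating the function over a grid of `top_fraction` values traces out a cumulative coverage curve for a given collection of target sets.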

4.2.3 miRNA activity and expression profiles are significantly correlated

To test if miRNA activities obtained using the BayMiR procedure are correlated with

the miRNA expression profiles, we downloaded the miRNA expression data from the


Figure 4.7: Enrichment map for the 30 most frequent KEGG pathways; each node indicates a pathway; there is an edge between two pathways if they share more than ten miRNAs; the edge thickness is proportional to the number of shared miRNAs; the size of each node is proportional to the number of miRNAs enriched in the corresponding pathway. Note that we say a miRNA is enriched in a pathway when the predicted targets of the miRNA are over-represented in the pathway based on a statistical test.

mimiRNA repository [57] and computed the correlation between matched activity and expression vectors. After excluding miRNA expression data that are not consistent across multiple resources (according to P > 0.05 reported in the mimiRNA resource) and mapping the biological samples of the miRNA expression data to our biological groups, we obtained paired matches for 48 miRNAs. Interestingly, we found that 96% of the pairs (46 out of 48) have Pearson correlation coefficients greater than 0.35, compared to 4% positive correlation obtained from a similar analysis with permuted activity vectors (P < 0.05 and Table S.4). This correlation analysis shows that miRNA activities inferred from the mean of the inverse expression of their targets are highly correlated with expression data for those miRNAs.
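
The comparison against permuted activity vectors can be sketched as follows. This is a minimal illustration with hypothetical vectors; the number of permutations and the random seed are arbitrary choices, and only the 0.35 threshold is taken from the text.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def permuted_correlations(activity, expression, n_perm=1000, seed=0):
    """Correlations after shuffling the activity vector (a null baseline)."""
    rng = random.Random(seed)
    shuffled = list(activity)
    result = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        result.append(pearson(shuffled, expression))
    return result

# Hypothetical matched activity and expression vectors for one miRNA.
activity = [0.1, 0.4, 0.35, 0.8, 0.9, 1.2]
expression = [1.0, 2.1, 1.9, 3.8, 4.2, 5.9]
r = pearson(activity, expression)
null = permuted_correlations(activity, expression)
frac_above = sum(value > 0.35 for value in null) / len(null)
```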


Figure 4.8: WNT signaling pathway: 32 targets of 5 miRNAs are involved in the pathway (red boxes). The 14 mRNAs targeted by the remaining miRNAs are colored yellow, and 23 mRNAs involved in the pathway were excluded from the BayMiR target list because their expression variability across arrays was very low (white boxes). The miRNA family IDs: miR-518a-5p/520d-5p/524-5p, miR-556-3p, miR-4514/4692, miR-548aeajamx, miR-135ab/135a-5p.

4.2.4 mRNAs harboring miRNA target sites near both ends of the 3’ UTR have higher endogenous down-regulation signals

To investigate any association between the endogenous target repression scores provided by BayMiR and the sequence and gene variation determinants, we measured the correlation between the scores of all paired determinants (Fig. 4.11). The heat map shows that BayMiR scores correlate most highly with the position contribution scores. In addition, when we ranked all mRNA-miRNA pairs based on their BayMiR scores, the top half of the ranked list had higher position contribution scores than the bottom half (P < 10^-200, Wilcoxon-Mann-Whitney test; Fig. 4.12). The position contribution scores provide an estimate of expected repression in terms of the distance of


Figure 4.9: KEGG “Pathways in cancer”: 68 targets of 10 miRNAs are involved in the pathway (red boxes). 38 genes targeted by the other miRNAs are colored in yellow; and 62 genes involved in the pathway were excluded from the BayMiR target list since their expression variabilities across arrays were very low (white boxes). The miRNA family IDs: miR-17/17-5p/20ab/20b-5p/93/106ab/427/518a-3p/519d, miR-548ah/3609, miR-4729, miR-203, miR-548p, miR-3647-3p, miR-300/381/539-3p, miR-142-5p, miR-545, miR-125a-5p/125b-5p/351/670/4319.

target sites from both ends of the 3’ UTR; target sites near the ORF or the poly(A) tail are more effective [7] and more conserved than those in the middle of the 3’ UTR [131]. To further investigate this, we located 1,567,294 conserved target sites matched to the seed regions of 1,032 miRNAs on the 3’ UTRs of 17,840 mRNAs retrieved from TargetScan 6.2. The start position of each target site was divided by the length of the 3’ UTR to obtain the relative position of the target site on the 3’ UTR, denoted by 0 < L_miRNA < 1. We found that target sites located at either end of the 3’ UTR (L_miRNA < 0.25 or L_miRNA > 0.75) are assigned higher BayMiR scores than those in the middle (P < 10^-200, Wilcoxon-Mann-Whitney test). Furthermore, we found that target sites located in the terminus close to the poly(A) tail (L_miRNA > 0.75) are assigned higher BayMiR scores than those located in the other terminus (L_miRNA < 0.25; P < 10^-5, Wilcoxon-Mann-Whitney test). Poly(A) shortening is known as one of the mechanisms


[Figure 4.10 plot: cumulative curves of the proportion of mRNAs targeted (y-axis, 0 to 1) against the top N% of miRNAs sorted by target set size (x-axis, 0 to 100). Legend: KEGG pathway targets, all BayMiR targets. Annotation: on average, 10% of the most-targeting miRNAs in each pathway target 88% of all targets.]

Figure 4.10: miRNA targeting. A small percent of all targeting miRNAs collectively target a large portion of miRNA targets. The figure shows two cumulative distributions. (red) Proportion of the set of all mRNAs with BayMiR targets that are covered by the union of the target sets of the top N% miRNAs (sorted by number of targets), where N increases along the x-axis. (blue) Average of cumulative distributions for all enriched KEGG pathways (n = 108), where each distribution was created as per the red line but restricted to targeted mRNAs associated with the KEGG pathways, and their targeting miRNAs.

of mRNA degradation; this mechanism may favor miRNA target sites near the end of the 3’ UTR, close to the poly(A) tail, where they can recruit mRNA deadenylase complexes [132]. Together, these lines of evidence underline the importance of target site position in miRNA targeting.
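
The relative-position calculation used in this section can be sketched as follows. The 0.25 and 0.75 cut-offs come from the text; the function names, region labels, and example coordinates are hypothetical.

```python
def relative_site_position(site_start, utr_length):
    """Relative position L_miRNA of a target site: start position / 3' UTR length."""
    return site_start / utr_length

def site_region(l_mirna):
    """Classify a site with the 0.25 and 0.75 cut-offs quoted in the text."""
    if l_mirna < 0.25:
        return "ORF-proximal end"
    if l_mirna > 0.75:
        return "poly(A)-proximal end"
    return "middle"

# Three hypothetical sites on a 2,000-nt 3' UTR.
regions = [site_region(relative_site_position(start, 2000))
           for start in (100, 1000, 1900)]
```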

BayMiR scores are also highly correlated with gene variation scores suggesting that

mRNAs with high expression variability are under selective pressure to be miRNA targets.


[Figure 4.11 heat map values. Rows and columns follow the same order: (1) context+ score, (2) 3' UTR, (3) local AU, (4) position contribution, (5) target abundance, (6) seed pairing stability, (7) site type, (8) BayMiR score, (9) gene variation.]

       (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)
(1)   1.00   0.11   0.40   0.58   0.05  -0.29   0.55   0.12  -0.02
(2)   0.11   1.00  -0.05   0.00   0.00  -0.04   0.00   0.00   0.00
(3)   0.40  -0.05   1.00  -0.04  -0.03   0.24   0.02  -0.07   0.02
(4)   0.58   0.00  -0.04   1.00  -0.01   0.00  -0.11   0.18  -0.02
(5)   0.05   0.00  -0.03  -0.01   1.00  -0.20  -0.26   0.04  -0.03
(6)  -0.29  -0.04   0.24   0.00  -0.20   1.00   0.07  -0.02   0.02
(7)   0.55   0.00   0.02  -0.11  -0.26   0.07   1.00   0.03  -0.01
(8)   0.12   0.00  -0.07   0.18   0.04  -0.02   0.03   1.00   0.18
(9)  -0.02   0.00   0.02  -0.02  -0.03   0.02  -0.01   0.18   1.00

Figure 4.11: The heat map shows the Pearson correlation coefficients between each pair of nine determinants. The correlation coefficients for pairs labeled by 0 are not significant (i.e., P > 0.05).

4.3 Materials and Methods

4.3.1 BayMiR model

BayMiR applies the following linear model to relate the changes in the log-transformed

expression level of mRNAs to the activity level of miRNAs:

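$$\underset{M\times 1}{\mathbf{y}_i} \;=\; \underset{M\times K}{\mathbf{W}}\;\underset{K\times 1}{\mathbf{h}_i} \;+\; \underset{M\times 1}{\boldsymbol{\varepsilon}}$$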

where y_i ∈ R^M denotes the change in the expression level of the ith mRNA measured
across M samples and is obtained by subtracting the mean; W = [w_{m,k}] (M × K) denotes the
activity levels of K miRNAs across M samples; and each element of h_i ∈ R_+^K represents
the contribution of the corresponding miRNA in down-regulating the expression of the
ith mRNA; ε models error. In our problem K = 1,252, M = 369 and i = 1, ..., 13,000.
In this linear equation, y_i and W are observed; h_i is the desired unknown variable.

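BayMiR infers h by maximizing the posterior probability of h given y and W:

$$\hat{\mathbf{h}} = \arg\max_{\mathbf{h}} \log p(\mathbf{h}\mid\mathbf{y},\mathbf{W}).$$

Using Bayes's rule,

$$\hat{\mathbf{h}} = \arg\max_{\mathbf{h}} \log p(\mathbf{h}\mid\mathbf{y},\mathbf{W}) = \arg\max_{\mathbf{h}} \frac{p(\mathbf{y},\mathbf{h},\mathbf{W})}{\int_{\mathbf{h}} p(\mathbf{y},\mathbf{h},\mathbf{W})\,d\mathbf{h}} = \arg\max_{\mathbf{h}} \frac{p(\mathbf{y}\mid\mathbf{h},\mathbf{W})\,p(\mathbf{h})}{\int p(\mathbf{y}\mid\mathbf{h},\mathbf{W})\,p(\mathbf{h})\,d\mathbf{h}}.$$

Since the denominator is not a function of h,

$$\hat{\mathbf{h}} = \arg\max_{\mathbf{h}}\; p(\mathbf{y}\mid\mathbf{h},\mathbf{W})\,p(\mathbf{h})$$

where

$$p(\mathbf{y}\mid\mathbf{h},\mathbf{W}) = \frac{1}{(2\pi)^{K/2}\,\sigma_n^{K}} \exp\left(-\frac{1}{2}\,\frac{\sum_{m}\left(y_m - \mathbf{w}_{m,:}\mathbf{h}\right)^2}{\sigma_n^2}\right)$$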

We assume that the prior probability p(h) is a compromise between Gaussian and Laplace
distributions, given by

$$p_{\alpha_1,\alpha_2}(\mathbf{h}) = C(\alpha_1,\alpha_2)\,\exp\left(-\alpha_1\|\mathbf{h}\|_1 - \alpha_2\|\mathbf{h}\|_2^2\right)$$

where ‖·‖_1 and ‖·‖_2 denote the one-norm and the two-norm, respectively. Since h appears only in
the argument of exponential functions in the above probabilities, and since the exponential
function is monotonic, maximizing the posterior probability is equivalent to minimizing
the negated argument of the exponential function; hence

$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}}\; \frac{1}{2\sigma_n^2}\sum_{m}\left(y_m - \mathbf{w}_{m,:}\mathbf{h}\right)^2 + \alpha_1\|\mathbf{h}\|_1 + \alpha_2\|\mathbf{h}\|_2^2 \tag{4.1}$$

Multiplying this expression by σ_n^2 and letting λ1 = σ_n^2 α1 and λ2 = σ_n^2 α2, this Bayesian
inference problem can be written in the form of a penalized linear regression optimization
given by:

$$\hat{\mathbf{h}} = \arg\min_{\mathbf{h}}\; \frac{1}{2}\sum_{m}\left(y_m - \mathbf{w}_{m,:}\mathbf{h}\right)^2 + \lambda_1\sum_{k}|h_k| + \lambda_2\sum_{k}h_k^2 \tag{4.2}$$

where λ1 and λ2 are two tuning parameters and w_{m,:} is a row vector representing the
activity of the miRNAs in the mth sample.

We solve this optimization using the coordinate-descent method [128], in which the
objective function is partially optimized with respect to each individual coefficient in an
iterative manner, as described below.

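The objective in (4.2) can be rewritten, separating out the jth coefficient, as

$$f(\mathbf{h}) = \frac{1}{2}\sum_{m=1}^{M}\Big(y_m - \sum_{k\neq j} w_{m,k}h_k - w_{m,j}h_j\Big)^2 + \lambda_1\sum_{k\neq j}|h_k| + \lambda_1|h_j| + \lambda_2\sum_{k\neq j}h_k^2 + \lambda_2 h_j^2 \tag{4.3}$$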

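We minimize f(h) with respect to h_j by setting the derivative of f(h) to zero. If h_j > 0,

$$\frac{\partial f(\mathbf{h})}{\partial h_j} = -\sum_{m=1}^{M} w_{m,j}\Big(y_m - \sum_{k\neq j} w_{m,k}h_k\Big) + \sum_{m=1}^{M} w_{m,j}^2\,h_j + \lambda_1 + \lambda_2 h_j \tag{4.4}$$

where the factor of 2 that arises from differentiating λ2 h_j^2 has been absorbed into the tuning parameter λ2.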

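Therefore

$$h_j = \begin{cases} \dfrac{\sum_{m=1}^{M}\big(y_m - \sum_{k\neq j} w_{m,k}h_k\big)w_{m,j} - \lambda_1}{\sum_{m=1}^{M} w_{m,j}^2 + \lambda_2}, & \sum_{m=1}^{M}\big(y_m - \sum_{k\neq j} w_{m,k}h_k\big)w_{m,j} > \lambda_1 \\[2ex] 0, & \text{otherwise.} \end{cases} \tag{4.5}$$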


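Likewise, if h_j < 0,

$$h_j = \begin{cases} \dfrac{\sum_{m=1}^{M}\big(y_m - \sum_{k\neq j} w_{m,k}h_k\big)w_{m,j} + \lambda_1}{\sum_{m=1}^{M} w_{m,j}^2 + \lambda_2}, & \sum_{m=1}^{M}\big(y_m - \sum_{k\neq j} w_{m,k}h_k\big)w_{m,j} < -\lambda_1 \\[2ex] 0, & \text{otherwise.} \end{cases} \tag{4.6}$$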

In a compact form, the above expressions can be rewritten as

$$h_j = \frac{S\Big(\sum_{m=1}^{M}\big(y_m - \sum_{k\neq j} w_{m,k}h_k\big)w_{m,j},\; \lambda_1\Big)}{\sum_{m=1}^{M} w_{m,j}^2 + \lambda_2} \tag{4.7}$$

where S(x, t) is the soft-threshold operator, defined as S(x, t) = sign(x)(|x| − t)_+, where (y)_+ = 0
if y < 0 and (y)_+ = y if y ≥ 0 [133].

The optimization is based on pathwise coordinate descent, in which we solve a sequence
of scalar minimization subproblems using the following routine:

Algorithm (Pathwise coordinate descent)

    itr ← 0, c ← 0
    while c < K and itr < maxitr
        c ← 0
        for j = 1 : K
            h_j ← S( Σ_{m=1}^{M} (y_m − Σ_{k≠j} w_{m,k} h_k) w_{m,j}, λ1 ) / ( Σ_{m=1}^{M} w_{m,j}^2 + λ2 )
            if |h_j^old − h_j| / |h_j^old| < ε then c ← c + 1
            h_j^old ← h_j
        end
        itr ← itr + 1
    end
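The routine above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BayMiR implementation: the function names `soft_threshold` and `coordinate_descent` are hypothetical, the data are plain nested lists, the convergence test uses a relative change with a floor of 1 in the denominator to avoid dividing by zero, and the update follows (4.7) as written, without enforcing non-negativity of h.

```python
def soft_threshold(x, t):
    """Soft-threshold operator S(x, t) = sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def coordinate_descent(W, y, lam1, lam2, eps=1e-8, max_iter=1000):
    """Cyclically apply update (4.7) to every coefficient until all K are stable.

    W: M x K nested list (assumed to have no all-zero column when lam2 == 0)
    y: list of M values
    """
    M, K = len(W), len(W[0])
    h = [0.0] * K
    # Denominators of (4.7): column sums of squares.
    col_sq = [sum(W[m][j] ** 2 for m in range(M)) for j in range(K)]
    # Full residual r_m = y_m - sum_k w_mk * h_k, kept up to date.
    resid = list(y)
    for _ in range(max_iter):
        converged = 0
        for j in range(K):
            # rho = sum_m w_mj * (y_m - sum_{k != j} w_mk * h_k)
            rho = sum(W[m][j] * resid[m] for m in range(M)) + col_sq[j] * h[j]
            new = soft_threshold(rho, lam1) / (col_sq[j] + lam2)
            delta = new - h[j]
            # Assumed stopping rule: relative change below eps.
            if abs(delta) <= eps * max(abs(h[j]), 1.0):
                converged += 1
            if delta != 0.0:
                for m in range(M):
                    resid[m] -= W[m][j] * delta
                h[j] = new
        if converged == K:
            break
    return h


# Orthonormal design: each coefficient is S(y_j, lam1) / (1 + lam2).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = coordinate_descent(W, [3.0, -0.5, -2.0], lam1=1.0, lam2=1.0)
```

With an orthonormal W the coordinates decouple, so the result can be checked by hand against (4.7); for correlated columns the sweeps are repeated until every coefficient stops changing.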

Since miRNA and target mRNA expression data are anti-correlated [73], for each
miRNA, BayMiR uses the negative mean of target expression levels as an estimate of the
activity level of the miRNA:

$$\mathbf{w}_k = -\frac{1}{N_k}\sum_{i=1}^{N_k}\mathbf{y}_i, \qquad N_k:\ \text{number of target genes of the } k\text{th miRNA} \tag{4.8}$$

Each activity vector is then normalized, w_k ← w_k / ‖w_k‖. As such, the activity of the
miRNA will be deemed to be positive when its sequence-predicted targets are below
their mean expression level. BayMiR considers a gene as a potential target of a miRNA
if the gene contains a conserved match site complementary to the seed region of the miRNA.
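The activity estimate in (4.8) can be sketched as follows. The function name `mirna_activity` and the list-of-lists data layout are assumptions made for illustration; the input profiles are assumed to be mean-centred log-expression vectors, as in the model above.

```python
import math


def mirna_activity(Y, targets):
    """Estimate a unit-norm activity vector for one miRNA, following (4.8).

    Y: list of mean-centred log-expression profiles, one list of M values per mRNA
    targets: indices (into Y) of the sequence-predicted targets of the miRNA
    """
    M = len(Y[0])
    n = len(targets)
    # Negative mean of the target profiles, sample by sample.
    w = [-sum(Y[i][m] for i in targets) / n for m in range(M)]
    norm = math.sqrt(sum(v * v for v in w))
    # Leave an all-zero vector untouched rather than dividing by zero.
    return [v / norm for v in w] if norm > 0 else w


# Two targets that are high in sample 0 and low in sample 1:
# the inferred activity is negative in sample 0 and positive in sample 1.
Y = [[1.0, -1.0], [3.0, -3.0], [0.0, 0.0]]
w = mirna_activity(Y, targets=[0, 1])
```

Stacking one such vector per miRNA family as columns gives the matrix W used in the regression.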

We tested whether BayMiR suffers from over-fitting. We divided the biological samples
into training (340 samples) and test (28 samples) sets and predicted the scores using only
the training data. We then used the predicted scores to estimate the gene expression
profiles of the test set and compared them with the original test data. Fig. 4.13 shows the
prediction errors on the training and test data for different values of the penalties.
The difference in prediction error between training and test data is about 0.2, indicating that
BayMiR can predict new profiles with reasonable accuracy. To see how well the
predicted profiles approximate the actual profiles, we plotted the actual down-regulated
profiles along with the predicted profiles for 9 randomly selected genes (Fig. 4.14). We
note that BayMiR considers only down-regulated genes as potential targets for miRNAs.
These results show that BayMiR does not suffer from over-fitting and can predict targets
that are down-regulated in a sample not included in the training data.

4.3.2 Processing mRNA expression data

The mRNA expression data were downloaded from the ArrayExpress Atlas repository at
EMBL-EBI [134], available at www.ebi.ac.uk/gxa/experiment/E-MTAB-62. The
data consist of 5,372 samples profiled on HG-U133A array platforms. As described in
[134], the data were normalized and manually labeled into 369 biological groups covering
a wide range of healthy/cancer tissues, conditions, and cell lines. We processed the
retrieved expression data as follows. All probe sets with no gene symbols were
excluded. The samples belonging to each biological group were averaged (the samples
within one biological group are highly correlated, ρ > 0.85). Lower and upper thresholds,
defined by lth = Q2 − 1.5(Q4 − Q2) and uth = Q4 + 1.5(Q4 − Q2), respectively, where Q2
and Q4 represent the second and fourth quartiles, were used to detect and modify
extreme outliers. The outliers were then replaced with lth or uth. The gene symbol lists
in both the expression and sequence datasets were updated based on the latest release of
the HUGO Gene Nomenclature Committee (HGNC) (Feb. 2012) to have consistent gene
symbols.
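The outlier step can be illustrated with a short sketch. It rests on an interpretation that is not confirmed by the text: the two quartiles in the threshold formulas are taken here to be the 25th and 75th percentile cut points (the usual Tukey fences), computed with Python's `statistics.quantiles`. The function name `clip_outliers` and the choice to apply it to a single list of values are likewise assumptions.

```python
from statistics import quantiles


def clip_outliers(values, k=1.5):
    """Replace extreme values by the lower/upper threshold (lth / uth).

    Assumption: the thresholds are built from the 25th and 75th percentile
    cut points; the thesis labels its quartiles Q2 and Q4.
    """
    q_low, _, q_high = quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    spread = q_high - q_low
    lth = q_low - k * spread
    uth = q_high + k * spread
    # Values inside [lth, uth] are unchanged; others are moved to the nearest threshold.
    return [min(max(v, lth), uth) for v in values]


clipped = clip_outliers([1, 2, 3, 4, 5, 6, 7, 100])
```

In this example only the value 100 lies outside the thresholds and is replaced by the upper threshold; the remaining values are returned unchanged.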

4.3.3 MiRNA-mRNA interaction analysis

We downloaded the list of 19,055 protein-coding gene symbols from the HGNC database
and the list of 1,537 miRNA IDs from miRBase v19. We then built seven 19,055 ×
1,532 binary connectivity matrices based on the mRNA-miRNA interactions given by
TargetScan v6.1, [6] and TarBase [129]. All miRNAs are grouped into 1,251 miRNA
families as defined by TargetScan (miRNAs sharing the same seed region). Conserved
target sites were also retrieved from the TargetScan repository.

4.3.4 Enrichment analysis

Gene Ontology biological process (GO-BP) annotations were downloaded from the Gene
Ontology website on April 15th, 2012. The file contains 14,000 annotations for 15,000
genes. The enrichment analysis was performed using Fisher's exact test on the
BayMiR-predicted targets of each miRNA family. The enrichment
p-values were corrected using the Benjamini-Hochberg procedure [135], and an FDR cutoff of
0.1 was chosen to select significantly enriched categories. The KEGG enrichment
analysis was carried out in a similar manner: the list of 253 human KEGG pathways with
their associated genes was downloaded from http://www.genome.jp/kegg/, and Fisher's exact test
was used to find enriched pathways for the BayMiR targets of all miRNA families.
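The two statistical steps of the enrichment analysis can be sketched with the standard library alone. The sketch assumes a one-sided (over-representation) Fisher's exact test, which equals the upper tail of the hypergeometric distribution; the text does not state which alternative was used. The function names are hypothetical.

```python
from math import comb


def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: genes in the background, K: background genes in the category,
    n: predicted targets, k: predicted targets in the category.
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = enrichment_pvalue(k=5, n=5, K=5, N=10)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [a <= 0.1 for a in adj]  # FDR cutoff of 0.1, as in the text
```

In practice one p-value is computed per category and miRNA family, and the adjusted values are compared with the chosen FDR cutoff.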


4.3.5 Availability of BayMiR and supporting data

The code for BayMiR is available at morrislab.med.utoronto.ca/BayMiR. The package
includes scripts and instructions to regenerate BayMiR scores from the “E-MTAB-62” file
and sequence information; a pre-computed version of the BayMiR scores is
also provided.

4.4 Discussion

Large-scale mRNA expression profiling datasets provide a rich resource to study the

regulatory impact of miRNAs. Here, we showed that the impact of miRNAs on targets is

detectable in normal tissue and unperturbed cell line data. Given a list of miRNAs with

partial complementarity to a particular mRNA, our computational technique, BayMiR,

scores the relative regulatory impact of each miRNA among the other predicted targeting

miRNAs. We showed that BayMiR estimates of miRNA regulatory impact better reflect

independent measures of this impact than the TargetScan context scores; furthermore,

we showed that the context scores and BayMiR can be combined to generate even better

estimates.

BayMiR has several features that make it particularly useful for estimating the potential
regulatory impact of a miRNA. BayMiR models the combinatorial effect of multiple
regulatory miRNAs on a single target, which is critical, as most mRNAs are likely to
be targeted by multiple miRNAs (Fig. 4.15). BayMiR is fast; its runtime is less than a
minute in the current version (10,345 mRNAs, 1,123 miRNAs and 359 biological groups),
so it is easily applied to a subset of, or all, available gene expression data. Because BayMiR
estimates the activity of miRNAs based on mRNA expression data, there is no need
for matching miRNA expression profiles. As such, BayMiR predictions can be easily
extended when new miRNAs are found, and the current version of BayMiR incorporates
all miRNAs retrieved from the latest release of miRBase (v.19).


Combinatorial regulation by multiple miRNAs has been described for particular mRNAs
[7] and is likely to play a large role in mRNA expression regulation [51]. Indeed,
human 3' UTRs contain conserved seed matches for 33 miRNAs on average (median =
16) (Fig. 4.15). This combinatorial regulation may explain the observations that inverse
correlation between miRNA and mRNA expression under endogenous conditions does
not provide strong and consistent evidence of targeting [57, 110] and that the impact of
miRNA regulation on mRNA levels can only be seen within the context of other miRNA
regulations [51, 110]. Fig. 4.16 shows a toy example where combinatorial regulation
masks inverse correlation between miRNA regulators and their targets.
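The masking effect can be reproduced with a small numerical sketch. The expression patterns below are invented for illustration and are not the data behind Fig. 4.16: three hypothetical miRNAs with mutually orthogonal activity patterns jointly repress one target, so each miRNA alone shows only a moderate inverse correlation with the target while their sum is perfectly anti-correlated with it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical activity of three miRNAs across eight samples.
mirnas = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
combined = [sum(col) for col in zip(*mirnas)]
# The target is repressed by the combined activity of all three miRNAs.
target = [-c for c in combined]

individual = [pearson(m, target) for m in mirnas]  # each about -0.58
joint = pearson(combined, target)                  # -1: fully explained jointly
```

Each miRNA accounts for only a third of the target's variance here, which is why a model that considers all targeting miRNAs together can recover regulation that pairwise correlation misses.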

There are a large number of other methods [54–56, 110, 136–144] that either infer
miRNA activity or predict miRNA targets based on the expression levels of their
sequence-predicted targets; however, no method both infers miRNA activity and predicts
miRNA targets while considering the impact of other miRNAs. For example, Cometa
attempts to predict miRNA targets by identifying tight, co-expressed clusters of
sequence-predicted targets [55]; however, it does not account for combinatorial regulation by multiple
miRNAs and provides no estimate of miRNA activity. Other methods, such as Sylamer
[54] and a number of web-based applications [138–140], identify miRNA seed regions
that are significantly enriched in the 3' UTRs of down-regulated transcripts as a way of
assessing the miRNA activity level in a tissue. Sylamer, however, does not take into account
the multiple-targeting effect of miRNAs and has not been used to score individual
miRNA-mRNA pairs. Other methods use paired miRNA-mRNA expression patterns
to augment sequence-based target prediction [40–53]. These methods typically require
paired miRNA and mRNA measurements in a large number of samples to generate reliable
predictions. This type of paired expression data is, however, rare and unavailable for
some miRNAs [145]. In contrast, a very large amount of mRNA expression
data is available for BayMiR. Two intronic miRNA target prediction methods, InMiR
and Hoctar [56, 110], predict intronic miRNA targets using the expression levels of
their host genes and can therefore also incorporate large mRNA expression datasets.
However, these methods can only be applied to intronic miRNAs, and only to those miRNAs
whose host gene expression is a good surrogate for their activity. Many host gene
expression levels are not good surrogates [110–113].

Our analysis also reveals that mRNAs with more target sites have higher expression
variation when compared to a random subset of genes, and that expression variance consistently
increases as the number of target sites increases (P < 10^-33, Fig. 4.17). These observations suggest

that mRNAs with highly variable expression levels are much more likely to be regulated

by miRNAs; our finding is consistent with recent reports that genes regulated by miRNAs

have higher expression variability among humans and between human and other primate

species [146].

miRNA transfection experiments have suggested that the degree of mRNA repression

induced by two seeds is equivalent to the product of repression induced by the seeds

individually [7]. We have observed a similar effect. The version of BayMiR described

here implicitly assumes multiplicative interactions because it log-transforms the mRNA

expression levels before performing regression. Applying BayMiR to non-transformed

expression levels assumes additive interactions and this version of BayMiR performs

much worse in our benchmarks (data not shown).
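As a brief worked illustration of why the log transform corresponds to multiplicative interactions (the symbols $x_0$, $r_1$ and $r_2$ are introduced here for illustration only and are not part of the original notation): suppose two miRNAs, acting independently, reduce the abundance of an mRNA from its unrepressed level $x_0$ to fractions $r_1$ and $r_2$ of that level, respectively. Then

$$x = x_0\, r_1\, r_2 \quad\Longrightarrow\quad \log x = \log x_0 + \log r_1 + \log r_2 ,$$

so each miRNA contributes an additive term on the log scale, which is the form that a linear model of log-transformed expression can represent. For instance, with $r_1 = 0.5$ and $r_2 = 0.8$ the combined repression leaves $0.4$ of the original level, and $\log_2 0.4 \approx -1.32 = \log_2 0.5 + \log_2 0.8 \approx -1 - 0.32$.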

In this chapter, we introduced BayMiR and demonstrated its merits when compared
to two state-of-the-art computational miRNA target prediction methods. BayMiR applies a
more relevant biological model and uses a large collection of gene expression data to decipher
the impact of miRNAs on gene expression. We measured this impact in terms of

endogenous target repression scores for about half a million miRNA-mRNA duplexes.

This new scoring strategy can be used alone or along with other sequence determinants

to predict functional miRNA-mRNA interactions.


[Figure 4.12 plot. x-axis: position contribution scores; y-axis: cumulative fraction; series: BayMiR scores > median, BayMiR scores < median; P < 10^-100.]

Figure 4.12: Blue: the position contribution scores of miRNA-mRNA pairs whose BayMiR scores are above the median BayMiR score. Red: the position contribution scores of miRNA-mRNA pairs whose BayMiR scores are below the median BayMiR score.


[Figure 4.13 plot. x-axis: λ (from 1 down to 1e-8, then 0); y-axis: E_RMS; series: training biological groups, test biological groups.]

Figure 4.13: BayMiR predicts down-regulated genes in samples not included in the training data. Blue circled line: prediction error on training data; red circled line: prediction error on test data.

[Figure 4.14 plots: nine panels, one per gene. x-axis: samples; y-axis: intensity.]

Figure 4.14: Estimated (red) and actual (blue) expression profiles of nine genes across 28 test samples.


[Figure 4.15 plot. x-axis: number of distinct seed matches in the 3' UTRs of 14,816 transcripts (log-scaled); y-axis: cumulative frequency; annotation: mean = 33, median = 16.]

Figure 4.15: The 3' UTRs of mRNAs harbor many conserved seed matches. Shown is the cumulative distribution of the number of seed matches in the 3' UTRs of 14,816 mRNA transcripts with at least one miRNA seed match.

[Figure 4.16 schematic. x-axis: sample; y-axis: expression level; traces: mRNA, miRNA 1, miRNA 2, miRNA 3, and miRNA1+miRNA2+miRNA3. Annotations: Corr = -0.25 (P < 0.75) for each individual miRNA; Corr = -1 for the combined profile.]

Figure 4.16: Example of combinatorial regulation masking inverse correlation. Shown in green is the expression level of a target gene and in red the expression levels of three targeting miRNAs. The negative correlation of each individual miRNA with the target is insignificant, but when considered together they perfectly explain the down-regulation impact of the miRNAs.


[Figure 4.17 plots. Top: x-axis: number of targeting miRNAs; y-axis: CDF; series: gene set with high variation, same-size random gene set; P < 10^-33. Bottom: x-axis: number of targeting miRNAs; y-axis: CDF; series: var > var 2-quantile (P < 10^-11), var > var 3-quantile (P < 10^-17), all genes.]

Figure 4.17: Gene expression variability increases as the number of target sites in the 3' UTR of genes increases. (top) miRNA targets have high expression variation. (bottom) Red and blue show the cumulative distributions for genes whose variance is larger than the median and the 75th percentile, respectively. Dark: cumulative distribution corresponding to all genes.

Chapter 5

Impact of miRNAs on long

non-coding RNAs

5.1 Background

Long non-coding RNAs (lncRNAs) are RNA transcripts, 200 to 100,000 nucleotides long,
that do not encode protein, most likely due to the lack of open reading frames. GENCODE
v14 has annotated 21,271 lncRNA transcripts (about 9,000 genes) located in genic (antisense
or intronic) and intergenic regions [147]. lncRNAs are expressed in the cytoplasm
and nucleus, can be spliced or unspliced, polyadenylated or not, and host many short
RNAs, notably snoRNAs and miRNAs [148]. In general, lncRNA abundance is low compared
to that of mRNAs but surprisingly high in some tissues, suggesting that lncRNAs may interact
with cell-specific protein complexes to regulate cell-specific gene expression [59]. Indeed,
recent studies have confirmed that lncRNAs participate in mRNA post-transcriptional
regulation by various mechanisms [58, 149], such as protecting or decaying mRNAs [150],
enhancing or inhibiting mRNA translation [61, 62], and perturbing miRNA activities in
the cell [60]. The latter is the subject of this study.

miRNAs are short non-coding RNAs that partially base pair to many regions in the


genome containing functional elements [101]. Many studies have shown that miRNAs
recruit the ribonucleoprotein complex called RISC to repress the expression of mRNAs
when the 5' end of the miRNA has nearly perfect base pairing to the 3' UTR of the mRNA
[1, 3, 6]. During the past decade, miRNA studies have mainly focused on their impact
on mRNA regulation, whereas the interaction between miRNAs and lncRNAs remains relatively
unexplored.

New studies have suggested that the excessive abundance of lncRNAs may act as a decoy
for miRNAs [150], establishing different hypotheses about the functional role of lncRNAs,
including: (i) lncRNAs may inhibit miRNA function by sequestering them; (ii) lncRNAs
may increase the expression levels of some mRNAs by acting as a sponge for the miRNAs
targeting these mRNAs; and (iii) miRNAs may repress lncRNAs by recruiting RISC in
a manner similar to the way they target mRNAs. Recent genome-wide annotation and profiling of
long non-coding RNAs has created an unprecedented opportunity to explore the role of lncRNAs
in post-transcriptional gene regulation. In this study, we analyzed the expression
abundance of 7,535 RNA transcripts, including 2,132 antisense lncRNAs (residing on the
opposite strand of protein-coding genes), 2,986 lincRNAs (residing in intergenic regions),
241 sense-intronic lncRNAs and 2,176 mRNAs, across 26 tissues and 5 cell lines. We
investigated whether miRNAs can have any impact on lncRNA abundance and whether
lncRNAs can sponge miRNAs to promote mRNA regulation. This study is the first
that explores RNA expression data across a large number of tissues to identify possible
interactions between lncRNAs and miRNAs. Juan et al. detected some lncRNAs that are
significantly anti-correlated with miRNAs that have seed match sites in these lncRNAs in
normal and tumor breast samples [151]. Guttman et al. employed lentiviral shRNAs to
silence 147 lncRNAs at an average efficacy of 75%, demonstrating that lncRNAs in general
are susceptible to regulation by Argonaute-small RNA complexes despite frequent
nuclear localization.

Our work revealed several important biological insights about interactions between
lncRNAs, mRNAs, and miRNAs. We found that the lncRNA target sets of some miRNAs
have relatively low abundance in the tissues in which these miRNAs are highly active,
suggesting that miRNAs may modulate the expression of these lncRNAs in some specific
tissues, similar to cell-type-specific miRNA-induced mRNA repression [34]. We also found that
lncRNAs and mRNAs that share many targeting miRNAs are significantly positively
correlated, indicating that this set of highly expressed lncRNAs may decoy the miRNAs
to promote mRNA regulation. Our analysis also showed that the lncRNAs that are highly
expressed in the cytoplasm are under selective pressure to have fewer target sites than
those highly expressed in the nucleus, suggesting that miRNAs may regulate
only cytoplasm-specific lncRNAs.

5.2 Results

5.2.1 lncRNA targets of some miRNAs have relatively low expression in the tissues in which the miRNAs are highly active

We tested whether the target set of a conserved miRNA family is repressed in a tissue
in which at least one member of the miRNA family is highly expressed compared to other
tissues. We extracted the list of lncRNA targets of 87 conserved miRNA families from
miRcode [152] (see Materials). For each miRNA family, we ranked the mean of the
expression levels of its targets across all tissues. We also ranked the expression levels of
all lncRNA targets across the tissues. We then computed the element-wise ratio of these
two rank vectors (i.e., the rank of the target set divided by the rank of all targets) and
sorted the tissues in ascending order of their ratio scores; thus, for a given miRNA,
the tissue with the smallest score is the tissue in which the target set of the miRNA has
the lowest expression level relative to the other tissues. Interestingly, among the
41 miRNA families considered in this test (number of targets > 10), we found that the lncRNA
targets of 13 miRNA families have the lowest expression ranks in the tissue in which a
member of these miRNA families is highly expressed, suggesting that these miRNAs may
have repressed the expression of their targets in the tissue (Figure 1a-h).
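The ranking procedure described above can be sketched as follows. Several details are assumptions made for this illustration rather than facts taken from the text: ranks are assigned in ascending order (rank 1 is the tissue with the lowest mean expression), the background is the mean over all lncRNAs in the input, ties are broken by tissue order, and the names `ranks` and `relative_rank_scores` are hypothetical.

```python
def ranks(values):
    """Rank 1 = smallest value; ties are broken by order of appearance."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def relative_rank_scores(expr, target_ids, tissues):
    """Return (tissue, score) pairs sorted by ascending relative rank score.

    expr: dict mapping lncRNA id -> list of expression values, one per tissue
    target_ids: ids of the lncRNA targets of one miRNA family
    """
    n = len(tissues)

    def tissue_means(ids):
        return [sum(expr[g][t] for g in ids) / len(ids) for t in range(n)]

    target_rank = ranks(tissue_means(target_ids))
    background_rank = ranks(tissue_means(list(expr)))
    # Element-wise ratio: rank of the target set divided by rank of all lncRNAs.
    scores = [a / b for a, b in zip(target_rank, background_rank)]
    return sorted(zip(tissues, scores), key=lambda pair: pair[1])


expr = {"a": [1, 5, 9], "b": [2, 6, 7], "c": [20, 1, 1]}
result = relative_rank_scores(expr, ["a", "b"], ["liver", "heart", "brain"])
```

In this toy input the targets "a" and "b" are lowest in liver while lncRNAs overall are highest there, so liver receives the smallest score and is listed first.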

We found that the target set of miR-375 has its lowest expression in esophagus
compared to other tissues (Figure 1a); Li et al. recently found that the miR-375 expression
level in normal esophagus is significantly higher than that in cancerous esophagus
[153], supporting our finding that highly expressed miR-375 in esophagus may regulate
its lncRNA target set in this tissue.

We also found that the target set of miR-101 has the lowest expression score in skeletal
muscle compared to other tissues (Figure 1b); Thomsen et al. profiled the expression
levels of 212 miRNAs using deep sequencing and found that miR-101 is the second
most highly expressed miRNA in this tissue after miR-1 [154], supporting evidence that miR-101
may mediate its target set in skeletal muscle. Another example is miR-122,
which was shown to be preferentially expressed in liver [64], where we found that the target set
of this miRNA has the lowest rank (Figure 1c).

We also found that the target set of miR-383 is repressed in liver compared to other
tissues (Figure 1d); miR-383 was shown to be expressed in liver resident stem cells
(HLSCs) [155]. In addition, our analysis shows that miR-145 may down-regulate the
expression level of its targets in heart; Li et al.'s experiments showed that miR-145 plays
an important role in regulating the mitochondrial apoptotic pathway in heart [156] (Figure
1e).

The target set of miR-34 has low expression levels in testicle and lung, where miR-34
has been measured to be highly expressed [157].

Using quantitative real-time RT-PCR assays, Wang et al. found that miR-23 is highly
expressed in liver, skeletal muscle, lung, heart, and kidney [158]. Interestingly, we found
that the target set of miR-23 has low relative expression scores in skeletal muscle (rank
1), kidney (rank 2), and heart (rank 4) (Figure 1f), suggesting that miR-23 mediates the
expression of its potential targets in these tissues. Additionally, miR-129 is known
to be a cerebellum-specific miRNA [159], and in our list cerebellum is the tissue in
which the lncRNA targets of miR-129 have the lowest expression compared to other tissues
(Figure 1g). Also, miR-203 has been shown to be expressed in normal bladder tissue [160],
and we found that the target set of miR-203 has the lowest relative expression in bladder
compared to other tissues (Figure 1h). Recently, Gou et al. [161] detected, using Northern
blot experiments, that miR-148a expression is relatively high in intestine, stomach, heart,
colon and liver. Our analysis shows that the target set of miR-148a has its lowest
expression in intestine compared to the other tissues (Figure 1i). Our results also show
low relative expression of the target set of miR-148 in brain-related tissues, where Gou
et al.'s experiments did not find a high expression level for miR-148. Another miRNA
in our list is miR-133, whose targets have low relative expression scores in skeletal
muscle (rank 1) and brain (ranks 3 and 4) (Figure 1j). Hon et al. found that miR-107
and miR-133 are indeed strongly expressed in brain and muscle [162]. Finally, miR-125
is expressed in normal bladder and suppresses the development of bladder cancer
by targeting E2F3 [163]. The miR-125 target set has a low expression score (rank 1) in our
analysis (Figure 1k).

5.2.2 lncRNAs that are significantly positively correlated with mRNAs may decoy their common targeting miRNAs

Some lncRNAs that contain miRNA target sites are suggested to compete with mRNAs to
bind to miRNAs and thereby indirectly interfere with the post-transcriptional regulation
of mRNAs [150]. In this case, lncRNAs are said to act as miRNA sponges. For example,
the lncRNA linc-MD1 has been shown to positively regulate the expression of the mRNAs MAML1
and MEF2C by acting as a sponge for their targeting miRNAs, miR-133 and miR-135
[164]. We investigated whether mRNA-lncRNA pairs that share common miRNA target


[Figure 5.1 plots: bar charts, one panel per miRNA family. Panel titles: miR-375; miR-101/101ab; miR-383; miR-145; miR-22/22-3p; miR-34ac/34bc-5p/449abc/449c-5p. x-axis: tissues sorted by score; y-axis: relative rank score.]

Figure 5.1: lncRNA targets have low expression in tissues where their targeting miRNAs are highly expressed. Each subplot shows the relative expression score of lncRNA targets for given miRNAs across 26 tissues.


sites are enriched for significantly positively or negatively correlated pairs. We computed
the correlation coefficients between all lncRNAs and mRNAs in the dataset and excluded
insignificantly correlated pairs (i.e., those with P > 0.05). The mRNA-lncRNA pairs were
sorted based on the relative number of shared miRNA seed matches, computed as
the number of common miRNA seed match sites between the lncRNA and the mRNA divided
by the length of the lncRNA or the length of the 3' UTR of the mRNA, whichever is greater.
We observed that almost half of the significantly correlated pairs share at least one target site
(Figure 2a). We performed a hypergeometric test (see Materials) to examine whether highly
positively correlated pairs are enriched at either end of the sorted list. Figure 2b shows
the enrichment plot of positively correlated pairs in the top M pairs of the sorted list as
M increases. We observed that the top 400 pairs in the sorted list (those that share about 20%
of their targeting miRNAs) are significantly enriched for positively correlated
mRNA-lncRNA pairs (P < 0.01). We did not observe any significant enrichment when
M > 400. We also observed the same enrichment pattern when we did not divide the
number of common miRNA target sites by the transcript length (Figure 2c). To test
whether pairs with shared miRNAs are enriched for negatively correlated pairs, we repeated
the analysis, this time searching for negatively correlated pairs; however, we could not
find any enrichment for the set of negatively correlated lncRNA-mRNA pairs in the top M
pairs of the list (data not shown). Although the enrichment level is not remarkably
significant (P < 0.01), this analysis may suggest some biological insights about the possible
role of lncRNAs as miRNA sponges, as described in the following. Earlier we discussed
that lncRNAs may sponge miRNAs and so indirectly increase the expression levels of those
mRNAs that otherwise would have been the targets of the sponged miRNAs. However, what
happens to lncRNAs after sponging miRNAs is not clear. In order for a lncRNA to
act as an effective miRNA sponge, it should be highly expressed in the cell [60]; the lncRNA
transcript, after sponging miRNAs, may be degraded; if this occurs, the expression levels
of lncRNAs and mRNAs that compete for miRNAs should be negatively correlated. Our


[Figure 5.2 plots. (a) x-axis: mRNA-lncRNA pairs sorted by relative number of shared miRNAs; y-axis: relative number of shared miRNAs. (b) x-axis: top M pairs (M from 400 to 14,400), sorted by relative number of shared miRNAs with mRNAs; y-axis: -log10 P (hypergeometric test); series: positive correlation, negative correlation, cutoff line. (c) As (b), with pairs sorted by the number of shared miRNAs.]

Figure 5.2: (a) Relative number of common miRNA target sites in all significantly correlated lncRNA-mRNA pairs. (b) Enrichment plot for the set of positively correlated pairs in the set of lncRNA-mRNA pairs sorted based on the relative number of common miRNA target sites. (c) Same as (b), but the number of common miRNA target sites is not divided by the length of the transcripts. The gray horizontal line depicts the cut-off line, i.e., P = 0.05.

analysis, however, does not support this hypothesis, since we found that lncRNA-mRNA
pairs that share many seed match sites are more often positively than negatively correlated. In
conclusion, our analysis in this section supports the following mechanism for lncRNAs
as miRNA sponges. When an mRNA and a lncRNA share more than 20% of their targeting
miRNAs, the lncRNA may act as a miRNA sponge. In this mode, the lncRNA
positively and indirectly regulates the mRNA expression. Additionally, the lncRNA transcript is
not degraded upon binding miRNAs, possibly because these miRNAs do not recruit RISC
to mediate the expression of lncRNAs.


5.2.3 Highly expressed lncRNAs in the cytoplasm contain significantly fewer seed match sites than those in the nucleus

Since mature miRNAs are formed in the cytoplasm, we tested whether cytoplasm-specific
lncRNAs are under stronger selective pressure to base pair with miRNAs than nucleus-specific
lncRNAs. We found that lncRNAs that are highly expressed in the cytoplasm have
fewer seed match sites than those highly expressed in the nucleus (Figure 5.3). We
used the RNA abundance in the cytoplasm and nucleus measured using RNA-seq in six
cell lines: GM12878, HepG2, HUVEC, K562, NHEK, and HeLaS3 [147]. To restrict the analysis to reliable
RNA-seq measurements, we excluded transcripts whose RPKM < 1 in both the cytoplasm
and the nucleus. We declare a transcript highly expressed in the nucleus if the ratio of
its RPKM in the nucleus to its RPKM in the cytoplasm is greater than 10, and analogously for
transcripts highly expressed in the cytoplasm; we obtained 33 cytoplasmic lncRNAs and 104
nuclear lncRNAs out of a total of 866 RNA-seq-measured transcripts in the six cell lines. To
test the possible repression of lncRNAs by miRNAs in each compartment, we compared
the expression levels of target and non-target lncRNAs in the cytoplasm and nucleus.
We reasoned that if mature miRNAs are formed in the cytoplasm and if miRNAs repress
lncRNAs in the cytoplasm, the target transcripts should have lower expression levels
than non-target transcripts. Surprisingly, we found the opposite. We
found that lncRNA targets have higher median expression in the cytoplasm than
lncRNA non-targets, and the opposite in the nucleus. However, the higher expression is not
statistically significant except in the HeLaS3 cell line (Table 5.1, third row). In conclusion,
we could not find any significant difference between the expression levels of target
and non-target lncRNAs expressed in the cytoplasm and nucleus. Surprisingly, however,
for one cell line, HeLaS3, we found that the relative expression of targets is lower than that
of non-targets in the nucleus, where mature miRNAs are not thought to be expressed.
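The filtering and compartment assignment described above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the function name `classify_compartment`, its parameters, and the toy RPKM values are hypothetical, and transcripts that pass the filter but show less than a 10-fold difference are simply labelled "neither".

```python
def classify_compartment(rpkm_cyto, rpkm_nuc, min_rpkm=1.0, fold=10.0):
    """Label a transcript by compartment, or return None if it is filtered out."""
    # Exclude transcripts whose RPKM is below the floor in BOTH compartments.
    if rpkm_cyto < min_rpkm and rpkm_nuc < min_rpkm:
        return None
    # Compare against a product instead of a ratio to avoid dividing by zero.
    if rpkm_nuc > fold * rpkm_cyto:
        return "nucleus"
    if rpkm_cyto > fold * rpkm_nuc:
        return "cytoplasm"
    return "neither"


# Hypothetical (cytoplasm, nucleus) RPKM values, for illustration only.
toy = {
    "lncA": (0.5, 12.0),
    "lncB": (30.0, 2.0),
    "lncC": (4.0, 5.0),
    "lncD": (0.2, 0.3),
}
labels = {name: classify_compartment(c, n) for name, (c, n) in toy.items()}
print(labels)
```

With a 10-fold threshold most transcripts fall into the "neither" class, which is consistent with only a small subset of the measured transcripts being assigned to a compartment in the text.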


[Figure 5.3. x-axis: number of targeting miRNAs per lncRNA length; y-axis: cumulative distribution. Legend: nucleus lncRNA targets, cytoplasm lncRNA targets. In-plot annotations as printed: "P < 0.02" and "p-value = 0.027691".]

Figure 5.3: Highly expressed lncRNAs in the cytoplasm have fewer seed match sites than those expressed in the nucleus. Shown are the cumulative distributions of the number of seed match sites per transcript length for lncRNAs expressed in the nucleus (red) and cytoplasm (green).

Cell line   Median expr.   Median expr.   R1     R2     R2/R1   P-value       P-value
            cytoplasm      nucleus                              cytoplasm     nucleus
            (RPKM)         (RPKM)
GM12878     1.55           1.70           1.20   0.93   0.77    0.13          0.31
HeLaS3      1.35           2.15           1.32   0.86   0.65    0.01          0.0098
HepG2       1.40           1.64           1.30   1.17   0.89    0.01          0.71
HUVEC       1.50           1.65           1.20   1.02   0.84    0.11          0.59
K562        2.10           2.32           0.90   0.81   0.90    0.60          0.25
NHEK        1.95           1.56           1.11   1.05   0.95    0.71          0.63

R1 = ratio of target to non-target expression in the cytoplasm; R2 = ratio of target to non-target expression in the nucleus. The P-values compare target with non-target expression in the cytoplasm and in the nucleus, respectively.

Table 5.1: Comparison between the expression levels of cytoplasmic and nuclear lncRNAs (columns II-IV) and the statistical significance of the comparison (columns V-VII); each row is associated with one cell line.

5.2.4 High relative number of lncRNA targets in allosomes and

chromosomes 20-22

Our analysis showed that the distribution of annotated lncRNAs across the chromosomes is
very similar to that of mRNAs (Figure 5.4a and b). We found that the expression of
lncRNA targets is independent of their genomic loci. To check this, we plotted the
sorted expression levels along with the chromosomal locations for all lncRNAs. Overall, we
observed that the expression level of target lincRNAs, averaged across samples, is independent
of their locations in the genome (Figure 5.4c). Interestingly, we observed that although
the number of lncRNA targets is much smaller than that of lncRNA non-targets (about 14%), their
relative number in some chromosomes is much higher than that of non-targets, especially
in chromosomes 20-22, X, and Y (Figure 5.4d); more than 10% of detected miRNAs
are located on chromosome X, suggesting that miRNAs and lncRNAs may interact at
this genomic locus more than at others. To check whether any category of lncRNA targets
is predominantly located on a specific chromosome, we bar-plotted the distribution of each
category across chromosomes and found no propensity for any lncRNA target category
to be located on any specific chromosome (Figure 5.4e).

5.2.5 LncRNAs that contain seed match sites have significantly

higher expression compared to those that lack seed match

sites

Since miRNAs tend to repress the expression levels of their targets, one way to check
the activity of miRNAs is to compare the expression of target and non-target transcripts
in different tissues. We compared the expression levels of lncRNAs that contain seed
match sites (target lncRNAs) and those that lack seed match sites (non-target lncRNAs)
within 26 tissues. We observed that target lncRNAs are expressed at significantly higher levels than
non-target lncRNAs in each of the 26 tissues (Figure 5.5a), suggesting either that miRNAs
have a fine-tuning impact on highly expressed lncRNAs or that the binding miRNAs do not
participate in the post-transcriptional regulation of lncRNAs. For the former scenario, it is difficult to quantify
this impact unless we conduct miRNA-induced repression experiments similar to
those available for mRNAs [33]. We also analyzed the distribution of the expression levels
of mRNAs, lincRNAs, antisense lncRNAs, and sense intronic lncRNAs. As shown in the


[Figure 5.4, panel axes as extracted: number of lncRNAs vs. chromosome no.; expression level vs. lincRNAs and chromosome location vs. lincRNAs (panel title: "Association between lincRNAs and number of miRNA targeting sites"); fraction of lncRNAs vs. chromosome no. (legend: non-target, target); fraction of lncRNAs vs. chromosome no. (legend: lincRNA, antisense lncRNA, sense intronic lncRNA); # of lncRNAs vs. chromosome no. (legend: non-target, target).]

Figure 5.4: Distribution of lncRNAs on the human chromosomes. (a) The distribution of all lncRNAs. (b) Sorted expression levels of lncRNA targets superimposed with the chromosomal locations of each lncRNA. (c) The distribution of mRNAs (from wiki). (d) The distribution of targets and non-targets. (e) The distribution of the relative number of target and non-target lncRNAs; relative numbers are computed as # of (non-)targets in each chromosome / # of all (non-)targets. (f) The distribution of all categories of lncRNA targets.

cumulative distributions of Figure 5.5b, mRNAs are expressed at far higher levels than all types of lncRNAs;
among lncRNAs, lincRNAs are expressed at higher levels than the overlapping lncRNAs. Next, we compared
the expression levels of genes harboring miRNA target sites with those lacking sites
for all four classes of RNAs: mRNAs, lincRNAs, antisense lncRNAs, and sense intronic
lncRNAs. We found that, in all four classes, RNAs harboring target sites are expressed at a significantly
higher level (Figure 5.5c). Only 14% of lncRNAs contain miRNA target sites, far fewer than
the 90% of mRNAs that contain miRNA target sites.


[Figure 5.5, panels (a) and (b). (a) y-axis: mean expression level; x-axis: tissues, with labels as printed (some truncated): Testis, Hipocamp, Kidney, Amigdala, Temporal Superior, Cortex Parietal, Colon, Entorrinal, Intestine, Cortex Frontal, Placenta, Trachea, Human Brain, Esophagus, Bladder, Ovary, Skeletal Muscle, Cerebel, Liver, Heart, Cervix, Lung, Spleen, Adipose, Thymus, Mesencefal; legend: target lncRNAs, non-target lncRNAs. (b) x-axis: log expression abundance; y-axis: cumulative distribution; legend: lincRNA, antisense lncRNA, sense intronic lncRNA, mRNA.]

Figure 5.5: Abundance of target and non-target lncRNAs in 26 different tissues. (a) Bar plot of the mean expression level of target and non-target lncRNAs in each tissue. (b) The cumulative distributions of the expression levels of lncRNAs and mRNAs measured in the microarray study.


5.2.6 Discussion

The functional roles of long non-coding RNAs are still debated, but recent studies show
that their loss of function impacts gene regulation, at least in some tissues [58, 59]. Long non-coding
RNAs encode many miRNAs and contain target sites matching the seed regions
of many miRNAs, suggesting interaction between long and short non-coding RNAs. In
this study, using microarray data, we explored the possible interaction between miRNAs
and lncRNAs and the impact of this interaction on mRNAs. We found that the lncRNA
target sets of 11 miRNA families have their lowest relative expression in the tissue in which
a member of the miRNA family is highly expressed. This observation, however, does not
apply to all tissue-specific miRNAs. For instance, miR-1 is reported to be highly
expressed in heart and skeletal muscle, but in our ranked list these tissues are placed 19th
and 22nd, respectively. This suggests that high expression of a miRNA does not necessarily
imply that it is functional, as observed in other studies [144]. Our analysis is similar
to that of Sood et al. [34], conducted on 87 tissues to predict the mRNA target sets of miRNAs
in tissues. They found that the mRNA target sets of eight highly tissue-specific miRNAs
are expressed at significantly lower levels in the tissue in which the miRNAs are reported to be
highly expressed than in other tissues.

Studies conducted on mRNAs found no significant difference
between the expression levels of target and non-target mRNAs [34], but for lncRNAs,
surprisingly, we found that targets have significantly higher expression than
non-targets in each individual tissue. One possible explanation might be that highly
expressed lncRNAs are also highly conserved, so that they contain more conserved target
sites. However, we rule out conservation, since miRcode identifies all seed match sites
regardless of whether they are conserved. miRNAs are suggested to have a fine-tuning impact on
their targets, which may explain why highly expressed lncRNAs contain more target sites. In
this scenario, miRNAs participate in the post-transcriptional regulation of lncRNAs to fine-tune
their expression levels. Therefore, lncRNAs that are not expressed or have low expression do not
require miRNAs to mediate their regulation.

Our study is not conclusive enough to show whether miRNAs target lncRNAs in the
cytoplasm or the nucleus. We could not find any significant difference between the expression
levels of lncRNA targets in the nucleus and the cytoplasm. Nonetheless, we found that
highly expressed lncRNAs in the cytoplasm tend to have fewer target sites than
those in the nucleus. This suggests the hypothesis that, since mature miRNAs are generated in
the cytoplasm, they target lncRNAs there, so highly expressed lncRNAs have been evolutionarily
selected to have fewer target sites and thus escape post-transcriptional repression by miRNAs.
Lastly, we found that the relative number of lncRNAs containing seed match sites is high
in chromosomes 20-22, X, and Y. Interestingly, X-chromosome miRNAs, which constitute
10% of all identified human miRNAs, are suggested to have potential roles in the human
immune system and schizophrenia [165, 166]. The large number of lncRNAs, along with
miRNAs, may indicate that both modulate the other cis functional elements
on this chromosome.

5.3 Materials

5.3.1 Microarray data

We processed the expression levels of 7,535 transcripts (52,375 unique probes) across
31 samples from the GSE34894 microarray dataset, which uses the GPL15094 platform. The
transcripts include 2,132 antisense lncRNAs (residing on the opposite strand of protein-coding
genes), 2,986 lincRNAs (residing in intergenic regions), 241 sense intronic lncRNAs,
and 2,176 mRNAs. Many
lncRNAs have low expression levels or are not expressed at all, so we excluded them from
the analysis. To do this, we averaged the probe expression levels per transcript,
computed the coefficient of variation (CV) of the expression levels across the samples, and
excluded transcripts with CV < 0.1. After applying this filter, 791 transcripts were left,
including 603 mRNAs, 111 lincRNAs, 70 antisense lncRNAs, and 7 sense lncRNAs. The
coefficient of variation of each transcript is defined as the ratio of the standard deviation to the
mean of the expression levels of the transcript across samples.
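A minimal sketch of the probe averaging and CV filter described above is given below. The function names and toy values are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state which estimator was used; expression values are assumed to have a non-zero mean.

```python
from statistics import mean, pstdev


def average_probes(probe_rows):
    """Average probe-level profiles (one list per probe) into one transcript profile."""
    return [mean(column) for column in zip(*probe_rows)]


def coefficient_of_variation(values):
    """Standard deviation divided by mean (population SD is an assumption here)."""
    return pstdev(values) / mean(values)


def filter_by_cv(probes_by_transcript, min_cv=0.1):
    """Keep transcripts whose CV across samples is at least min_cv."""
    kept = {}
    for name, probe_rows in probes_by_transcript.items():
        profile = average_probes(probe_rows)
        if coefficient_of_variation(profile) >= min_cv:
            kept[name] = profile
    return kept


# Hypothetical data: two probes for a flat transcript, one probe for a variable one.
toy = {
    "tx_flat": [[5.0, 5.1, 4.9, 5.0], [5.0, 4.9, 5.1, 5.0]],
    "tx_var": [[2.0, 4.0, 6.0, 8.0]],
}
kept = filter_by_cv(toy)
print(sorted(kept))
```

A CV threshold removes transcripts that barely vary across samples, which matters here because nearly constant profiles cannot yield informative correlations in the next step.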

5.3.2 Measuring correlation between mRNAs and lncRNAs

We computed the Pearson correlation coefficients between the expression levels of all
transcript pairs and sorted the pairs by correlation coefficient in ascending order.
For each pair of transcripts, we obtained the correlation coefficient, the number of
targeting miRNA families, and the number of shared targeting miRNAs. We focused on
mRNA-lncRNA pairs that are significantly negatively or positively correlated (3,066 pairs
with Pearson correlation coefficient < −0.351 and P < 0.05). To examine which lncRNA
group (lincRNA, sense lncRNA, or antisense lncRNA) is more positively or negatively
co-expressed with mRNAs, we compared the cumulative distributions of the correlation coefficients
associated with mRNA-lincRNA (14,472), mRNA-antisense lncRNA (3,618), and
mRNA-sense lncRNA (1,809) pairs; we observed that sense lncRNAs are less negatively
correlated with mRNAs than the two other groups, which exhibit the same correlation
patterns with mRNAs.
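The pairwise correlation step can be illustrated with a short standard-library sketch. The names `pearson` and `correlated_pairs` and the toy profiles are hypothetical; the significance filtering (P < 0.05) mentioned in the text is not reproduced here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlated_pairs(mrnas, lncrnas):
    """Return (r, mRNA name, lncRNA name) for every pair, sorted by r ascending."""
    pairs = [
        (pearson(m_profile, l_profile), m_name, l_name)
        for m_name, m_profile in mrnas.items()
        for l_name, l_profile in lncrnas.items()
    ]
    return sorted(pairs)


# Hypothetical expression profiles across four samples.
mrnas = {"mRNA1": [1.0, 2.0, 3.0, 4.0]}
lncrnas = {"lncA": [2.0, 4.0, 6.0, 8.0], "lncB": [4.0, 3.0, 2.0, 1.0]}
pairs = correlated_pairs(mrnas, lncrnas)
print(pairs)
```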

5.3.3 Hyper-geometric test analysis

The test was carried out in the following manner. Let N and C denote the total number
of mRNA-lncRNA pairs and the number of positively correlated pairs in the list, respectively.
We chose the top M pairs in the sorted list and counted the number of positively
correlated pairs (say K) in this set. We then tested the statistical significance of observing
K positively correlated pairs in the set of top M pairs, compared with C correlated
pairs in the set of N pairs, using the hypergeometric test. M was set to 400 and increased
with a step size of 400 to cover all pairs in the list.
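The procedure above can be sketched as follows, using the symbols N, C, M, and K from the text. This is an illustration under stated assumptions: the thesis does not say which tail of the hypergeometric distribution was used, so the one-sided upper tail P(X >= K) is assumed, and the function names and toy list are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(N, C, M, K):
    """P(X >= K) when M pairs are drawn without replacement from N pairs,
    C of which are positively correlated."""
    total = comb(N, M)
    largest = min(C, M)
    favourable = sum(comb(C, k) * comb(N - C, M - k) for k in range(K, largest + 1))
    return favourable / total


def enrichment_curve(is_positive, step=400):
    """is_positive: booleans in sorted-list order. Returns [(M, K, p_value), ...]."""
    N, C = len(is_positive), sum(is_positive)
    curve = []
    for M in range(step, N + step, step):
        M = min(M, N)  # the last window covers the whole list
        K = sum(is_positive[:M])
        curve.append((M, K, hypergeom_upper_tail(N, C, M, K)))
    return curve


# Hypothetical sorted list of 10 pairs; True marks a positively correlated pair.
toy = [True, True, True, False, False, True, False, False, False, False]
curve = enrichment_curve(toy, step=4)
print(curve)
```

Plotting -log10 of the third element against M would give an enrichment curve of the kind shown in Figure 5.2b; by construction the p-value equals 1 once M covers the whole list.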


5.3.4 Identifying the complementary seed match sites in the

lncRNA transcripts

We used the miRcode repository to obtain the list of lncRNAs that have complementary
match sites to the seed regions of 87 highly conserved miRNA families. miRcode scans
the entire genome and uses the GENCODE v10 annotation and the TargetScan protocol
for defining seed matches to identify the target sets for lncRNAs, mRNAs, and pseudogenes.
miRcode identifies 1,048,575 target sites residing in 25,973 transcripts and complementary to
the seed regions of 87 highly conserved miRNA families.
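To make the notion of a seed match site concrete, the sketch below scans a transcript for perfect matches to the reverse complement of miRNA nucleotides 2-8. It is a deliberate simplification and not miRcode's implementation: the TargetScan protocol distinguishes several site types, whereas only a single 7-nt match is checked here. The function names, the example miRNA string, and the toy transcript are illustrative inputs.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_match_site(mirna):
    """Reverse complement of miRNA nucleotides 2-8, written 5'->3' as it
    would appear in the target transcript."""
    seed = mirna[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(transcript, mirna):
    """Return 0-based start positions of perfect 7-nt seed matches."""
    site = seed_match_site(mirna)
    rna = transcript.upper().replace("T", "U")  # accept DNA-style input
    return [
        i for i in range(len(rna) - len(site) + 1)
        if rna[i:i + len(site)] == site
    ]


# Illustrative inputs only.
example_mirna = "UGGAAUGUAAAGAAGUAUGUAU"
toy_transcript = "GGCACAUUCCAAUUACAUUCCG"
hits = find_seed_matches(toy_transcript, example_mirna)
print(seed_match_site(example_mirna), hits)
```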

Chapter 6

Conclusions and Future Work

6.1 Conclusions

miRNAs participate in many aspects of cell biology by regulating gene expression.
miRNAs can be anti-oncogenic, as they regulate some oncogenes. Since 2000, scientists have
been collecting evidence to understand how miRNAs recognize their targets. Unfortunately,
even with recent advances in technology (e.g., PAR-CLIP) for identifying the
binding sites of microRNA-containing ribonucleoprotein complexes, we are still unable
to identify genuine miRNA targets genome-wide; the elements involved in the miRNA-induced
regulation machinery are well identified, but it is unclear under which conditions
this regulatory machinery is active. Experimental methods identify target sets only
under specific conditions and for a limited number of miRNAs; therefore, it is necessary to
develop high-throughput computational methods. Initially, it was thought that mRNA-miRNA
sequence-based determinants could provide accurate and comprehensive prediction
of targets; later, however, many performance evaluation studies showed that these methods
have low specificity and sensitivity and, surprisingly, generate inconsistent target sets,
necessitating the use of other evidence to augment sequence-based determinants. If miRNAs
have a detectable impact on mRNA regulation under endogenous conditions, mRNA


expression data are the most abundant and informative resource that can be explored
to track the footprint of miRNAs. In this thesis, we used gene expression data to refine
the target sets of miRNAs predicted by sequence-based prediction methods. In contrast
to many computational prediction methods, we considered the multiple-targeting effect
of miRNAs and did not use any miRNA expression data for prediction. Given an
mRNA with multiple conserved seed match sites for one or more miRNAs,
our methods score each individual miRNA based on its impact on repressing the mRNA
expression measured under endogenous conditions and in the presence of other targeting miRNAs.
The BayMiR and InMiR packages are available and can easily be applied to new datasets.
We used experimentally validated target sets and miRNA over-expression experiments,
two widely used benchmarks, to evaluate the merits of our methods in comparison with
the best available methods. We introduced InMiR, which predicts the target sets of intronic
miRNAs and estimates the possible co-expression of an intronic miRNA and its host mRNA.
We found that 22 out of 57 tested intronic miRNAs are co-expressed with their host genes. We
showed that the targets predicted by InMiR are highly enriched for validated targets.

We developed BayMiR, a Bayesian model that predicts miRNA targets using a large
set of mRNA expression data. BayMiR obviates the need for miRNA expression data, which are not
available globally. We showed that the scores provided by BayMiR better reflect the miRNA
targeting impact than sequence features, namely the nine determinants provided by TargetScan,
one of the most widely used prediction techniques. In addition, we showed that BayMiR
outperforms CoMeTa, a recent advanced prediction method that uses gene expression
data.

In this thesis, we also studied the possible interaction between miRNAs and lncRNAs.
We found some evidence that supports the role of lncRNAs as miRNA sponges, as well as
that of miRNAs as condition-specific regulators of lncRNAs. The human genome encodes a large
number of lncRNAs, many of which are functional. Our work is the first to incorporate
a large set of lncRNAs to explore the interaction between miRNAs and lncRNAs and
their indirect impact on mRNAs.

6.2 Future directions

6.2.1 A Bayesian approach to decipher the TF-miRNA-mRNA-

lncRNA regulatory network

The function of a vast portion of the human genome, which consists of nearly three billion
bases, is still unknown. Researchers have already identified the regions that encode proteins,
which comprise less than 2% of the genome. Recently, the ENCODE project released an
unprecedentedly expansive resource of genomic data that illuminates the possible functional
elements of 80% of the human genome, much of which is transcribed into functional
non-coding RNAs [167]. This new resource has not only transformed biologists’ view
of the genome but also presents new computational and data-analysis challenges in genomics.
With such a resource in hand, we propose a Bayesian graphical model to study
the interaction between mRNAs, lncRNAs, miRNAs, and TFs. Our proposed network
consists of four functional elements: transcription factors (TFs), protein-coding RNAs
(mRNAs), long non-coding RNAs (lncRNAs), and miRNAs. Fig. 6.1 shows a graphical
representation of the proposed model and the presumed causal relationships between variables.
In this model, TF activities control the transcription rates of mRNAs, lncRNAs,
and miRNAs. Subsequently, miRNAs regulate the expression of mRNAs and lncRNAs.
This model provides a wiring diagram for a cell with which we ultimately hope to predict
the impact of post-transcriptional elements on unexplored sequences, expanding insight
into the function of lncRNAs. Some lncRNAs have been shown to encode defined products
whose sequence variants are linked to human disease. Additionally, these data sets allow us
to explore the biological function of these RNAs in major cellular sub-compartments and

different cell lines. Unlike mRNAs, lncRNAs have restricted expression in only a sub-


[Figure 6.1: the extracted figure page reproduces the research proposal text of this section together with a diagram of the graphical model. Node labels: TF, mRNA, miRNA, lncRNA. Variable labels, as far as recoverable from the diagram: y_g, z_l, w_k, t_n, s_{g,k}, s_{l,k}, and a posterior query conditioned on these variables.]

Figure 6.1: A graphical representation of the proposed method

population of cells. Therefore our model will be tuned to work under individual cell types

and cellular compartments. The model aims to compute the posterior probability of
binary variables that indicate the impact of TFs/miRNAs on mRNA/lncRNA
regulation. In this model, inference is carried out using variational or stochastic sampling
methods, which we have used in our previous work. The components of the
posterior probability, i.e., the likelihood and the prior probabilities, are obtained, respectively,
from the ENCODE catalogue and from the sequence-matching determinants already used
for miRNA-mRNA pairs. Although many aspects of post-transcriptional regulation are
yet to be fully explored, we hope this model sheds more light on the global impact of miRNAs
on the major functional elements of every cell of every person and across time.
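As a hedged illustration of the posterior computation described above, let s denote a single binary interaction indicator and D the observed expression data. Both symbols are introduced here only for illustration and are not the notation of Figure 6.1. For one indicator considered in isolation, Bayes' rule gives:

```latex
p(s = 1 \mid D)
  = \frac{p(D \mid s = 1)\, p(s = 1)}
         {p(D \mid s = 1)\, p(s = 1) + p(D \mid s = 0)\, p(s = 0)}
```

Following the text, the likelihood terms p(D | s) would be derived from expression data and the prior p(s = 1) from sequence-matching determinants. When many indicators are coupled through shared miRNAs and TFs, the normalizing sum runs over all joint configurations and cannot be enumerated, which is why approximate inference (variational or sampling-based) is proposed.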

6.2.2 Identifying lncRNA binding sites complementary to mRNA

sequences

Recent studies have shown that lncRNAs partially bind to mRNAs and promote or inhibit
their regulation. However, no genome-wide bioinformatic method is available to
provide a map of these interactions; such a network could provide biological insight
into the possible regulatory impact of lncRNAs on mRNAs. We have implemented a local
sequence alignment method that can be readily applied to identify local binding regions.
Since we have no information about the length and strength of these complementary
regions, the penalty scores for gaps, gap extension, wobble pairs, and mismatches should be carefully
tuned. We expect to obtain a large number of hits, so we need to perform statistical
tests to refine them. How to perform these statistical tests is unclear. Identifying these
interactions provides a lncRNA-mRNA network, which can be analyzed in various
ways. For instance, we can perform enrichment analysis to explore whether the set of mRNAs
targeted by a lncRNA is enriched in a particular pathway or process.
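A local alignment of this kind can be sketched with a Smith-Waterman recursion that scores base pairing instead of sequence identity. This is not the implementation mentioned above: the scores (Watson-Crick +2, wobble +1, mismatch -3), the single linear gap penalty used in place of separate gap-opening and gap-extension penalties, and all names and sequences are assumptions chosen for illustration.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_score(a, b, match=2, wobble=1, mismatch=-3):
    """Score one candidate base pair (hypothetical, untuned values)."""
    if (a, b) in WATSON_CRICK:
        return match
    if (a, b) in WOBBLE:
        return wobble
    return mismatch


def best_local_duplex(lncrna, mrna, gap=-4):
    """Smith-Waterman over the lncRNA (5'->3') and the reversed mRNA.

    Returns (best_score, end_i, end_j), where end_i and end_j are 1-based
    end positions in lncrna and in the reversed mrna, respectively.
    """
    target = mrna[::-1]  # duplexes are antiparallel, so read the mRNA 3'->5'
    cols = len(target) + 1
    prev = [0] * cols
    best = (0, 0, 0)
    for i in range(1, len(lncrna) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + pair_score(lncrna[i - 1], target[j - 1])
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best[0]:
                best = (curr[j], i, j)
        prev = curr
    return best


# Toy sequences: the lncRNA segment GGCUACGA is complementary to UCGUAGCC in the mRNA.
toy_lncrna = "CCGGCUACGACC"
toy_mrna = "AAAUCGUAGCCAAA"
best = best_local_duplex(toy_lncrna, toy_mrna)
print(best)
```

Only two rows of the dynamic programming matrix are kept, which suffices for reporting the best score and its end point; recovering the aligned region itself would require storing traceback pointers.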

6.2.3 Using sequence and expression evidence in parallel

Many functional elements of the human genome are potential targets of miRNAs because,
in mammals, miRNA-induced regulation requires only partial base pairing. Although
perfect base pairing to the seed region of miRNAs has so far been the most prominent
feature in recognizing miRNA targets, many validated targets contain mismatches and
wobbles in their seed match sites. Moreover, many studies have demonstrated the importance
of other sequence structure and base pairing beyond seed match sites, including
symmetric loops in the centre of the duplex and sequence matches at the 3’ end of miRNAs.
Current methods commonly apply dynamic programming sequence alignment techniques
to score the strength of mRNA-miRNA duplexes. These scoring techniques use a set
of heuristically tuned gap, gap extension, and mismatch penalties to obtain a final score; in
addition, they use a heuristically determined threshold to filter out low-scoring duplexes.
Since these parameters (e.g., threshold, gap, gap extension, and mismatch penalties) are
heuristically chosen and globally applied, they may not reflect the strength of mRNA-miRNA duplexes
under specific conditions. For instance, one particular miRNA may have a lower
alignment score than another but be functional under a specific condition; in this case,
using a set of fixed parameters may not reveal the actual targets. These shortcomings necessitate
the use of probabilistic models that can effectively learn and infer condition-specific
mRNA-miRNA interactions that encompass complicated and case-specific base
pairing. One important aspect of our model is that, in contrast to previous models that
use the expression data as a way to refine the interactions determined by sequence evidence,
our model uses the expression data in parallel with sequence evidence.
Since many miRNAs share the same targets, participate in the same pathways, or have
similar structures, we can compare miRNA-mRNA probabilistic models to decipher these
similarities. Using this model, we can incorporate all information pertinent to miRNA-gene
interactions (sequence, expression, and context determinants) to obtain a reliable
prediction.

Bibliography

[1] D.P. Bartel. MicroRNAs: target recognition and regulatory functions. Cell,

136(2):215–233, 2009.

[2] B. John, A.J. Enright, A. Aravin, T. Tuschl, C. Sander, D.S. Marks, et al. Human

microRNA targets. PLoS Biol, 2(11):e363, 2004.

[3] N. Rajewsky. microRNA target predictions in animals. Nature genetics, 38:S8–S13,

2006.

[4] B. Zhang, X. Pan, G.P. Cobb, and T.A. Anderson. microRNAs as oncogenes and

tumor suppressors. Developmental biology, 302(1):1–12, 2007.

[5] S.L. Ameres, J. Martinez, and R. Schroeder. Molecular basis for target RNA

recognition and cleavage by human RISC. Cell, 130(1):101–112, 2007.

[6] B.P. Lewis, I. Shih, et al. Prediction of mammalian microRNA targets. Cell,

115(7):787–798, 2003.

[7] A. Grimson, K.K.H. Farh, W.K. Johnston, P. Garrett-Engele, L.P. Lim, and D.P.

Bartel. MicroRNA targeting specificity in mammals: determinants beyond seed

pairing. Molecular cell, 27(1):91–105, 2007.

[8] D. Betel, A. Koppal, P. Agius, C. Sander, and C. Leslie. Comprehensive modeling

of microRNA targets predicts functional non-conserved and non-canonical sites.

Genome biology, 11(8):R90, 2010.

[9] Mohsen Khorshid, Jean Hausser, Mihaela Zavolan, and Erik van Nimwegen. A bio-

physical mirna-mrna interaction model infers canonical and noncanonical targets.

Nature Methods, 2013.

[10] Doron Betel, Manda Wilson, Aaron Gabow, Debora S Marks, and Chris Sander.

The microrna.org resource: targets and expression. Nucleic acids research,

36(suppl 1):D149–D153, 2008.

[11] M. Rehmsmeier, P. Steffen, M. Hochsmann, and R. Giegerich. Fast and effective

prediction of microRNA/target duplexes. Rna, 10(10):1507, 2004.

[12] R.C. Friedman, K.K.H. Farh, C.B. Burge, and D.P. Bartel. Most mammalian

mRNAs are conserved targets of microRNAs. Genome Research, 19(1):92, 2009.

[13] C.B. Nielsen, N. Shomron, R. Sandberg, E. Hornstein, J. Kitzman, and C.B. Burge.

Determinants of targeting by endogenous and exogenous microRNAs and siRNAs.

Rna, 13(11):1894, 2007.

[14] D. Gaidatzis, E. Van Nimwegen, J. Hausser, and M. Zavolan. Inference of miRNA

targets using evolutionary conservation and pathway analysis. BMC bioinformatics,

8(1):69, 2007.

[15] M. Kertesz, N. Iovino, U. Unnerstall, U. Gaul, and E. Segal. The role of site

accessibility in microRNA target recognition. Nature genetics, 39(10):1278–1284,

2007.

[16] H. Tafer, S.L. Ameres, G. Obernosterer, C.A. Gebeshuber, R. Schroeder, J. Mar-

tinez, and I.L. Hofacker. The impact of target site accessibility on the design of

effective siRNAs. Nature biotechnology, 26(5):578–583, 2008.

[17] W.H. Majoros and U. Ohler. Spatial preferences of microRNA targets in 3’ un-

translated regions. BMC genomics, 8(1):152, 2007.

[18] D.M. Garcia, D. Baek, C. Shin, G.W. Bell, A. Grimson, and D.P. Bartel. Weak

seed-pairing stability and high target-site abundance decrease the proficiency of lsy-

6 and other micrornas. Nature structural & molecular biology, 18(10):1139–1146,

2011.

[19] A. Arvey, E. Larsson, C. Sander, C.S. Leslie, and D.S. Marks. Target mrna abun-

dance dilutes microrna and sirna activity. Molecular systems biology, 6(1), 2010.

[20] W. Ritchie, S. Flamant, and J.E.J. Rasko. Predicting microRNA targets and func-

tions: traps for the unwary. Nature Methods, 6(6):397–398, 2009.

[21] C. Barbato, I. Arisi, M.E. Frizzo, R. Brandi, L. Da Sacco, and A. Masotti. Com-

putational challenges in mirna target predictions: to be or not to be a true target?

Journal of biomedicine and biotechnology, 2009, 2009.

[22] T. Saito and P. Sætrom. Micrornas–targeting and target prediction. New biotech-

nology, 27(3):243–249, 2010.

[23] M. Hammell. Computational methods to identify miRNA targets. In Seminars in

Cell & Developmental Biology. Elsevier, 2010.

[24] P. Alexiou, M. Maragkakis, G.L. Papadopoulos, M. Reczko, and A.G. Hatzigeor-

giou. Lost in translation: an assessment and perspective for computational mi-

crorna target identification. Bioinformatics, 25(23):3049–3055, 2009.

[25] H. Min and S. Yoon. Got target?: computational methods for microrna target

prediction and their extension. Experimental & molecular medicine, 42(4):233,

2010.

[26] S. Griffiths-Jones, R. J. Grocock, S. van Dongen, A. Bateman, and A.J. Enright.

miRBase: microRNA sequences, targets and gene nomenclature. NAR, 34:140–

144, 2006.

[27] S. Griffiths-Jones, H. K. Saini, S. van Dongen, and A. J. Enright. miRBase: tools

for microRNA genomics. Nucleic Acids Research, 36:154–158, 2008.

[28] S. Lall, D. Grun, A. Krek, K. Chen, Y.L. Wang, C.N. Dewey, P. Sood, T. Colombo,

N. Bray, P. MacMenamin, et al. A genome-wide map of conserved microRNA

targets in C. elegans. Current biology, 16(5):460–471, 2006.

[29] I. Ioshikhes, S. Roy, and C.K. Sen. Algorithms for mapping of mRNA targets for

microRNA. DNA and Cell Biology, 26(4):265–272, 2007.

[30] J. Hausser, P. Berninger, C. Rodak, Y. Jantscher, S. Wirth, and M. Zavolan. MirZ:

an integrated microRNA expression atlas and target prediction resource. Nucleic

Acids Research, 37(Web Server issue):W266, 2009.

[31] H. Guo, N.T. Ingolia, J.S. Weissman, and D.P. Bartel. Mammalian microRNAs

predominantly act to decrease target mRNA levels. Nature, 466(7308):835–840,

2010.

[32] S. Mukherji, M.S. Ebert, G.X.Y. Zheng, J.S. Tsang, P.A. Sharp, and A. van Oude-

naarden. Micrornas can generate thresholds in target gene expression. Nature

genetics, 43(9):854–859, 2011.

[33] L.P. Lim, N.C. Lau, P. Garrett-Engele, A. Grimson, J.M. Schelter, J. Castle, D.P.

Bartel, P.S. Linsley, and J.M. Johnson. Microarray analysis shows that some mi-

croRNAs downregulate large numbers of target mRNAs. Nature, 433(7027):769–

773, 2005.

[34] P. Sood, A. Krek, M. Zavolan, G. Macino, and N. Rajewsky. Cell-type-specific

signatures of microRNAs on target mRNA expression. Proceedings of the National

Academy of Sciences of the United States of America, 103(8):2746, 2006.

[35] W. Filipowicz, S.N. Bhattacharyya, and N. Sonenberg. Mechanisms of post-

transcriptional regulation by microRNAs: are the answers in sight? Nature Reviews

Genetics, 9(2):102–114, 2008.

[36] D. Baek, J. Villen, C. Shin, F.D. Camargo, S.P. Gygi, and D.P. Bartel. The impact

of microRNAs on protein output. Nature, 455(7209):64–71, 2008.

[37] M. Selbach, B. Schwanhausser, N. Thierfelder, Z. Fang, R. Khanin, and N. Ra-

jewsky. Widespread changes in protein synthesis induced by microRNAs. Nature,

455(7209):58–63, 2008.

[38] D.T. Humphreys, B.J. Westman, D.I.K. Martin, and T. Preiss. MicroRNAs control

translation initiation by inhibiting eukaryotic initiation factor 4E/cap and poly (A)

tail function. Proceedings of the National Academy of Sciences of the United States

of America, 102(47):16961, 2005.

[39] A.A. Khan, D. Betel, M.L. Miller, C. Sander, C.S. Leslie, and D.S. Marks. Transfec-

tion of small RNAs globally perturbs gene regulation by endogenous microRNAs.

Nature biotechnology, 27(6):549–555, 2009.

[40] J. Vivek, M. David, and Y. Yee. Identification of microrna-mrna modules using

microarray data. BMC Genomics, 12.

[41] B. Liu, L. Liu, A. Tsykin, G.J. Goodall, J.E. Green, M. Zhu, C.H. Kim, and J. Li.

Identifying functional mirna–mrna regulatory modules with correspondence latent

dirichlet allocation. Bioinformatics, 26(24):3105–3111, 2010.

[42] G. Sales, A. Coppe, A. Bisognin, M. Biasiolo, S. Bortoluzzi, and C. Romualdi.

Magia, a web-based tool for mirna and genes integrated analysis. Nucleic acids

research, 38(suppl 2):W352–W359, 2010.

[43] W. Yu-Ping and L. Kuo-Bin. Correlation of expression profiles between microRNAs

and mRNA targets using NCI-60 data. BMC Genomics, 10.

[44] V. Jayaswal, M. Lutherborrow, D.D.F. Ma, and Y.H. Yang. Identification of mi-

crornas with regulatory potential using a matched microrna-mrna time-course data.

Nucleic acids research, 37(8):e60–e60, 2009.

[45] Y. Ruike, A. Ichimura, S. Tsuchiya, K. Shimizu, R. Kunimoto, Y. Okuno, and

G. Tsujimoto. Global correlation analysis for micro-RNA and mRNA expression

profiles in human cell lines. Journal of human genetics, 53(6):515–523, 2008.

[46] X. Li, R. Gill, N.G.F. Cooper, J.K. Yoo, and S. Datta. Modeling microrna-mrna

interactions using pls regression in human colon cancer. BMC medical genomics,

4(1):44, 2011.

[47] A. Muniategui, R. Nogales-Cadenas, M. Vazquez, X.L. Aranguren, X. Agirre,

A. Luttun, F. Prosper, A. Pascual-Montano, and A. Rubio. Quantification of

mirna-mrna interactions. PloS one, 7(2):e30766, 2012.

[48] G.T. Huang, C. Athanassiou, and P.V. Benos. mirconnx: condition-specific mrna-

microrna network integrator. Nucleic acids research, 39(suppl 2):W416–W423,

2011.

[49] S. Nam, M. Li, K. Choi, C. Balch, S. Kim, and K.P. Nephew. Microrna and

mrna integrated analysis (mmia): a web tool for examining biological functions of

microrna expression. Nucleic acids research, 37(suppl 2):W356–W362, 2009.

[50] S. Wuchty, D. Arjona, A. Li, Y. Kotliarov, J. Walling, S. Ahn, A. Zhang, D. Maric,

R. Anolik, J.C. Zenklusen, et al. Prediction of associations between micrornas and

gene expression in glioma biology. PLoS One, 6(2):e14681, 2011.

[51] J. C. Huang, T. Babak, T. W. Corson, G. Chua, S. Khan, B. L. Gallie, T. R.

Hughes, B. J. Blencowe, B. J. Frey, and Q. D. Morris. Using expression profiling

data to identify human microRNA target. Nature Methods, 4:1045–1049, 2007.

[52] J. Huang, Q. Morris, and B. Frey. Detecting microRNA targets by linking sequence,

microRNA and gene expression data. In Research in Computational Molecular

Biology, pages 114–129. Springer, 2006.

[53] JC Huang, BJ Frey, and QD Morris. Comparing sequence and expression for. In

Pacific Symposium on Biocomputing, volume 13, pages 52–63, 2008.

[54] S. van Dongen, C. Abreu-Goodger, and A.J. Enright. Detecting microrna binding

and sirna off-target effects from expression data. Nature methods, 5(12):1023–1025,

2008.

[55] V. A. Gennarino et al. Identification of microrna-regulated gene networks by

expression analysis of target genes. Genome Research, 2012.

[56] V. A. Gennarino, M. Sardiello, R. Avellino, N. Meola, V. Maselli, S. Anand, L. Cu-

tillo, A. Ballabio, and S. Banfi. MicroRNA target prediction by expression analysis

of host genes. Genome Res, 19:481–490, Dec. 2008.

[57] W. Ritchie, S. Flamant, and J.E.J. Rasko. mimirna: a microrna expression pro-

filer and classification resource designed to identify functional correlations between

micrornas and their targets. Bioinformatics, 26(2):223–227, 2010.

[58] Je-Hyun Yoon, Kotb Abdelmohsen, and Myriam Gorospe. Post-transcriptional

gene regulation by long noncoding rna. Journal of molecular biology, 2012.

[59] Mitchell Guttman, Julie Donaghey, Bryce W Carey, Manuel Garber, Jennifer K
Grenier, Glen Munson, Geneva Young, Anne Bergstrom Lucas, Robert Ach, Laurakay
Bruhn, et al. lincrnas act in the circuitry controlling pluripotency and
differentiation. Nature, 477(7364):295–300, 2011.

[60] Margaret S Ebert and Phillip A Sharp. Microrna sponges: progress and possibili-

ties. Rna, 16(11):2043–2050, 2010.

[61] Huidong Wang, Anna Iacoangeli, Daisy Lin, Keith Williams, Robert B Denman,

Christopher UT Hellen, and Henri Tiedge. Dendritic bc1 rna in translational control

mechanisms. The Journal of cell biology, 171(5):811–821, 2005.

[62] Maite Huarte, Mitchell Guttman, David Feldser, Manuel Garber, Magdalena J

Koziol, Daniela Kenzelmann-Broz, Ahmad M Khalil, Or Zuk, Ido Amit, Michal

Rabani, et al. A large intergenic noncoding rna induced by p53 mediates global

gene repression in the p53 response. Cell, 142(3):409–419, 2010.

[63] A. Rodriguez, S. Griffiths-Jones, J.L. Ashurst, and A. Bradley. Identification

of mammalian microRNA host genes and transcription units. Genome research,

14(10a):1902, 2004.

[64] S. Baskerville and D. P. Bartel. Microarray profiling of microRNAs reveals frequent

coexpression with neighboring miRNAs and host genes. RNA, 11(3):241–247, 2005.

[65] J. Lu, G. Getz, E.A. Miska, E. Alvarez-Saavedra, J. Lamb, D. Peck, A. Sweet-

Cordero, B.L. Ebert, R.H. Mak, A.A. Ferrando, et al. MicroRNA expression profiles

classify human cancers. Nature, 435(7043):834–838, 2005.

[66] R. Bargaje, M. Hariharan, V. Scaria, and B. Pillai. Consensus miRNA expres-

sion profiles derived from interplatform normalization of microarray data. RNA,

16(1):16, 2010.

[67] Y. Liang, D. Ridzon, L. Wong, and C. Chen. Characterization of microRNA ex-

pression profiles in normal human tissues. BMC genomics, 8(1):166, 2007.

[68] P.E. Blower, J.S. Verducci, S. Lin, J. Zhou, J.H. Chung, et al. MicroRNA

expression profiles for the NCI-60 cancer cell panel. Molecular Cancer Therapeutics,

6(5):1483, 2007.

[69] D. Wang, M. Lu, J. Miao, T. Li, E. Wang, and Q. Cui. Cepred: predicting the

co-expression patterns of the human intronic microRNAs with their host genes.

PLoS One, 4(2), 2009.

[70] D. Ronchetti, M. Lionetti, L. Mosca, L. Agnelli, A. Andronache, S. Fabris, G.L.

Deliliers, and A. Neri. An integrative genomic approach reveals coordinated ex-

pression of intronic miR-335, miR-342, and miR-561 with deregulated host genes

in multiple myeloma. BMC Medical Genomics, 1(1):37, 2008.

[71] Y. K. Kim and V. N. Kim. Processing of intronic microRNAs. The EMBO Journal,

26:775–783, 2007.

[72] S.C. Li, P. Tang, and W.C. Lin. Intronic microRNA: discovery and biological

implications. DNA and Cell Biology, 26(4):195–207, 2007.

[73] J. Piriyapongsa, L. Mariño-Ramírez, and I.K. Jordan. Origin and evolution of

human micrornas from transposable elements. Genetics, 176(2):1323–1337, 2007.

[74] J. Khatun. An integrated encyclopedia of dna elements in the human genome.

Nature, 2012.

[75] H. Ishizu, H. Siomi, and M.C. Siomi. Biology of piwi-interacting rnas: new insights

into biogenesis and function inside and outside of germlines. Genes & Development,

26(21):2361–2373, 2012.

[76] R.C. Lee, R.L. Feinbaum, V. Ambros, et al. The c. elegans heterochronic gene lin-4

encodes small rnas with antisense complementarity to lin-14. Cell, 75(5):843–854,

1993.

[77] A. Esquela-Kerscher and F.J. Slack. Oncomirs micrornas with a role in cancer.

Nature Reviews Cancer, 6(4):259–269, 2006.

[78] K. Steffy, C. Allerson, and B. Bhat. Perspectives in microrna therapeutics. Phar-

maceutical Technology, 35:a18–s24, 2011.

[79] G.A. Calin and C.M. Croce. Microrna signatures in human cancers. Nature Reviews

Cancer, 6(11):857–866, 2006.

[80] S. Volinia, G.A. Calin, C.G. Liu, S. Ambs, A. Cimmino, F. Petrocca, R. Visone,

M. Iorio, C. Roldo, M. Ferracin, et al. A microrna expression signature of human

solid tumors defines cancer gene targets. Proceedings of the National Academy of

Sciences of the United States of America, 103(7):2257–2261, 2006.

[81] A.G. Uren, J. Kool, K. Matentzoglu, J. De Ridder, J. Mattison, M. Van Uitert,

W. Lagcher, D. Sie, E. Tanger, T. Cox, et al. Large-scale mutagenesis in p19ARF- and
p53-deficient mice identifies cancer genes and their collaborative networks. Cell,
133(4):727–741, 2008.

[82] C.M. Croce and G.A. Calin. mirnas, cancer, and stem cell division. Cell, 122(1):6–7,

2005.

[83] J.A. Chan, A.M. Krichevsky, and K.S. Kosik. Microrna-21 is an antiapoptotic

factor in human glioblastoma cells. Cancer research, 65(14):6029, 2005.

[84] S. Djebali, C.A. Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, A. Tanzer,

J. Lagarde, W. Lin, F. Schlesinger, et al. Landscape of transcription in human cells.

Nature, 489(7414):101–108, 2012.

[85] Inha Heo, Chirlmin Joo, Young-Kook Kim, Minju Ha, Mi-Jeong Yoon, Jun Cho,
Kyu-Hyeon Yeom, Jinju Han, and V Narry Kim. Tut4 in concert with lin28 suppresses
microrna biogenesis through pre-microrna uridylation. Cell, 138(4):696–708, 2009.

[86] G. Hutvagner and M.J. Simard. Argonaute proteins: key players in rna silencing.

Nature Reviews Molecular Cell Biology, 9(1):22–32, 2008.

[87] J.G. Ruby, C.H. Jan, and D.P. Bartel. Intronic microrna precursors that bypass

drosha processing. Nature, 448(7149):83–86, 2007.

[88] Natascha Bushati and Stephen M Cohen. microrna functions. Annu. Rev. Cell

Dev. Biol., 23:175–205, 2007.

[89] R. Parker and U. Sheth. P bodies and the control of mrna translation and degra-

dation. Molecular cell, 25(5):635–646, 2007.

[90] A. Eulalio, F. Tritschler, and E. Izaurralde. The gw182 protein family in animal

cells: new insights into domains required for mirna-mediated gene silencing. Rna,

15(8):1433–1442, 2009.

[91] Antonio J Giraldez, Yuichiro Mishima, Jason Rihel, Russell J Grocock, Stijn

Van Dongen, Kunio Inoue, Anton J Enright, and Alexander F Schier. Ze-

brafish mir-430 promotes deadenylation and clearance of maternal mrnas. science,

312(5770):75–79, 2006.

[92] A. Eulalio, E. Huntzinger, T. Nishihara, J. Rehwinkel, M. Fauser, and E. Izaurralde.

Deadenylation is a widespread effect of mirna regulation. Rna, 15(1):21–32, 2009.

[93] L. He and G.J. Hannon. Micrornas: small rnas with a big role in gene regulation.

Nature Reviews Genetics, 5(7):522–531, 2004.

[94] P.S. Linsley, J. Schelter, J. Burchard, M. Kibukawa, M.M. Martin, S.R. Bartz, J.M.

Johnson, J.M. Cummins, C.K. Raymond, H. Dai, et al. Transcripts targeted by

the microrna-16 family cooperatively regulate cell cycle progression. Molecular and

cellular biology, 27(6):2240–2252, 2007.

[95] T.C. Chang, E.A. Wentzel, O.A. Kent, K. Ramachandran, M. Mullendore, K.H.

Lee, G. Feldmann, M. Yamakuchi, M. Ferlito, C.J. Lowenstein, et al. Transactiva-

tion of mir-34a by p53 broadly influences gene expression and promotes apoptosis.

Molecular cell, 26(5):745, 2007.

[96] J. Krutzfeldt, N. Rajewsky, R. Braich, K.G. Rajeev, T. Tuschl, M. Manoharan, and

M. Stoffel. Silencing of micrornas in vivo with antagomirs. Nature, 438(7068):685–

689, 2005.

[97] B. Gentner, G. Schira, A. Giustacchini, M. Amendola, B.D. Brown, M. Ponzoni,

and L. Naldini. Stable knockdown of microrna in vivo by lentiviral vectors. Nature

methods, 6(1):63–66, 2008.

[98] M.S. Ebert, J.R. Neilson, and P.A. Sharp. Microrna sponges: competitive inhibitors

of small rnas in mammalian cells. Nature methods, 4(9):721–726, 2007.

[99] Y.G. Li, P.P. Zhang, K.L. Jiao, and Y.Z. Zou. Knockdown of microrna-181 by

lentivirus mediated sirna expression vector decreases the arrhythmogenic effect of

skeletal myoblast transplantation in rat with myocardial infarction. Microvascular

research, 78(3):393–404, 2009.

[100] S.W. Chi, J.B. Zang, A. Mele, and R.B. Darnell. Argonaute hits-clip decodes

microrna–mrna interaction maps. Nature, 460(7254):479–486, 2009.

[101] M. Hafner, M. Landthaler, L. Burger, M. Khorshid, J. Hausser, P. Berninger,

A. Rothballer, M. Ascano Jr, A.C. Jungkamp, M. Munschauer, et al.

Transcriptome-wide identification of rna-binding protein and microrna target sites

by par-clip. Cell, 141(1):129–141, 2010.

[102] F.E. Nicolas. Experimental validation of microrna targets using a luciferase reporter

system. Methods Mol Biol, 732:139–52, 2011.

[103] W. Van Leeuwen, M.J.M. Hagendoorn, T. Ruttink, R. Van Poecke, L.H.W. Van

Der Plas, and A.R. Van Der Krol. The use of the luciferase reporter system for in

planta gene expression studies. Plant Molecular Biology Reporter, 18(2):143–144,

2000.

[104] Ellen Siebring-van Olst, Christie Vermeulen, Renee X de Menezes, Michael How-

ell, Egbert F Smit, and Victor W van Beusechem. Affordable luciferase reporter

assay for cell-based high-throughput screening. Journal of biomolecular screening,

18(4):453–461, 2013.

[105] A. Kozomara and S. Griffiths-Jones. mirbase integrating microrna annotation and

deep-sequencing data. Nucleic acids research, 39(suppl 1):D152–D157, 2011.

[106] A. Krek, D. Grun, M.N. Poy, R. Wolf, L. Rosenberg, E.J. Epstein, P. MacMenamin,

I. da Piedade, K.C. Gunsalus, M. Stoffel, et al. Combinatorial microRNA target

predictions. Nature genetics, 37(5):495–500, 2005.

[107] N. Rajewsky, M. Vergassola, U. Gaul, and E.D. Siggia. Computational detection of

genomic cis-regulatory modules applied to body patterning in the early drosophila

embryo. BMC bioinformatics, 3(1):30, 2002.

[108] A.J. Enright, B. John, U. Gaul, T. Tuschl, C. Sander, D.S. Marks, et al. Microrna

targets in drosophila. Genome biology, 5(1):1–1, 2004.

[109] J. C. Huang, Q. D. Morris, and B. J. Frey. Bayesian inference of microRNA targets

from sequence and expression data. Journal of Computational Biology, 14:550–563,

2007.

[110] M.H. Radfar, W. Wong, and Q. Morris. Computational prediction of intronic

microrna targets using host gene expression reveals novel regulatory mechanisms.

PLoS One, 6(6):e19312, 2011.

[111] A.M. Monteys, R.M. Spengler, J. Wan, L. Tecedor, K.A. Lennox, Y. Xing, and B.L.

Davidson. Structure and activity of putative intronic miRNA promoters. RNA,

16(3):495, 2010.

[112] F. Ozsolak, L.L. Poling, Z. Wang, H. Liu, X.S. Liu, R.G. Roeder, X. Zhang, J.S.

Song, and D.E. Fisher. Chromatin structure analyses identify miRNA promoters.

Genes & development, 22(22):3172, 2008.

[113] N.J. Martinez, M.C. Ow, J.S. Reece-Hoyes, M.I. Barrasa, V.R. Ambros, and A.J.M.

Walhout. Genome-scale spatiotemporal analysis of Caenorhabditis elegans mi-

croRNA promoter activity. Genome research, 18(12):2005, 2008.

[114] X. Wang, Z. Xuan, X. Zhao, Y. Li, and M.Q. Zhang. High-resolution human

core-promoter prediction with CoreBoost HM. Genome research, 19(2):266, 2009.

[115] D. Golan, C. Levy, B. Friedman, and N. Shomron. Biased hosting of intronic

microRNA genes. Bioinformatics, 26(8):992, 2010.

[116] J. Ernst, H.L. Plasterer, I. Simon, and Z. Bar-Joseph. Integrating multiple evidence

sources to predict transcription factor binding in the human genome. Genome

research, 20(4):526, 2010.

[117] D.L. Corcoran, K.V. Pandit, B. Gordon, A. Bhattacharjee, N. Kaminski, and P.V.

Benos. Features of mammalian microRNA promoters emerge from polymerase II

chromatin immunoprecipitation data. PLoS One, 4(4):5279, 2009.

[118] X. Zhou, J. Ruan, G. Wang, and W. Zhang. Characterization and identification

of microRNA core promoters in four model species. PLoS Comput Biol, 3(3):e37,

2007.

[119] Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological

sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge

university press, 1998.

[120] ME Peter. Targeting of mRNAs by multiple miRNAs: the next step. Oncogene,

29(15):2161–2164, 2010.

[121] S. Vasudevan, Y. Tong, and J.A. Steitz. Switching from repression to activation:

microRNAs can up-regulate translation. Science, 318(5858):1931, 2007.

[122] S. Vasudevan and J.A. Steitz. AU-rich-element-mediated upregulation of transla-

tion by FXR1 and Argonaute 2. Cell, 128(6):1105–1118, 2007.

[123] K.D. Swisher and R. Parker. Localization to, and Effects of Pbp1, Pbp4, Lsm12,

Dhh1, and Pab1 on Stress Granules in Saccharomyces cerevisiae. 2010.

[124] R. Lowry. Concepts and applications of inferential statistics. VassarStats: Web

Site for Statistical Computation, 2005.

[125] B. John, C. Sander, D.S. Marks, et al. Prediction of human microRNA targets.

METHODS IN MOLECULAR BIOLOGY-CLIFTON THEN TOTOWA-, 342:101,

2006.

[126] P. Landgraf, M. Rusu, R. Sheridan, A. Sewer, N. Iovino, A. Aravin, S. Pfeffer,

A. Rice, A.O. Kamphorst, M. Landthaler, et al. A mammalian microRNA expres-

sion atlas based on small RNA library sequencing. Cell, 129(7):1401–1414, 2007.

[127] M. H. Radfar, W. Wong, and Q. Morris. BayMiR: inferring evidence for endogenous
mirna-induced gene repression from mrna expression profiles. BMC Genomics,
21(23):3135–3148, 2013.

[128] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized

linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.

[129] G.L. Papadopoulos, M. Reczko, V.A. Simossis, P. Sethupathy, and A.G. Hatzige-

orgiou. The database of experimentally supported targets: a functional update of

tarbase. Nucleic acids research, 37(suppl 1):D155–D158, 2009.

[130] I. Ulitsky, L.C. Laurent, and R. Shamir. Towards computational prediction of

microrna function and activity. Nucleic acids research, 38(15):e160–e160, 2010.

[131] R. C. Friedman et al. Most mammalian mRNAs are conserved targets of mi-

croRNAs. Genome Res., 19:92–105, 2009.

[132] Yuji Funakoshi, Yusuke Doi, Nao Hosoda, Naoyuki Uchida, Masanori Osawa, Ichio

Shimada, Masafumi Tsujimoto, Tsutomu Suzuki, Toshiaki Katada, and Shin-ichi

Hoshino. Mechanism of mrna deadenylation: evidence for a molecular interplay

between translation termination factor erf3 and mrna deadenylases. Genes & de-

velopment, 21(23):3135–3148, 2007.

[133] J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise coordinate opti-

mization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[134] M. Lukk, M. Kapushesky, J. Nikkila, H. Parkinson, A. Goncalves, W. Huber,

E. Ukkonen, and A. Brazma. A global map of human gene expression. Nature

biotechnology, 28(4):322–324, 2010.

[135] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical

and powerful approach to multiple testing. Journal of the Royal Statistical Society.

Series B (Methodological), pages 289–300, 1995.

[136] C. Cheng and L.M. Li. Inferring microrna activities by combining gene expression

with microrna target prediction. PLoS One, 3(4):e1989, 2008.

[137] C. Cheng, X. Fu, P. Alves, M. Gerstein, et al. mrna expression profiles show

differential regulatory effects of micrornas between estrogen receptor-positive and

estrogen receptor-negative breast cancer. Genome Biol, 10(9):R90, 2009.

[138] Z. Liang, H. Zhou, Z. He, H. Zheng, and J. Wu. miract: a web tool for evaluating

microrna activity based on gene expression data. Nucleic acids research, 39(suppl

2):W139–W144, 2011.

[139] P. Alexiou, M. Maragkakis, G.L. Papadopoulos, V.A. Simmosis, L. Zhang, and

A.G. Hatzigeorgiou. The diana-mirextra web server: from gene expression data to

microrna function. PLoS One, 5(2):e9171, 2010.

[140] K. Le Brigand, K. Robbe-Sermesant, B. Mari, and P. Barbry. Mirontop: min-

ing micrornas targets across large scale gene expression studies. Bioinformatics,

26(24):3131–3132, 2010.

[141] S. Volinia, R. Visone, M. Galasso, E. Rossi, and C.M. Croce. Identification of

microrna activity by targets’ reverse expression. Bioinformatics, 26(1):91–97, 2010.

[142] A. Arora and D.A.C. Simpson. Individual mrna expression profiles reveal the effects

of specific micrornas. Genome biology, 9(5):R82, 2008.

[143] Z. Yu, Z. Jian, S.H. Shen, E. Purisima, and E. Wang. Global analysis of mi-

crorna target gene expression reveals that mirna targets are lower expressed in

mature mouse and drosophila tissues than in the embryos. Nucleic acids research,

35(1):152–164, 2007.

[144] Z. Liang, H. Zhou, H. Zheng, and J. Wu. Expression levels of micrornas are not
associated with their regulatory activities. Biology direct, 6(1):1–4, 2011.

[145] V. Jayaswal, M. Lutherborrow, and Y.H. Yang. Measures of association for iden-

tifying microrna-mrna pairs of biological interest. PloS one, 7(1):e29612, 2012.

[146] J. Lu and A.G. Clark. Impact of microrna regulation on variation in human gene

expression. Genome Research, 22(7):1243–1254, 2012.

[147] Thomas Derrien, Rory Johnson, Giovanni Bussotti, Andrea Tanzer, Sarah Dje-

bali, Hagen Tilgner, Gregory Guernec, David Martin, Angelika Merkel, David G

Knowles, et al. The gencode v7 catalog of human long noncoding rnas: Analysis of

their gene structure, evolution, and expression. Genome research, 22(9):1775–1789,

2012.

[148] Qing-Fei Yin, Li Yang, Yang Zhang, Jian-Feng Xiang, Yue-Wei Wu, Gordon G

Carmichael, and Ling-Ling Chen. Long noncoding rnas with snorna ends. Molecular

Cell, 2012.

[149] Mitchell Guttman, Ido Amit, Manuel Garber, Courtney French, Michael F Lin,

David Feldser, Maite Huarte, Or Zuk, Bryce W Carey, John P Cassady, Moran N

Cabili, et al. Chromatin signature reveals over a thousand highly conserved large

non-coding rnas in mammals. Nature, 458(7235):223–227, 2009.

[150] Chenguang Gong and Lynne E Maquat. lncrnas transactivate stau1-mediated mrna

decay by duplexing with 3' UTRs via Alu elements. Nature, 470(7333):284–288,

2011.

[151] Liran Juan, Guohua Wang, Milan Radovich, Bryan P Schneider, Susan E Clare,

Yadong Wang, and Yunlong Liu. Potential roles of micrornas in regulating long

intergenic noncoding rnas. BMC medical genomics, 6(Suppl 1):S7, 2013.

[152] Ashwini Jeggari, Debora S Marks, and Erik Larsson. mircode: a map of puta-

tive microrna target sites in the long non-coding transcriptome. Bioinformatics,

28(15):2062–2063, 2012.

[153] Jiangchao Li, Xiaodong Li, Yan Li, Hong Yang, Lijing Wang, Yanru Qin, Haibo

Liu, Li Fu, and Xin-Yuan Guan. Cell-specific detection of mir-375 downregulation

for predicting the prognosis of esophageal squamous cell carcinoma by mirna in

situ hybridization. PloS one, 8(1):e53582, 2013.

[154] M Nielsen, JH Hansen, J Hedegaard, RO Nielsen, F Panitz, C Bendixen, and

B Thomsen. Microrna identity and abundance in porcine skeletal muscles deter-

mined by deep sequencing. Animal genetics, 41(2):159–168, 2010.

[155] Federica Collino, Maria Chiara Deregibus, Stefania Bruno, Luca Sterpone, Giulia

Aghemo, Laura Viltono, Ciro Tetta, and Giovanni Camussi. Microvesicles derived

from adult human bone marrow and tissue specific mesenchymal stem cells shuttle

selected pattern of mirnas. PLoS One, 5(7):e11803, 2010.

[156] Ruotian Li, Guijun Yan, Qiaoling Li, Haixiang Sun, Yali Hu, Jianxin Sun, and

Biao Xu. Microrna-145 protects cardiomyocytes against hydrogen peroxide (h2o2)-

induced apoptosis through targeting the mitochondria apoptotic pathway. PloS

one, 7(9):e44907, 2012.

[157] Carla P Concepcion, Yoon-Chi Han, Ping Mu, Ciro Bonetti, Evelyn Yao, Aleco

D’Andrea, Joana A Vidigal, William P Maughan, Paul Ogrodowski, and Andrea

Ventura. Intact p53-dependent responses in mir-34–deficient mice. PLoS Genetics,

8(7):e1002797, 2012.

[158] Li Wang, Xin Chen, Yanyan Zheng, Fen Li, Zheng Lu, Chen Chen, Jin Liu,

Yu Wang, Yajing Peng, Zhongliang Shen, et al. Mir-23a inhibits myogenic differen-

tiation through down regulation of fast myosin heavy chain isoforms. Experimental

Cell Research, 2012.

[159] Mariana Lagos-Quintana, Reinhard Rauhut, Abdullah Yalcin, Jutta Meyer,
Winfried Lendeckel, and Thomas Tuschl. Identification of tissue-specific micrornas
from mouse. Current Biology, 12(9):735–739, 2002.

[160] Juanjie Bo, Guoliang Yang, Kailing Huo, Haifeng Jiang, Lianhua Zhang, Dongming

Liu, and Yiran Huang. microrna-203 suppresses bladder cancer development by

repressing bcl-w expression. Febs Journal, 278(5):786–792, 2011.

[161] Shui-Long Guo, Zheng Peng, Xue Yang, Kai-Ji Fan, Hui Ye, Zhen-Hua Li, Yan

Wang, Xiao-Li Xu, Jun Li, You-Liang Wang, et al. mir-148a promoted cell prolif-

eration by targeting p27 in gastric cancer cells. International journal of biological

sciences, 7(5):567, 2011.

[162] Lawrence S Hon, Zemin Zhang, et al. The roles of binding site arrangement and

combinatorial targeting in microrna repression of gene expression. Genome Biol,

8(8):R166, 2007.

[163] Li Huang, Junhua Luo, Qingqing Cai, Qiuhui Pan, Hong Zeng, Zhenghui Guo, Wen

Dong, Jian Huang, and Tianxin Lin. Microrna-125b suppresses the development

of bladder cancer by targeting e2f3. International Journal of Cancer, 128(8):1758–

1769, 2011.

[164] Marcella Cesana, Davide Cacchiarelli, Ivano Legnini, Tiziana Santini, Olga

Sthandier, Mauro Chinappi, Anna Tramontano, and Irene Bozzoni. A long noncod-

ing rna controls muscle differentiation by functioning as a competing endogenous

rna. Cell, 147(2):358–369, 2011.

[165] Jinong Feng, Guihua Sun, Jin Yan, Katie Noltner, Wenyan Li, Carolyn H Buzin,

Jeff Longmate, Leonard L Heston, John Rossi, and Steve S Sommer. Evidence

for x-chromosomal schizophrenia associated with microrna alterations. PLoS One,

4(7):e6121, 2009.

[166] Iris Pinheiro, Lien Dejager, and Claude Libert. X-chromosome-located micrornas in

immunity: Might they explain male/female differences? Bioessays, 33(11):791–802,

2011.

[167] Ian Dunham, Ewan Birney, Bryan R Lajoie, Amartya Sanyal, Xianjun Dong,

Melissa Greven, Xinying Lin, Jie Wang, Troy W Whitfield, Jiali Zhuang, et al.

An integrated encyclopedia of dna elements in the human genome. 2012.
