Short title: Long non-coding RNA landscape of soybean genome

Corresponding authors: Prem L. Bhalla

Email: [email protected]

Tel. +61 03 8344 9651

Full title: The long intergenic non-coding RNA (lincRNA) landscape of the soybean genome

Authors: Agnieszka A. Golicz1, Mohan B. Singh1, Prem L. Bhalla1

1Plant Molecular Biology and Biotechnology Laboratory, Faculty of Veterinary and Agricultural Sciences, University of Melbourne, Parkville, Melbourne, VIC, Australia.

One sentence summary: The soybean genome encodes over 6,000 long intergenic non-coding RNAs implicated in many biological processes including transcription, development and possibly influencing agronomic traits.

Author contributions: AAG: Designed the experiments, performed the analysis, wrote the manuscript; MBS: Conceived research, wrote the manuscript; PLB: Conceived research, wrote the manuscript

Funding: Financial support for this work was obtained from the ARC Discovery Grant ARC DP0988972

Plant Physiology Preview. Published on December 28, 2017, as DOI:10.1104/pp.17.01657

Copyright 2017 by the American Society of Plant Biologists

www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

Abstract

Long intergenic non-coding RNAs (lincRNAs) are emerging as important regulators of diverse biological processes. However, our understanding of lincRNA abundance and function remains very limited, especially for agriculturally important plants. Soybean is a major legume crop plant providing over half of global oilseed production. Moreover, soybean can form symbiotic relationships with Rhizobium bacteria to fix atmospheric nitrogen. Soybean has a complex paleopolyploid genome and exhibits many vegetative and floral development complexities. Soybean cultivars have photoperiod requirements restricting their use and productivity. Molecular regulators of these legume-specific developmental processes remain enigmatic. Long non-coding RNAs may play important regulatory roles in soybean growth and development. In this study, over one billion RNA-Seq read pairs from 37 samples representing nine tissues were used to discover 6,018 lincRNA loci. The lincRNAs were shorter than protein-coding transcripts, had lower expression levels and more sample-specific expression. Few of the loci were found to be conserved in two other legume species (chickpea and Medicago), but almost two hundred homeologous lincRNAs in the soybean genome were detected. Protein-coding gene-lincRNA co-expression analysis suggested an involvement of lincRNAs in stress response, signal transduction and developmental processes. Positional analysis of lincRNA loci implicated involvement in transcriptional regulation. lincRNA expression from centromeric regions was observed, especially in actively dividing tissues, suggesting possible roles in cell division. Integration of publicly available genome-wide association data with the lincRNA map of the soybean genome uncovered 23 lincRNAs potentially associated with agronomic traits.

Introduction

Recently, it has been elucidated that eukaryotic genomes, including plant genomes, encode a multitude of non-coding RNAs (ncRNAs) (Chekanova et al., 2007; Kapranov et al., 2007). One class of ncRNAs are long non-coding RNAs (lncRNAs), which are defined as transcripts >200 bp in length and harbouring no discernible coding potential (Jin et al., 2013; Wang et al., 2014; Chekanova, 2015). The location of lncRNAs relative to protein-coding genes identifies a further subgroup known as long intergenic non-coding RNAs (lincRNAs), which do not overlap protein-coding genes. LncRNAs were long considered little beyond transcriptional noise; however, current evidence points to important roles in diverse biological processes across eukaryotes (Van Werven et al., 2012; Ulitsky and Bartel, 2013; Flynn and Chang, 2014). In Arabidopsis and rice, lncRNAs have been shown to be involved in flowering time regulation, reproduction and root organogenesis (Swiezewski et al., 2009; Cifuentes-Rojas et al., 2011; Heo and Sung, 2011; Ariel et al., 2014; Bardou et al., 2014; Matzke et al., 2014; Wang et al., 2014; Zhang et al., 2014; Berry and Dean, 2015; Khemka et al., 2016). LncRNAs are found both in the nucleus and cytoplasm, which suggests a diversity of modes of action, including chromatin modification (Heo et al., 2013), acting as decoys preventing access of regulatory proteins, including splicing machinery, and miRNAs to their true RNA and DNA targets (Franco-Zorrilla et al., 2007; Wu et al., 2013; Bardou et al., 2014) and acting as scaffolds for assembly of larger protein-RNA complexes (Lai et al., 2013; Pefanis et al., 2015). Recently, a large number of lncRNAs have been found to be associated with ribosomes and co-expressed with ribosomal proteins, although not translated, which suggests possible roles in translation regulation (Carlevaro-Fita et al., 2016). Despite increasing efforts (Jin et al., 2013; Szcześniak et al., 2016), plant-specific lncRNA databases are scarce, and genome-wide lncRNA discovery and especially functional annotation in agriculturally important plant species remain unavailable.

Legumes are a large family of plant species characterized by butterfly-like flowers and pod-shaped fruits. They provide an invaluable contribution to ecosystems due to their ability to form symbiotic relationships with Rhizobium bacteria. This symbiosis results in dinitrogen capture from the air and its subsequent fixation, making legumes one of the major sources of bioavailable nitrogen. Legume seeds are second only to cereals as a source of human and animal food and include soybeans (Glycine max), peanuts (Arachis hypogaea), garden peas (Pisum sativum), and broad beans (Vicia faba). Additionally, soybean is responsible for over half of global oilseed production. Due to its economic importance as a source of food and oils, soybean has increasingly become a target of genomic and transcriptomic research efforts. Sequencing of the soybean genome revealed its complex paleopolyploid structure (Schmutz et al., 2010). Although comparisons between soybean and the model plant species Arabidopsis thaliana can be drawn, the two species are suggested to have diverged from a common ancestor 92 million years ago (Zhu et al., 2003) and soybean has undergone at least two genome duplication events resulting in homeologous relationships between chromosomes and gene loci (Shoemaker et al., 2006). One of the most interesting questions from the genomics point of view is ‘Which genomic features of soybean define its characteristics and are responsible for its vegetative and floral complexities?’ Considering that many of the key developmental control genes in soybean exist in multiple copies, a complex interplay and additional control for ‘fine-tuning’ are expected. Our recent awareness of the prevalence and importance of long non-coding RNAs highlights that these may play important regulatory roles in soybean growth and development. lncRNAs could provide the additional level of control and signal integration, which is missing when only protein-coding genes are considered.

This study presents the first genome-wide discovery, characterization and functional annotation of long intergenic non-coding RNAs in the soybean genome. Genome-wide lincRNA discovery was performed using a combination of de novo and reference-guided assembly approaches, generating a comprehensive lincRNA database. Comparative analysis between soybean lincRNAs and those of other legume species was performed to identify lincRNAs which could play universal roles in all legumes and lincRNAs which are soybean-specific. Functional analysis was conducted to uncover biological processes that could be influenced by lincRNA action. Finally, publicly available genome-wide association data were used to further characterize the lincRNAs discovered and find potential links to agronomic traits.

Results and discussion

Genome-wide discovery of 6,018 long non-coding intergenic loci

lincRNAs are a class of RNA molecules which are >200 bp long and have no discernible coding potential. High-throughput technologies offer an opportunity for both coding and noncoding transcript detection and quantification. In total, 1,025,323,161 read pairs from 37 soybean samples were used in the analysis. The sampled soybean tissues included 28 samples representing stem (germination and trefoil stage), flower (flower bud, unopened flower, florescence and five days after flowering), leaf bud (germination, trefoil and differentiation stage), leaf (trefoil, flower bud differentiation stage and senescent leaves), pod (three, four and five weeks), seed (three, five, six, eight and 10 weeks), seed and pod (two, three and four weeks), shoot meristem (flower bud differentiation stage), cotyledon (germination and trefoil stage) and root (Shen et al., 2014). Additionally, nine samples (four from leaf tissue and five from shoot apical meristem (SAM) tissue) representing time points during the floral transition period following short-day treatment (Wong et al., 2013) were used. Both de novo and reference-guided transcriptome assembly strategies were applied. StringTie reference-guided assembly resulted in 68,190 loci and 160,337 transcripts. Trinity assemblies rendered 448,338 transcripts using the de novo approach and 337,955 transcripts using the reference-guided approach. The PASA comprehensive transcript database built using StringTie and Trinity assemblies comprised 147,825 loci and 293,537 transcripts. Both StringTie and PASA annotations were subjected to the lincRNA discovery pipeline, and PASA-derived lincRNAs which did not appear in the StringTie annotation were used to supplement StringTie-derived lincRNAs (Fig. S1). Loci were considered to encode lincRNAs if they did not produce any protein-coding transcripts (ORF size ≤ 100 amino acids and no similarity to protein-coding genes) and did not overlap any protein-coding loci. The lincRNAs were filtered to remove loci producing transcripts with similarity to tRNAs, rRNAs and snoRNAs found in the Rfam database (58 loci), transcripts which were nested (entirely contained) within other lincRNAs (63 loci), and transcripts which overlapped protein-coding genes in the Gmax_275_v2.0 genome annotation (126 loci).

lincRNAs are known to be expressed at low levels (Li et al., 2014; Zhang et al., 2014; Hao et al., 2015). Choosing an expression cut-off requires balancing a trade-off between retaining the largest possible set of lincRNAs and discarding spurious transcription and mapping artefacts. Two lincRNA sets were generated: a larger set (9,766 loci) with a permissive cut-off of >0.1 FPKM in at least one of the samples (Table S2), and a filtered set generated using a more stringent cut-off (≥1.0 FPKM in at least one of the samples, or ≥0.5 FPKM in at least two samples, or gene size of at least 1000 bp). The filtered lincRNA set consisted of 6,018 lincRNA loci (6,134 transcripts), including 3,435 StringTie-derived and 2,583 PASA-derived loci (Table S3). The full set is provided for the benefit of the readers, but only the filtered lincRNA set was used in the analysis.
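The two expression cut-offs described above can be written as a short Python sketch. This is an illustration only, not the pipeline used in the study: the function names and example values are hypothetical, and it assumes that the stringent rule is applied to loci that already pass the permissive cut-off, which the text does not state explicitly.

```python
from typing import Sequence


def passes_permissive(fpkm: Sequence[float]) -> bool:
    """Permissive set: >0.1 FPKM in at least one sample."""
    return any(v > 0.1 for v in fpkm)


def passes_stringent(fpkm: Sequence[float], gene_size_bp: int) -> bool:
    """Filtered set: >=1.0 FPKM in at least one sample, OR >=0.5 FPKM in
    at least two samples, OR gene size of at least 1000 bp."""
    if any(v >= 1.0 for v in fpkm):
        return True
    if sum(1 for v in fpkm if v >= 0.5) >= 2:
        return True
    return gene_size_bp >= 1000


def in_filtered_set(fpkm: Sequence[float], gene_size_bp: int) -> bool:
    # Assumption: the stringent rule is layered on top of the permissive one.
    return passes_permissive(fpkm) and passes_stringent(fpkm, gene_size_bp)


# Hypothetical loci: (per-sample FPKM values, gene size in bp)
low_but_long = ([0.2, 0.0, 0.0], 1500)
two_moderate = ([0.6, 0.5, 0.0], 300)
weak_short = ([0.2, 0.3, 0.0], 300)
```

Under these assumptions, `low_but_long` and `two_moderate` would be retained in the filtered set, while `weak_short` would remain only in the permissive set.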

lincRNAs have distinct properties when compared to protein-coding genes

The lincRNA and protein-coding loci were examined for main gene characteristics. The lincRNA transcripts were on average shorter than protein-coding transcripts (Fig. 1A). The median length of lincRNA transcripts was 320 bp (mean: 467.3 bp), whereas the median length of protein-coding transcripts was 3,657 bp (mean: 4,450 bp). The lincRNA transcripts contained a lower number of exons than protein-coding transcripts (Fig. 1C). The majority of lincRNA transcripts (90.3%) contained a single exon. The maximum number of exons found in a lincRNA transcript was 4. The lincRNA genes had a lower number of isoforms compared to protein-coding genes (Fig. 1B). A vast majority of lincRNA genes (98.5%) had a single isoform. Finally, lincRNAs showed lower overall expression levels compared to coding genes (Fig. 1D). The observations are consistent with lincRNA studies in other plant species. lincRNAs in rice, cucumber, and chickpea were reported to be shorter than protein-coding genes (Zhang et al., 2014; Hao et al., 2015; Khemka et al., 2016). lincRNAs in cucumber, maize, and chickpea were reported to have predominantly one exon only (Li et al., 2014; Hao et al., 2015; Khemka et al., 2016). Also, low expression levels of lincRNAs were observed in Arabidopsis, rice and maize (Liu et al., 2012; Li et al., 2014; Zhang et al., 2014; Hao et al., 2015). Although usually lacking sequence homology (Hao et al., 2015; Mohammadin et al., 2015; Wang et al., 2015) lincRNAs appear to share similar characteristics across different species which include short length, a low number of exons and splice variants.

Centromeric regions of soybean chromosomes show lincRNA expression

The distribution of lincRNAs across chromosomes can provide clues regarding possible functions and mechanisms of action. For example, lincRNAs located among protein-coding genes could modulate expression of their neighbours, while lincRNAs found close to centromeres or in gene deserts may act distally or have additional roles. Centromeric regions of soybean chromosomes are enriched in transposable elements (TEs) and depleted in protein-coding loci (Schmutz et al., 2010). In contrast, lincRNA loci display an even distribution across chromosomes (Fig. 1E), with active transcription from centromeric regions. lincRNAs transcribed from centromeric regions have been implicated in centromere maintenance and cellular division (Rošić and Erhardt, 2016). In total, 32 centromeric lincRNAs (defined by transcription from regions delimited by GmCent-1 and GmCent-2 repeats) were identified on chromosomes 1, 3, 5, 7, 13, 16, 17 and 19. The number of lincRNAs identified was weakly positively correlated with the identified centromere size (rho=0.25). No centromeric lincRNAs were expressed in all the samples, and the median number of samples showing centromeric lincRNA expression was seven. The median expression value in samples which expressed centromeric lincRNA (FPKM > 0.1) was 0.31 FPKM. Centromeric lincRNAs showed higher transcriptional activity in actively dividing tissues (flower bud, leaf bud and shoot apical meristem; Mann-Whitney U test, p-value < 0.01) (Fig. 2B). The most common transposable element type found within centromeric lincRNAs was the LTR Gypsy retrotransposon (Fig. 2C), which is consistent with the high prevalence of Gypsy transposable elements in the vicinity of centromeres (Schmutz et al., 2010).

Although centromeric lincRNA expression was observed, similarly to rice and maize (Wang et al., 2015), the majority of lincRNAs were found relatively close to neighbouring protein-coding genes. The median distance from lincRNA to protein-coding gene was 1,064 bp (mean distance: 3,497 bp). LincRNAs found a short distance from protein-coding genes could modulate their expression by actively recruiting activators, repressors, epigenetic modifiers or simply by transcription from the lincRNA locus (Wang and Chang, 2011; Kornienko et al., 2013).

Nearly a fifth of lincRNA transcripts have sequence similarity to transposable elements

The relatively high abundance of lincRNAs proximal to centromeres sparked an investigation of the contribution of transposable elements to lincRNA transcript composition. In total, 18.3% of lincRNA transcripts were predicted to harbor TEs, a higher proportion than for coding transcripts (10.8%). Among transcripts which harbored TEs, TEs contributed a larger amount of sequence to lincRNAs (median lincRNA coverage by TEs: 100%, mean: 82.8%) than to protein-coding transcripts (median coding transcript coverage by TEs: 19%, mean: 36.5%). A similar pattern was observed in the human genome, where two thirds of mature noncoding transcripts showed similarity to TEs and TEs were found to contribute signals essential for biogenesis of many lncRNAs (Kapusta et al., 2013). The lincRNAs were found to harbor more retrotransposons than DNA transposons (Fig. 2A), which reflects the overall TE landscape of the soybean genome (Du et al., 2010; Schmutz et al., 2010).
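The per-transcript TE coverage reported above can be illustrated with a minimal Python sketch. It assumes TE hits are given as 0-based, half-open intervals on transcript coordinates and that overlapping hits are merged before counting; the function name and the example intervals are hypothetical and are not taken from the study.

```python
from typing import Iterable, Tuple


def te_coverage(transcript_len: int, te_hits: Iterable[Tuple[int, int]]) -> float:
    """Fraction of a transcript covered by the union of TE hit intervals.

    Intervals are (start, end), 0-based and half-open, in transcript
    coordinates. Overlapping or adjacent hits are merged so that no base
    is counted twice.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(te_hits):
        # Clip hits to the transcript boundaries.
        start, end = max(start, 0), min(end, transcript_len)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / transcript_len


# Hypothetical examples
fully_te_derived = te_coverage(320, [(0, 320)])
partly_covered = te_coverage(100, [(0, 50), (40, 80)])
```

A median coverage of 100% for TE-containing lincRNAs corresponds to cases like `fully_te_derived`, where the whole transcript lies within TE-derived sequence.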

Soybean lincRNAs have low levels of sequence and positional conservation in chickpea and Medicago

Information about conservation of lincRNAs across species can provide further insight into their possible functions and the processes they are involved in. If a lincRNA is well conserved in a number of species, it can be assumed to play a generally important role. Conversely, if a lincRNA is species-specific, it may play a role unique to a given organism or provide a modulatory function which alters the otherwise conserved system. It has been noted that the sequence conservation of lincRNAs is much lower than that of protein-coding genes (Hao et al., 2015; Mohammadin et al., 2015), but higher levels of position-based conservation have been postulated (Mohammadin et al., 2015; Wang et al., 2015). In total, 6,018 soybean, 2,248 chickpea, 5,794 Medicago, and 6,480 Arabidopsis lincRNAs were available for analysis. Reciprocal best BLAST (RBB) comparison uncovered 143 soybean lincRNAs which have sequence similarity to lincRNAs in other species, with 4 lincRNAs showing similarity to lincRNAs in both chickpea and Medicago. Because different tissue samples and discovery pipelines were used, it is possible that some conserved lincRNA pairs were missed. To address this, soybean lincRNAs were compared against full genome assemblies, which resulted in the discovery of 787 additional loci with sequence similarity to genomes of other species (Fig. 3A). Those could correspond to un-annotated non-coding transcripts. However, in the absence of evidence of transcription, their function remains unknown, and those loci were not considered in further analysis.
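The reciprocal best BLAST logic can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the authors' implementation: hits are assumed to be pre-parsed into (query, subject, bit score) tuples, the best hit is taken as the one with the highest bit score, and all identifiers in the example are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical hits between soybean and chickpea lincRNAs
soy_vs_chickpea = [("Gm_linc1", "Ca_linc7", 210.0),
                   ("Gm_linc1", "Ca_linc9", 95.0),
                   ("Gm_linc2", "Ca_linc9", 130.0)]
chickpea_vs_soy = [("Ca_linc7", "Gm_linc1", 208.0),
                   ("Ca_linc9", "Gm_linc1", 96.0)]

pairs = reciprocal_best_hits(soy_vs_chickpea, chickpea_vs_soy)
```

In this toy example only `Gm_linc1` and `Ca_linc7` form a reciprocal pair, because the best soybean hit of `Ca_linc9` is `Gm_linc1` rather than `Gm_linc2`.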

Positional conservation between lncRNA loci has been suggested to extend across longer evolutionary distances than sequence conservation (Mohammadin et al., 2015; Wang et al., 2015). A long non-coding RNA is often considered positionally conserved if found in the same orientation (upstream or downstream) relative to an orthologous protein-coding gene in at least two species (Mohammadin et al., 2015; Wang et al., 2015). If the direction of transcription of the lincRNA is known, transcription from the same strand is also required. However, it is conceivable that if a large number of lincRNAs are considered, a number of those will show positional similarity across species (found in the same orientation relative to protein-coding genes) by chance only, rather than as a result of evolutionary conservation. To test this, the number of soybean lincRNAs which had positional similarity with chickpea, Medicago, and A. thaliana lincRNAs was compared with control datasets constructed by random redistribution of lincRNAs across genomes of all four species. Two properties of lincRNA loci were considered while constructing the control datasets: (1) a proportion of lincRNA loci are found in clusters of two or more loci (mirroring this property in the control datasets will result in a more realistic distribution of lincRNAs), and (2) lincRNA loci are enriched proximal to transcription factors (uneven distribution of lincRNAs relative to transcription factors could affect the results if transcription factors are preferentially retained or lost from syntenic regions). To accommodate those, four types of control datasets (5 simulations each, 20 datasets in total) were constructed: (1) random re-distribution of lincRNAs relative to protein-coding loci; (2) random re-distribution of lincRNAs relative to protein-coding loci, but maintaining the proportion of lincRNAs found adjacent to transcription factors; (3) random re-distribution of existing lincRNA clusters relative to protein-coding loci; (4) random re-distribution of existing lincRNA clusters relative to protein-coding loci, but maintaining the proportion of lincRNAs found adjacent to transcription factors. The true biological lincRNA dataset and the simulated control datasets were analysed using the same positional conservation discovery pipeline. The number of positionally similar lincRNAs in the biological dataset (1,201) and in simulation datasets 1 and 2 was not significantly different (Fisher test, p-value > 0.01 for the majority of comparisons, Fig. 3B). However, more positionally similar lincRNAs were found in the biological dataset when compared to simulation datasets 3 and 4 (Fisher test, p-value < 0.01 for all comparisons). Although comparison with simulation datasets 3 and 4 suggests that the positional similarity observed is somewhat higher than expected by chance alone, the difference is not large (Fig. 3B). Results of analyses of positional conservation ought to be interpreted with caution, especially across larger evolutionary distances, and considered in conjunction with sequence similarity and analysis of expression patterns.
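The comparison of a biological count against a control count can be illustrated with a two-sided Fisher exact test on a 2x2 table. The sketch below is a plain-Python implementation using only the standard library; it assumes the table rows are the biological and the control dataset and the columns are positionally similar versus not similar lincRNAs, which is one plausible layout and is not spelled out in the text. The function name and the small example table are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


# Small hypothetical table: 8 of 10 similar in one set, 1 of 6 in the other
p_small = fisher_exact_two_sided(8, 2, 1, 5)
```

For the data described here, a call would take the form `fisher_exact_two_sided(1201, 6018 - 1201, k, 6018 - k)`, where `k` is the number of positionally similar lincRNAs in one simulated dataset; the values of `k` are not given in the text, so no result is shown.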

Strong support for positional conservation of lincRNAs, rather than chance positional similarity, would be any sequence similarity between transcripts. Comparison of positionally similar transcript pairs uncovered 48 soybean lincRNAs which show positional similarity and sequence similarity with other species. Sequence comparison of the positionally similar lincRNA pairs in simulated datasets (100 simulated datasets using random re-distribution of lincRNAs relative to protein-coding loci) showed them to have no sequence similarity (median number of pairs with sequence similarity per dataset: 0), suggesting that the sequence similarity observed was not due to chance alone (permutation test, p-value < 0.01). Subsequently, the 48 loci were analysed in more detail. Protein-coding genes and short RNA primary transcripts are known to have higher conservation levels than long non-coding RNAs (Hezroni et al., 2015). Some lncRNAs are known to be sRNA precursors. The 48 putative conserved lincRNAs were inspected to check whether they (1) show similarity to transposable elements (TEs), (2) could encode small conserved peptides which would be missed by the lincRNA discovery pipeline (peptides < 100 aa and with no similarity to proteins as evaluated by blastx) or (3) could be sRNA precursors. They were also compared against the NCBI RefSeq-RNA database to check for similarity with any other known ncRNAs. Only one of the 48 lincRNAs had similarity to TEs (Table S5). Three of the 48 lincRNAs had significant similarity to tens of sequences annotated as ncRNAs in the RefSeq database. However, a detailed analysis of the homologous region revealed it to contain a short 25 amino acid open reading frame (ORF) encoding the peptide RPL41 (ribosomal protein L41), which was embedded within a much longer transcript. Because of the short length of the peptide, transcripts carrying RPL41 were annotated as lncRNAs by the pipeline used in this study as well as by the NCBI annotation pipeline. Following this discovery, the entire lincRNA dataset was re-analysed to check for the presence of other RPL41 ORFs. However, only 5 lincRNA loci (including the three conserved ones) carried an RPL41 ORF. This finding does suggest that some of the transcripts classified as lncRNAs based on the discovery algorithm parameters used could, in fact, encode small peptides (Niazi and Valadkhan, 2012; Ruiz-Orera et al., 2014; Nelson et al., 2016). Short of extremely well-conserved examples like RPL41, in the absence of proteomic data these are impossible to discern. Six of the lincRNAs showed 100% identity to microRNAs, suggesting that they could be precursors of short RNAs. Re-analysis of the whole lincRNA dataset suggested that 56 lincRNAs could be microRNA precursors and that microRNA precursors were over-represented among the positionally and sequence conserved lincRNAs (Fisher test, p-value < 0.01). Finally, 19 lincRNAs showed similarity to other lncRNA transcripts in RefSeq, and those transcripts were from other species.

Almost two hundred homeologous lincRNA loci can be traced to a soybean-lineage-specific whole genome duplication which occurred ~13 MYA

The soybean genome has a paleopolyploid structure resulting in extensive homeology across chromosomes (Shoemaker et al., 2006). It has undergone two rounds of whole genome duplication: a more ancient event which occurred ~59 million years ago (MYA) and a soybean-lineage-specific paleotetraploidization which took place ~13 MYA. As a result, the soybean genome is composed of large blocks of homeologous regions (Schmutz et al., 2010). It is possible that, akin to protein-coding loci, homeologous lincRNA loci exist in the soybean genome. Following a procedure similar to the lincRNA positional similarity analysis performed between species, analysis of positional similarity of lincRNA loci within the soybean genome was performed. Again, control datasets 1, 2, 3 and 4 were used to compare the number of positionally similar lincRNA loci found to the number which would be expected by chance alone. The number of positionally similar lincRNA loci in the true biological dataset was significantly larger than the number found in any of the control datasets (Fisher test, p-value < 0.01 for all comparisons, Fig. 3C). The difference between biological and control datasets suggested that at least 200 to 300 lincRNA loci with homeologs in the soybean genome were to be expected. Sequences of the lincRNA pairs with positional similarity within the soybean genome were compared, which allowed identification of 103 pairs of homeologous loci (Table S6). Sequence comparison of the positionally similar lincRNA pairs in simulated datasets (100 simulated datasets using random re-distribution of lincRNAs relative to protein-coding loci) showed them to have no sequence similarity (median number of pairs with sequence similarity per dataset: 0), again suggesting that the sequence similarity observed was not due to chance alone (permutation test, p-value < 0.01). The number also roughly corresponds to the predictions based on comparison of positional similarity in biological and control datasets.
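One common way to obtain a permutation-test p-value from simulated datasets is the empirical estimate with a +1 correction, sketched below in Python. The text does not state which estimator was used, so this choice is an assumption; the function name is hypothetical, and the simulated counts are placeholders (the text reports only that the median count per simulated dataset was 0).

```python
from typing import Sequence


def empirical_p_value(observed: int, simulated: Sequence[int]) -> float:
    """One-sided empirical p-value with a +1 correction.

    Counts the simulated datasets whose statistic is at least as large as
    the observed one; the +1 in numerator and denominator prevents a
    p-value of exactly zero when no simulation reaches the observed value.
    """
    n_as_extreme = sum(1 for s in simulated if s >= observed)
    return (n_as_extreme + 1) / (len(simulated) + 1)


# Placeholder counts: 100 simulated datasets, none with any similar pairs
simulated_counts = [0] * 100
p_homeologs = empirical_p_value(103, simulated_counts)
```

With 100 simulations and no simulated count reaching the observed 103 pairs, this estimator gives 1/101, which is just below 0.01; this is also the smallest value that 100 simulations can yield under the +1 correction.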

The age of homeologous blocks can be established using pairwise synonymous distance (Ks values) of paralogues (Schlueter et al., 2004; Pfeil et al., 2005; Schmutz et al., 2010). In the case of soybean, Ks values of 0.06–0.39 correspond to the 13-Myr genome duplication and Ks values of 0.40–0.80 to the 59-Myr genome duplication (Schmutz et al., 2010). The vast majority of Ks values of protein-coding gene pairs flanking homeologous lincRNA loci fall within the 0.06–0.39 range (Fig. 3D), suggesting a more recent origin resulting from the soybean-lineage-specific paleotetraploidization. It is also possible that some homeologous loci representing the ~59 MYA duplication do exist, but sequence divergence prevents their identification. Taken together, results of inter- and intra-species comparisons suggest that while the life-span of a soybean lincRNA can exceed 15 MY, it is unlikely to extend over 60 MY.
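The Ks ranges quoted above translate directly into a small classification helper, sketched here in Python. The function name and labels are hypothetical; values outside both ranges, including those falling between 0.39 and 0.40, are left unassigned because the text gives no rule for them.

```python
def classify_ks(ks: float) -> str:
    """Assign a Ks value to a soybean duplication event using the ranges
    quoted from Schmutz et al. (2010) in the text."""
    if 0.06 <= ks <= 0.39:
        return "13-Myr duplication"
    if 0.40 <= ks <= 0.80:
        return "59-Myr duplication"
    return "unassigned"


# Hypothetical Ks values for gene pairs flanking homeologous lincRNA loci
flanking_ks = [0.08, 0.12, 0.21, 0.55, 0.95]
labels = [classify_ks(k) for k in flanking_ks]
```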

Functional enrichment analysis of protein-coding genes flanking homeologous loci revealed over-representation of genes involved in response to abiotic stimuli, including cellular response to phosphate starvation and response to absence of light (Table S7). Finally, the co-expression of homeologous lincRNA loci was significantly higher (Fig. 3E, Mann-Whitney U test, p-value < 0.01) when compared to randomly selected lincRNA locus pairs, suggesting at least partial conservation of expression patterns.

The lincRNAs show highly tissue specific expression

Expression of lincRNAs across all tissues was investigated using a combination of a straightforward counting method and the Tau specificity index, which were recently shown to be the most successful methods of expression characterization (Kryuchkova-Mostacci and Robinson-Rechavi, 2017). lincRNAs displayed more tissue specific expression than protein-coding genes (Fig. 4A). Any given lincRNA was on average expressed in 8 samples (median: 6.0), whereas any given protein-coding gene was on average expressed in 23 samples (median: 30). Only 27 lincRNAs were expressed in all the samples. The tissue with the highest number of lincRNAs expressed (FPKM > 0.1) was floral tissue, followed by shoot apical meristem and leaf, suggesting an active role of lincRNAs in flowering and developmental processes. The sample with the highest number of lincRNAs expressed in total and uniquely was flower bud (flower1; 1891 lincRNAs expressed in total, 51 expressed uniquely) (Fig. 4B, C, Table S3). The large number of lincRNAs expressed in SAM is consistent with previous observations in chickpea and other plants (Khemka et al., 2016). Overall, samples from the same tissue show similar expression patterns (Fig 4D). Samples representing shoot apical meristem (SAM), leaf, flower, and seed are grouped together. Nine of the samples from two tissues (leaf and SAM) represent the floral transition period following short day treatment. In total 366 lincRNAs were uniquely expressed in the floral transition samples and of these 363 (99% of all lincRNAs) were expressed following short day treatment, with 89, 128 and 149 lincRNAs expressed in leaf only, SAM only and leaf and SAM respectively. These lincRNAs represent an interesting target for the study of the mechanism of soybean floral transition.

The specificity of lincRNA expression can be better contextualized when compared with different groups of protein-coding genes. The lincRNA tissue expression patterns were compared with expression patterns of protein-coding genes representing different specificity groups (transcription factors – high specificity, protein phosphorylation – medium specificity, translation – low specificity). lincRNAs have higher tissue specificity than any of the protein-coding gene groups, but the expression pattern is closest to the transcription factors (Fig. 4E). Transcription factors are known master regulators of gene expression and the parallels observed can suggest similar roles of lincRNAs. The high tissue-specific lincRNA expression supports the idea of their highly specialized, possible regulatory functions. It also allows for the possibility of using lincRNAs as tissue type and state markers.

The lincRNA-protein-coding gene co-expression network and position of lincRNAs relative to protein-coding neighbours allow functional annotation of non-coding RNAs

Functional annotation of long noncoding RNAs poses a considerable challenge. In the case of protein-coding genes, extensive information about the function of a gene in a model organism is often available, and sequence homology can be used to transfer existing annotation to newly discovered loci. In the case of lincRNAs, very few functional assignments exist, and lack of sequence homology hampers inter-species comparisons (Rinn and Chang, 2012; Smith and Mattick, 2017). The primary form of annotation involves construction of a co-expression network and use of the so-called ‘guilt-by-association’ method. Correlation of expression between lincRNAs and protein-coding genes can imply involvement in common biological processes. Spearman correlation between expression of lincRNA and protein-coding loci was calculated. Only significant correlations were used in the analysis (p-value < 0.05, p-value adjusted for multiple comparisons using method ‘holm’). The resulting distribution of correlation coefficients is presented in Fig. S2A. The minimum absolute value of correlation coefficient used in the analysis was 0.84. A higher number of positive than negative correlations was observed, as well as a large number of perfect correlations (rho=1). A similar observation was made in a human lncRNA annotation project, noting a higher number of positive correlations (Derrien et al., 2012). The high number of perfect correlations is due to the high tissue specificity of lincRNA expression. lincRNAs were annotated using a hub-based approach (Liao et al., 2011). GO enrichment analysis of protein-coding 1st-degree neighbours resulted in functional annotation of 1,574 lincRNAs (Table S8). The summary of the GO annotation mapped to GOslim terms is presented in Fig. S2B. Overall, lincRNAs are annotated with a range of functions including stress response, signal transduction and DNA methylation. Genes which are specifically or highly expressed in a given tissue are considered likely to contribute to relevant biological processes (Boyle et al., 2017). Clustering of lincRNAs based on their expression across tissues showed that genes which have peak expression in a given tissue are likely to have overall similar expression profiles (Fig. S3), implying involvement in common biological processes. The lincRNAs were divided based on the tissue with peak expression (each set contained lincRNAs with peak expression in a given tissue) and GO enrichment for each of the lincRNA sets (peak expression in: cotyledon, shoot apical meristem, flower, leaf, leaf bud, pod, pod seed, seed, stem, root, Fig. 5) was calculated. The enrichment of highly or specifically expressed lincRNA functions correlated well with the tissue-associated biological processes. For example, functionally annotated lincRNAs expressed in shoot apical meristem, floral tissue and root were highly enriched with processes associated with regulation of photoperiodism, sexual reproduction and phloem transport respectively. The results suggest possible involvement of lincRNAs in tissue specific biological processes.

Finally, lincRNAs often exert their function on neighbouring protein-coding genes; therefore, analysis of overrepresentation of classes of protein-coding genes flanking lincRNA loci provides an additional source of functional annotation. The genes flanking lincRNAs were enriched in functions associated with transcription and development, suggesting possible lincRNA involvement in these processes (Table S9).

Several lincRNAs are potentially related to agronomic traits

Genome-wide association studies (GWAS) have been successful in uncovering the genetic basis of trait variation and linking causal loci to phenotypic traits. However, only a portion of variants identified by GWA studies can be assigned to protein-coding genes (Sonah et al., 2015; Zhang et al., 2015; Zhou et al., 2015; Zhou et al., 2015). Some of the remaining intergenic trait-associated variants can potentially be assigned to lincRNAs and serve as an additional source of functional annotation. In total 316 SNPs identified as associated with agronomic traits were used in the analysis. A lincRNA was identified as potentially related to a trait if the SNP was found within the lincRNA locus or the locus was closer to the SNP than any protein-coding gene. In total 23 lincRNA candidates have been identified (Fig S4). Six of the lincRNAs overlapped trait-associated SNPs; the remainder were found in close proximity (median distance 981 bp). The putative trait-related lincRNAs are enriched in multi-exon loci (Fisher test, p-value < 0.01). The SNPs proximal to candidate lincRNA loci were related to traits such as number of days to flowering, number of days from flowering to maturity and number of seeds per pod.

Several loci are typically found in the vicinity of a trait associated SNP and it is usually not immediately obvious which may contribute to the trait. Accordingly, although the 23 lincRNAs were found closer to the SNP than any protein-coding gene, it is possible that a more distal coding gene contributes to the trait instead of the lincRNA (an interaction between the lincRNA and a neighbouring protein-coding gene is also possible). To add more confidence to the functional predictions, the genomic-position-only based analysis was supplemented with investigation of expression patterns of neighbouring genes. For each of the 23 putative trait related lincRNAs, the samples with peak expression for the lincRNA as well as 5 upstream and 5 downstream protein-coding genes were investigated. The lincRNAs were considered more likely to influence the trait if they showed peak expression in a relevant tissue (for example, a lincRNA associated with days to flowering being highly expressed in shoot apical meristem upon short day treatment, Table S10, Fig. S4). As a result, the top six lincRNAs which were found in the vicinity of trait associated SNPs and showed consistent expression patterns were analysed in more detail (Fig 6A). Interestingly, 4 out of 6 had positionally similar lincRNAs in other species. Two of them showed expression in similar tissue types across species (NC_GMAXST00018683, NC_GMAXPA00061260); for the remaining two, expression data in relevant tissue in chickpea and Medicago were unavailable. One of the lincRNAs (NC_GMAXST00018683), which overlapped a SNP associated with the number of days to flowering and had peak expression in shoot apical meristem upon short day treatment, had positional similarity with a lincRNA in chickpea (Fig. 6B). Comparison of expression patterns across samples in soybean and chickpea showed the lincRNAs to be expressed in flower buds and shoot apical meristem (SAM) in both species (Fig. 6B). The other lincRNA (NC_GMAXPA00061260) was found 223 bp from a SNP associated with number of seeds per pod and again had a positionally similar lincRNA in chickpea. Both lincRNAs showed peak expression in mature flowers (Fig 6C). The proximity to trait associated SNPs, expression in relevant tissues and conservation of expression patterns across species make them likely candidates for trait related lincRNAs. Combination of proximity to trait associated SNPs and expression profile, as well as intra-species conservation, has been successfully used for functional annotation of lncRNAs in other species including human, zebrafish, rice and maize (Ulitsky et al., 2011; Gong et al., 2015; Wang et al., 2015; Hon et al., 2017). In human, the study incorporating expression and genetic data found that lncRNAs which harboured trait associated SNPs were also specifically expressed in tissues relevant to the trait, leading the authors to conclude that the lncRNAs are likely functional and play important roles in disease (Hon et al., 2017). Further, the putative functional lncRNAs also exhibited higher levels of conservation (Hon et al., 2017). Similarly, in maize, SNPs associated with leaf morphological traits were significantly enriched in genomic loci encoding maize lincRNAs, leading the authors to suggest roles of lincRNAs in control of agronomic traits (Wang et al., 2015). Even without the support of GWAS data, lncRNA conservation itself was also found to be indicative of functionality. In zebrafish, lincRNAs selected based on their tissue-specific expression and synteny with mammalian lincRNAs were shown to be important for developmental processes (Ulitsky et al., 2011). Taken together, the availability of evidence from several sources, and earlier studies suggesting that GWAS, expression profile and conservation evidence is highly indicative of lncRNA functionality, give additional confidence in the functional predictions.

Conclusions

The soybean genome encodes several thousand long intergenic non-coding RNAs, and several lincRNAs may be related to agronomic traits. Further investigations of detailed function and regulation, including identification of interacting partners and regulators of the lincRNAs, will elucidate their mechanism of action. The study also provides evidence that the network controlling and implementing biological processes in soybean involves complex interactions between proteins and long and short non-coding RNAs. Further, this study presents a first, comprehensive atlas of lincRNAs in the soybean genome and paves the way for future research.

Materials and Methods

Data

RNASeq sequence data corresponding to Sequence Read Archive (SRA) projects SRP020868 and PRJNA238493 were downloaded (full list of accessions can be found in Table S1). Glycine max genome assembly (Gmax_275_v2.0) and corresponding annotation (Gmax_275_Wm82.a2.v1) were downloaded from Phytozome v10.

lincRNA annotation

Reads were mapped to the reference genome using HISAT2 (Kim et al., 2015) v2.0.5 (--min-intronlen 20 --max-intronlen 2000). For each accession transcripts were assembled and subsequently merged using StringTie (Pertea et al., 2015) v1.3.0 (--merge -F 0.5 -T 0.5 -G Gmax_275_Wm82.a2.v1.gene_exons.gff3). Reads were also assembled using Trinity (Grabherr et al., 2011) v2.3.2. Both de novo (--seqType fq --max_memory 50G --verbose --normalize_reads --trimmomatic --CPU 16) and reference guided (--genome_guided_bam --genome_guided_max_intron 10000 --max_memory 50G --verbose --CPU 16, reads trimmed and normalized during de novo Trinity run were used) assemblies were performed. The resulting StringTie and Trinity assemblies were supplied to PASA (Haas et al., 2003) in order to build a comprehensive transcriptome database using the procedure described in the PASA user guide (http://pasapipeline.github.io/). The aligner used was BLAT and MAX_INTRON_LENGTH was set to 2000. StringTie only and PASA transcripts were processed in parallel to identify potential lncRNAs (Fig. S1). Transcripts >200 bp in length were subjected to ORF discovery using OrfPredictor (Min et al., 2005) v3.0. Transcripts with ORF >300 bp (100 aa) were considered coding. Remaining transcripts were extracted and subjected to DIAMOND (Buchfink et al., 2015) v0.8.25 blastx search (--more-sensitive --evalue 0.01) against the NCBI nr database (obtained on 23.10.2016). Transcripts which had a significant blastx hit were considered coding. The remaining transcripts fulfilling the three criteria (1) length >200 bp, (2) ORF size <=300 bp, (3) no significant blastx hit were considered putative lncRNAs. A gene was considered coding if at least one transcript was coding. A gene was considered non-coding if none of the transcripts were coding. The positions of non-coding genes from StringTie and PASA annotations were compared against positions of coding genes in both annotations. If the putative lncRNA gene did not overlap any coding loci it was considered a lincRNA gene. lincRNA loci from both annotations were merged. If lincRNA loci from both annotations had positional overlap, the StringTie annotation was kept. Finally, reads mapping to genes were counted using Subread v1.5.1 featureCounts (Liao et al., 2014) (-p -B -P -d 0 -D 1000) and FPKM values were calculated for each gene (10^9 * fragments mapped to exons / (assigned fragments * total length of exons)). lincRNAs which did not have an FPKM value larger than 0.1 in at least one of the samples were discarded.
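The FPKM calculation and the expression filter at the end of this procedure can be sketched as below. This is an illustrative sketch only: the function names and example counts are hypothetical, and reading the filter as "FPKM above 0.1 in at least one sample" is an interpretation of the text.

```python
def fpkm(exon_fragments, assigned_fragments, exon_length_bp):
    """FPKM = 10^9 * fragments mapped to exons
              / (total assigned fragments * total exon length in bp)."""
    return 1e9 * exon_fragments / (assigned_fragments * exon_length_bp)


def passes_expression_filter(fpkm_values, threshold=0.1):
    """Keep a lincRNA if its FPKM exceeds the threshold in at least one sample."""
    return any(value > threshold for value in fpkm_values)


# Hypothetical gene: 50 fragments on 1,000 bp of exons,
# in a library with 20 million assigned fragments.
value = fpkm(50, 20_000_000, 1_000)
print(value, passes_expression_filter([0.0, 0.05, value]))
```

The factor of 10^9 combines the per-kilobase (10^3) and per-million-fragments (10^6) scaling.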

lincRNA functional annotation

LincRNA functional annotation was performed by building a lincRNA-protein-coding gene co-expression network. Co-expression was measured between identified lincRNA loci and protein-coding loci from the Gmax_275_Wm82.a2.v1 annotation updated by StringTie. FPKM values were used for Spearman correlation calculation. Correlation coefficients and corresponding p-values were calculated using the corr.test function of the R package psych. Adjustment for multiple comparisons was performed using method ‘holm’. Only lincRNA-protein-coding gene pairs with p-value < 0.05 were retained. All the protein partners were functionally annotated using Blast2GO (Conesa et al., 2005) (nr subset corresponding to "Arabidopsis"[porgn] OR "Oryza"[porgn] OR "Sorghum"[porgn] OR "Glycine"[porgn] OR "Medicago"[porgn] OR "Brachypodium"[porgn]). For each of the lincRNAs all the proteins which were significantly correlated were gathered and GO enrichment of the biological processes (BP) category was calculated using topGO (Alexa et al., 2006) v2.22.0. All proteins in correlation with lincRNAs were used as background. Adjustment for multiple comparisons was performed using method ‘weight’. GO terms which were significantly enriched were assigned to the corresponding lincRNA as functional annotation (p-value cut-off 0.05). The GO terms were mapped to the plant GOslim terms using the Map2Slim option of owltools.
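The Holm adjustment and the retention of significant pairs can be illustrated in a few lines. The analysis itself was run in R; the sketch below is a Python re-expression of the standard Holm step-down procedure, with hypothetical pair names and p-values, and is not the authors' code.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment: multiply the k-th smallest p-value by
    (m - k + 1), enforce monotonicity with a running maximum, cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical lincRNA/protein-coding pairs with raw correlation p-values.
pairs = ["linc1-geneA", "linc1-geneB", "linc2-geneC"]
raw_p = [0.01, 0.04, 0.03]
kept = [pair for pair, p in zip(pairs, holm_adjust(raw_p)) if p < 0.05]
print(kept)
```

Because the correction scales with the number of tests, a very large lincRNA by protein-coding comparison matrix leaves only strong correlations, which is consistent with the high minimum absolute coefficient reported in the results.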

Sequence based similarity of lincRNAs

Sequence based similarity of lincRNAs was measured using reciprocal best BLAST, BLAST+ v2.5.0 (-task blastn -evalue 1e-3). Best hits were identified by lowest e-value. Coordinates of chickpea lincRNAs were obtained from Khemka et al. (Khemka et al., 2016), Medicago lincRNAs from Wang et al. (Wang et al., 2015) and A. thaliana lincRNAs from http://chualab.rockefeller.edu/gbrowse2/homepage.html. The lincRNA sequences were extracted from genome assemblies (chickpea Cicer_arietinum_GA_v1.0, Medicago Mt4.0v1 and A. thaliana TAIR9). Comparisons against genome sequence were performed using BLAST+ v2.5.0 (-task dc-megablast -evalue 1e-3). In order to remove spurious hits due to the presence of transposable elements or repetitive sequences, lincRNAs which had more than three matches in the genome were excluded. Additionally, the most significant HSP between lincRNA and the genome was required to cover at least 10% of the lincRNA.
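The reciprocal-best-hit step can be sketched as follows, assuming the BLAST results have already been parsed into (query, subject, e-value) tuples. The function names and sequence identifiers are hypothetical, and ties in e-value are resolved here by keeping the first hit seen, which the text does not specify.

```python
def best_hits(hits):
    """Map each query to the subject with the lowest e-value.
    hits: iterable of (query, subject, evalue) tuples."""
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical parsed hits between soybean (gm) and chickpea (ca) lincRNAs.
soy_vs_chickpea = [("gm1", "ca1", 1e-30), ("gm1", "ca2", 1e-5), ("gm2", "ca1", 1e-10)]
chickpea_vs_soy = [("ca1", "gm1", 1e-28), ("ca2", "gm1", 1e-4)]
print(reciprocal_best_hits(soy_vs_chickpea, chickpea_vs_soy))
```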

Transposable element (TE) composition of lincRNAs

The soybean TE database was obtained from SoyBase (SoyBase_TE_Fasta.txt). The lincRNA transcripts were compared against the TE database using BLAST+ v2.2.30 (blastn -task megablast -evalue 1e-5). A set of 50,000 random non-overlapping intervals which did not overlap lincRNAs was identified in the soybean genome using regioneR (Gel et al., 2016). The corresponding sequences were extracted and compared against the TE database with the same BLAST parameters as for lincRNAs.

Centromeric lincRNA identification

Centromeres were identified by the presence of two soybean centromere specific repeats, CentGm-1 and CentGm-2. CentGm-1 and CentGm-2 were compared against the soybean genome (Gmax_275_v2.0) using BLAST+ v2.2.30 (blastn -task megablast). The coordinates of the centromere for a given chromosome corresponded to the 1st and 3rd quartile of CentGm-1 and CentGm-2 match coordinates. LincRNAs which fell within centromeres were identified as centromeric lincRNAs.
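A minimal sketch of the quartile-based centromere definition is given below. It assumes the repeat match coordinates for one chromosome are available as a list of positions. The function names and coordinates are hypothetical. The quartile interpolation method and the requirement that a lincRNA lie entirely inside the interval are assumptions, since the text does not specify either.

```python
from statistics import quantiles


def centromere_interval(match_positions):
    """Return (1st quartile, 3rd quartile) of the repeat match coordinates."""
    q1, _median, q3 = quantiles(match_positions, n=4, method="inclusive")
    return q1, q3


def is_centromeric(start, end, interval):
    """True if the locus lies entirely within the centromere interval."""
    return interval[0] <= start and end <= interval[1]


# Hypothetical CentGm-1/CentGm-2 match coordinates on one chromosome.
interval = centromere_interval([10, 20, 30, 40, 50])
print(interval, is_centromeric(25, 35, interval))
```

Using the interquartile range rather than the minimum and maximum makes the interval robust to isolated repeat copies far from the main centromeric array.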

Position based similarity of lincRNAs

Syntenic blocks between genomes of soybean (Gmax_275_v2.0), chickpea (Cicer_arietinum_GA_v1.0), Medicago (Mt4.0v1) and A. thaliana (TAIR10) were identified using MCScanX (Wang et al., 2012). The syntenic blocks were used to identify positional similarity between soybean lincRNAs and lincRNAs from other species. For each lincRNA five protein-coding neighbours upstream and downstream were extracted. The neighbours were then compared with collinear blocks identified by MCScanX. The lincRNA was said to belong to a collinear block if at least three out of ten protein-coding neighbours were found in the block. lincRNAs from two species were said to be positionally similar if they belonged to the same collinear block, at least one of the two pairs of flanking protein-coding genes was identified as orthologous and the lincRNAs shared the same relative position (upstream or downstream) with respect to the orthologous gene/genes. The lincRNA loci which shared positional similarity were compared using BLAST+ v2.5.0 (-task blastn -evalue 1e-3). Comparison against the RefSeq RNA database (downloaded on: 27.06.2017) was also performed with BLAST+ v2.5.0 (-task blastn -evalue 1e-3).
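The block-membership rule (at least three of the ten flanking protein-coding neighbours present in a collinear block) can be written as a simple set intersection. The sketch below is illustrative only: the function name and gene identifiers are hypothetical, and it covers just the membership test, not the orthology and relative-position checks that follow it.

```python
def in_collinear_block(neighbours, block_genes, min_shared=3):
    """True if at least min_shared of the lincRNA's flanking protein-coding
    neighbours (five upstream + five downstream) are members of the block."""
    return len(set(neighbours) & set(block_genes)) >= min_shared


# Hypothetical identifiers: ten neighbours of one lincRNA and one MCScanX block.
neighbours = [f"gene{i}" for i in range(1, 11)]
block = {"gene2", "gene3", "gene9", "gene42"}
print(in_collinear_block(neighbours, block))
```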

Generation of control datasets

The control datasets were generated by assigning existing lincRNAs to new protein-coding neighbours, taken from the pool of all protein-coding genes found in the genome. For datasets 1 and 2 the coordinate sorted full list of protein-coding genes was shuffled using the Linux shuf function, which generates random permutations, and the first n genes, corresponding to the number of lincRNAs in a given dataset, were assigned to existing lincRNAs. The assigned protein-coding gene became the new downstream protein-coding neighbour and the new lincRNA position was immediately upstream of the protein-coding gene assigned. For datasets 3 and 4 the procedure was similar, but the existing lincRNA clusters were kept together.

Calculation of synonymous substitution rate

The synonymous substitution rates were computed between pairs of genes identified as homeologous by MCScanX. Proteins were aligned by Clustal Omega v1.2.0 (Sievers et al., 2011). The protein alignments were converted to nucleotide alignments using PAL2NAL v14 (Suyama et al., 2006). The Ks values were calculated using PAML (yn00) v4.7 (Yang, 2007).

Selection of protein groups for comparison of tissue specific expression with lincRNAs

The protein-coding genes were divided into three categories: genes expressed in no more than 15 samples (high specificity expression pattern), genes expressed in 16 to 35 samples (medium specificity expression pattern) and genes expressed in more than 35 samples (low specificity expression pattern). For each group enrichment analysis of the biological processes category was performed using topGO (Alexa et al., 2006), using all protein-coding genes as background. Adjustment for multiple comparisons was performed using method ‘weight’. For each category a representative process was chosen (the process with the highest number of significant genes among the top ten enriched GO terms). All the genes from a given category annotated with the representative process were gathered and Tau specificity indices were calculated (Yanai et al., 2005).
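The grouping by number of samples and the Tau index can be sketched as below. The Tau formula is the standard definition from Yanai et al. (2005), Tau = sum(1 - x_i / max(x)) / (n - 1). The function names and example profiles are hypothetical, and whether expression values were transformed before computing Tau is not stated in the text, so raw values are assumed here.

```python
def specificity_group(n_samples_expressed):
    """Category by number of samples with expression, as defined in the text."""
    if n_samples_expressed <= 15:
        return "high"
    if n_samples_expressed <= 35:
        return "medium"
    return "low"


def tau(expression):
    """Tau specificity index: 0 for uniform expression across samples,
    1 for expression confined to a single sample."""
    n = len(expression)
    peak = max(expression)
    if n < 2 or peak <= 0:
        raise ValueError("need at least two samples and non-zero expression")
    return sum(1 - value / peak for value in expression) / (n - 1)


# Hypothetical expression profiles (FPKM across four samples).
print(tau([5, 5, 5, 5]), tau([0, 0, 0, 8]), specificity_group(6))
```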

Identification of lincRNAs potentially related to agronomic traits

The positions of single nucleotide polymorphisms (SNPs) associated with agronomic traits identified by Zhou et al. (Zhou et al., 2015), Zhang et al. (Zhang et al., 2015), Zhou et al. (Zhou et al., 2015), Sonah et al. (Sonah et al., 2015) and Fang et al. (Fang et al., 2017) were obtained. Some of the SNPs were originally discovered against an older version of the soybean genome (NCBI accession GCA_000004515.1); therefore their coordinates were transferred to the Gmax_275_v2.0 genome assembly using the NCBI remap tool (https://www.ncbi.nlm.nih.gov/genome/tools/remap). The lincRNA was considered potentially related to an agronomic trait if it either harboured a SNP identified in association studies or it was closer to a SNP than any protein-coding gene and no further than 10 kb.
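The assignment rule can be sketched as a distance comparison. This is an illustration under stated assumptions: all features are on the same chromosome, loci are given as (start, end) tuples, distance is measured to the nearest edge of a locus, and the function names and coordinates are hypothetical.

```python
def distance(pos, start, end):
    """Distance from a SNP position to a locus; 0 if the SNP lies inside it."""
    if start <= pos <= end:
        return 0
    return start - pos if pos < start else pos - end


def is_trait_candidate(snp_pos, lincrna, coding_genes, max_dist=10_000):
    """True if the lincRNA harbours the SNP, or is closer to it than any
    protein-coding gene and no further than max_dist (10 kb) away."""
    d_linc = distance(snp_pos, *lincrna)
    if d_linc == 0:
        return True
    d_coding = min(
        (distance(snp_pos, start, end) for start, end in coding_genes),
        default=float("inf"),
    )
    return d_linc < d_coding and d_linc <= max_dist


# Hypothetical locus: lincRNA at 1,000-2,000 bp, SNP 223 bp downstream of it.
print(is_trait_candidate(2_223, (1_000, 2_000), [(5_000, 6_000)]))
```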

Code availability: The code used for generation of all the figures can be found at: https://github.com/agolicz/lncRNAs-Plots

Data availability: The dataset described in the manuscript can be downloaded from: https://osf.io/d7qz2/

Acknowledgements

Financial support for this work from the ARC Discovery Grant ARC DP0988972 is gratefully acknowledged. This research was supported by Melbourne Bioinformatics at the University of Melbourne, project UOM0033.

Figures

Fig 1. Comparison of properties of protein-coding and lincRNA genes. LincRNA genes differ from protein-coding genes with respect to transcript length, number of exons per transcript, number of transcripts per gene and transcriptional profile. (A) Comparison of transcript lengths of coding and non-coding genes. Non-coding genes have shorter transcripts. (B) Comparison of the number of transcripts found in coding and non-coding genes. Non-coding genes have fewer isoforms. (C) Comparison of the number of exons found in transcripts of coding and non-coding genes. Transcripts of non-coding genes have a lower number of exons. (D) Comparison of log2(FPKM) values of coding and non-coding genes. FPKM values were calculated based on counts produced by featureCounts. Non-coding genes show lower expression levels compared to protein-coding genes. (E) Plot presenting distribution of protein-coding and lincRNA loci across 20 soybean chromosomes. lincRNA loci are evenly distributed across chromosomes, whereas protein-coding genes show lower density in centromeric regions. Starting from the outer ring: (1) protein-coding genes (2) all lincRNA genes (3) non-TE lincRNA loci (4) lincRNA loci with transcripts harbouring TEs.

Fig 2. Transposable element composition of lincRNA. (A) Types of TEs found within lincRNA transcripts and in 50,000 randomly selected regions of soybean genome. TE composition of lincRNAs follows that of soybean genome. RLG = LTR Gypsy, RLC = LTR Copia, RIu = LINE, RIL = LINE L1, DTT = Tc1-Mariner, DTO = PONG, DTM = Mutator, DTH = PIF-Harbinger, DTC = CACTA, DHH = Helitron, * = p-value < 0.01, Fisher test. (B) Expression patterns of lincRNAs located in centromeric regions (n=32). (C) TE composition of centromeric lincRNAs.

Fig 3. Conservation of lincRNA loci in chickpea, Medicago and A. thaliana. (A) Number of soybean lincRNA loci showing sequence similarity with lincRNAs or genomes of other species. (B) Number of soybean lincRNA loci in biological and control datasets showing positional similarity with lincRNAs in other species. (C) Number of soybean lincRNA loci in biological and control datasets showing positional similarity with other lincRNAs in soybean genome. (D) Ks values calculated for protein-coding gene pairs flanking homeologous lincRNA loci and a random selection of homeologous protein-coding gene pairs (n=444). The distribution of Ks values representing the random selection has two peaks corresponding to two duplication events. The protein pairs flanking homeologous loci mostly represent a single, more recent duplication. (E) Correlation of expression between homeologous lincRNAs and a random selection of lincRNA pairs (n=3,000). Homeologous loci have higher levels of co-expression.

Fig 4. Sample specific lincRNA expression. (A) Comparison of the number of samples showing expression of coding and non-coding genes. Expression of non-coding genes is more sample specific. (B) Number of total and unique lincRNA genes expressed in each tissue (samples from different time points for the same tissue were combined). Tissues with the highest total number of lincRNAs expressed and the highest number of uniquely expressed lincRNAs are floral tissue and shoot apical meristem. (C) Number of lincRNAs expressed in each sample. The sample with most lincRNAs expressed is flower1. (D) Heatmap showing relationships between samples. Samples from the same tissue have similar lincRNA expression profiles and cluster together. The colour corresponds to distance calculated as 1-cor(log1p(FPKM)). Clustering was performed using hclust, method=complete. (E) Tau expression specificity index calculated for lincRNA loci and three groups of proteins representing different biological processes. Higher values of Tau correspond to more sample specific expression. Cotyledon 1 – germination stage; cotyledon 2 – trefoil stage; flower 1 – flower bud differentiation stage; flower 2 – flowering stage, bud before flowering; flower 3 – flowering stage, florescence; flower 4 – flowering stage, 5 days after flowering; flower 5 – flowering stage, florescence, different stage; leaf 1 – trefoil stage; leaf 2 – flower bud differentiation stage; leaf 3 – senescent leaves; leafbud 1 – germination stage; leafbud 2 – trefoil stage; leafbud 3 – flower bud differentiation stage; pod seed 1 – two weeks; pod seed 2 – three weeks; pod seed 3 – four weeks; pod 1 – three weeks; pod 2 – four weeks; pod 3 – five weeks; seed 1 – three weeks; seed 2 – five weeks; seed 3 – six weeks; seed 4 – eight weeks; seed 5 – ten weeks; shoot meristem – flower bud differentiation stage; stem 1 – germination; stem 2 – trefoil stage; root – germination stage; sam sd0 – shoot apical meristem – before short day treatment; sam sd1-4 – shoot apical meristem – short days 1-4; leaf sd0 – leaf – before short day treatment; leaf sd1-3 – leaf – short days 1-3.

Fig 5. Significantly enriched biological processes among lincRNAs which show peak expression in a given tissue. Enrichment calculated using topGO, adjustment for multiple comparisons using method ‘weight’, p-value < 0.01.

Fig 6. Analysis of potential trait-related lincRNAs. (A) For each of the putative trait-related lincRNAs, the plot presents eleven genes found in close vicinity of the trait-associated SNP (1 putative trait-related lincRNA + 5 downstream protein-coding genes + 5 upstream protein-coding genes). Each point represents a gene (red – lincRNA, blue – protein-coding) and is labelled with the sample with peak expression (the y axis represents the actual expression value). (B, C) Expression of a lincRNA in soybean samples and its positional orthologue in chickpea. The x axis corresponds to samples; the y axis corresponds to the FPKM values.
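The eleven-gene window in panel (A) can be sketched as follows. This is an illustration only: the function name `flanking_genes`, the `(name, start, end)` tuple layout, and the choice to define the two sides purely by genomic coordinate (ignoring strand) are assumptions, not details taken from the legend.

```python
def flanking_genes(linc_start, linc_end, coding_genes, k=5):
    """Return the k nearest protein-coding genes on each side of a lincRNA.

    coding_genes: iterable of (name, start, end) tuples on the same
    chromosome as the lincRNA. Sides are defined by coordinate only;
    strand is ignored (an assumption of this sketch).
    """
    # Genes lying entirely before the lincRNA, ordered by their end
    # coordinate; the last k are the closest ones.
    before = sorted(
        (g for g in coding_genes if g[2] < linc_start),
        key=lambda g: g[2],
    )[-k:]
    # Genes lying entirely after the lincRNA, ordered by their start
    # coordinate; the first k are the closest ones.
    after = sorted(
        (g for g in coding_genes if g[1] > linc_end),
        key=lambda g: g[1],
    )[:k]
    return before, after
```

Together with the lincRNA itself, the two returned lists give the 5 + 1 + 5 genes plotted per trait-associated locus.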


Parsed CitationsAlexa A, Rahnenführer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GOgraph structure. Bioinformatics 22: 1600-1607

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ariel F, Jegu T, Latrasse D, Romero-Barrios N, Christ A, Benhamed M, Crespi M (2014) Noncoding transcription by alternative rnapolymerases dynamically regulates an auxin-driven chromatin loop. Molecular Cell 55: 383-396

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Bardou F, Ariel F, Simpson CG, Romero-Barrios N, Laporte P, Balzergue S, Brown JWS, Crespi M (2014) Long Noncoding RNAModulates Alternative Splicing Regulators in Arabidopsis. Developmental Cell 30: 166-176

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Berry S, Dean C (2015) Environmental perception and epigenetic memory: Mechanistic insight through FLC. Plant Journal 83: 133-148Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Boyle EA, Li YI, Pritchard JK (2017) An Expanded View of Complex Traits: From Polygenic to Omnigenic. Cell 169: 1177-1186Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Meth 12: 59-60Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Carlevaro-Fita J, Rahim A, Guigó R, Vardy LA, Johnson R (2016) Cytoplasmic long noncoding RNAs are frequently bound to anddegraded at ribosomes in human cells. RNA 22: 867-882

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Chekanova JA (2015) Long non-coding RNAs and their functions in plants. Current Opinion in Plant Biology 27: 207-216Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Chekanova JA, Gregory BD, Reverdatto SV, Chen H, Kumar R, Hooker T, Yazaki J, Li P, Skiba N, Peng Q, Alonso J, Brukhin V,Grossniklaus U, Ecker JR, Belostotsky DA (2007) Genome-Wide High-Resolution Mapping of Exosome Substrates Reveals HiddenFeatures in the Arabidopsis Transcriptome. Cell 131: 1340-1353

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Cifuentes-Rojas C, Kannan K, Tseng L, Shippen DE (2011) Two RNA subunits and POT1a are components of Arabidopsis telomerase.Proceedings of the National Academy of Sciences of the United States of America 108: 73-78

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Conesa A, Götz S, García-Gómez JM, Terol J, Talón M, Robles M (2005) Blast2GO: a universal tool for annotation, visualization andanalysis in functional genomics research. Bioinformatics 21: 3674-3676

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S, Tilgner H, Guernec G, Martin D, Merkel A, Knowles DG, Lagarde J, Veeravalli L,Ruan X, Ruan Y, Lassmann T, Carninci P, Brown JB, Lipovich L, Gonzalez JM, Thomas M, Davis CA, Shiekhattar R, Gingeras TR,Hubbard TJ, Notredame C, Harrow J, Guigó R (2012) The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their genestructure, evolution, and expression. Genome Research 22: 1775-1789

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Du J, Grant D, Tian Z, Nelson RT, Zhu L, Shoemaker RC, Ma J (2010) SoyTEdb: a comprehensive database of transposable elements in www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from

Copyright © 2017 American Society of Plant Biologists. All rights reserved.

the soybean genome. BMC Genomics 11: 113Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Fang C, Ma Y, Wu S, Liu Z, Wang Z, Yang R, Hu G, Zhou Z, Yu H, Zhang M, Pan Y, Zhou G, Ren H, Du W, Yan H, Wang Y, Han D, Shen Y,Liu S, Liu T, Zhang J, Qin H, Yuan J, Yuan X, Kong F, Liu B, Li J, Zhang Z, Wang G, Zhu B, Tian Z (2017) Genome-wide associationstudies dissect the genetic networks underlying agronomical traits in soybean. Genome Biology 18: 161

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Flynn RA, Chang HY (2014) Long noncoding RNAs in cell-fate programming and reprogramming. Cell Stem Cell 14: 752-761Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Franco-Zorrilla JM, Valli A, Todesco M, Mateos I, Puga MI, Rubio-Somoza I, Leyva A, Weigel D, García JA, Paz-Ares J (2007) Targetmimicry provides a new mechanism for regulation of microRNA activity. Nature Genetics 39: 1033-1037

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Gel B, Díez-Villanueva A, Serra E, Buschbeck M, Peinado MA, Malinverni R (2016) regioneR: an R/Bioconductor package for theassociation analysis of genomic regions based on permutation tests. Bioinformatics 32: 289-291

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Gong J, Liu W, Zhang J, Miao X, Guo A-Y (2015) lncRNASNP: a database of SNPs in lncRNAs and their potential functions in human andmouse. Nucleic Acids Research 43: D181-D186

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E,Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011) Trinity: reconstructinga full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology 29: 644-652

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O(2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31: 5654-5666

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Hao Z, Fan C, Cheng T, Su Y, Wei Q, Li G (2015) Genome-wide identification, characterization and evolutionary analysis of longintergenic noncoding rnas in cucumber. PLoS ONE 10

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Heo JB, Lee YS, Sung S (2013) Epigenetic regulation by long noncoding RNAs in plants. Chromosome Research 21: 685-693Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Heo JB, Sung S (2011) Vernalization-mediated epigenetic silencing by a long intronic noncoding RNA. Science (New York, N.Y.) 331:76-79

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Hezroni H, Koppstein D, Schwartz MG, Avrutin A, Bartel DP, Ulitsky I (2015) Principles of long noncoding RNA evolution derived fromdirect comparison of transcriptomes in 17 species. Cell reports 11: 1110-1122

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Hon C-C, Ramilowski JA, Harshbarger J, Bertin N, Rackham OJL, Gough J, Denisenko E, Schmeier S, Poulsen TM, Severin J, Lizio M,Kawaji H, Kasukawa T, Itoh M, Burroughs AM, Noma S, Djebali S, Alam T, Medvedeva YA, Testa AC, Lipovich L, Yip C-W, Abugessaisa I,Mendez M, Hasegawa A, Tang D, Lassmann T, Heutink P, Babina M, Wells CA, Kojima S, Nakamura Y, Suzuki H, Daub CO, de Hoon

www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

MJL, Arner E, Hayashizaki Y, Carninci P, Forrest ARR (2017) An atlas of human long non-coding RNAs with accurate 5′ ends. Nature543: 199

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Jin J, Liu J, Wang H, Wong L, Chua NH (2013) PLncDB: Plant long non-coding RNA database. Bioinformatics 29: 1068-1071Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermüller J, Hofacker IL, Bell I, Cheung E,Drenkow J, Dumais E, Patel S, Helt G, Ganesh M, Ghosh S, Piccolboni A, Sementchenko V, Tammana H, Gingeras TR (2007) RNA mapsreveal new RNA classes and a possible function for pervasive transcription. Science 316: 1484-1488

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Kapusta A, Kronenberg Z, Lynch VJ, Zhuo X, Ramsay L, Bourque G, Yandell M, Feschotte C (2013) Transposable Elements Are MajorContributors to the Origin, Diversification, and Regulation of Vertebrate Long Noncoding RNAs. PLoS Genetics 9: e1003470

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Khemka N, Singh VK, Garg R, Jain M (2016) Genome-wide analysis of long intergenic non-coding RNAs in chickpea and their potentialrole in flower development. Scientific Reports 6

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Kim D, Langmead B, Salzberg SL (2015) HISAT: a fast spliced aligner with low memory requirements. Nat Meth 12: 357-360Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Kornienko AE, Guenzl PM, Barlow DP, Pauler FM (2013) Gene regulation by the act of long non-coding RNA transcription. BMC Biology11

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Kryuchkova-Mostacci N, Robinson-Rechavi M (2017) A benchmark of gene expression tissue-specificity metrics. Briefings inBioinformatics 18: 205-214

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Lai F, Orom UA, Cesaroni M, Beringer M, Taatjes DJ, Blobel GA, Shiekhattar R (2013) Activating RNAs associate with Mediator toenhance chromatin architecture and transcription. Nature 494: 497-501

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Li L, Eichten SR, Shimizu R, Petsch K, Yeh CT, Wu W, Chettoor AM, Givan SA, Cole RA, Fowler JE, Evans MMS, Scanlon MJ, Yu J,Schnable PS, Timmermans MCP, Springer NM, Muehlbauer GJ (2014) Genome-wide discovery and characterization of maize long non-coding RNAs. Genome Biology 15

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Liao Q, Liu C, Yuan X, Kang S, Miao R, Xiao H, Zhao G, Luo H, Bu D, Zhao H, Skogerbø G, Wu Z, Zhao Y (2011) Large-scale prediction oflong non-coding RNA functions in a coding–non-coding gene co-expression network. Nucleic Acids Research 39: 3864-3878

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Liao Y, Smyth GK, Shi W (2014) featureCounts: an efficient general purpose program for assigning sequence reads to genomicfeatures. Bioinformatics 30: 923-930

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Liu J, Jung C, Xu J, Wang H, Deng S, Bernad L, Arenas-Huertero C, Chua N-H (2012) Genome-Wide Analysis Uncovers Regulation ofLong Intergenic Noncoding RNAs in Arabidopsis. The Plant Cell 24: 4333-4345

Pubmed: Author and Title www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

CrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Matzke MA, Kanno T, Matzke AJ (2014) RNA-Directed DNA Methylation: The Evolution of a Complex Epigenetic Pathway in FloweringPlants. Annu. Rev. Plant Biol

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Min XJ, Butler G, Storms R, Tsang A (2005) OrfPredictor: Predicting protein-coding regions in EST-derived sequences. Nucleic AcidsResearch 33: W677-W680

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Mohammadin S, Edger PP, Pires JC, Schranz ME (2015) Positionally-conserved but sequence-diverged: identification of long non-coding RNAs in the Brassicaceae and Cleomaceae. BMC Plant Biology 15: 217

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Nelson BR, Makarewich CA, Anderson DM, Winders BR, Troupes CD, Wu F, Reese AL, McAnally JR, Chen X, Kavalali ET, Cannon SC,Houser SR, Bassel-Duby R, Olson EN (2016) A peptide encoded by a transcript annotated as long noncoding RNA enhances SERCAactivity in muscle. Science 351: 271

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Niazi F, Valadkhan S (2012) Computational analysis of functional long noncoding RNAs reveals lack of peptide-coding capacity andparallels with 3′ UTRs. RNA 18: 825-843

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Pefanis E, Wang J, Rothschild G, Lim J, Kazadi D, Sun J, Federation A, Chao J, Elliott O, Liu ZP, Economides AN, Bradner JE, RabadanR, Basu U (2015) RNA exosome-regulated long non-coding RNA transcription controls super-enhancer activity. Cell 161: 774-789

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL (2015) StringTie enables improved reconstruction of atranscriptome from RNA-seq reads. Nat Biotech 33: 290-295

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Pfeil BE, Schlueter JA, Shoemaker RC, Doyle JJ, Page R (2005) Placing Paleopolyploidy in Relation to Taxon Divergence: APhylogenetic Analysis in Legumes Using 39 Gene Families. Systematic Biology 54: 441-454

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Rinn JL, Chang HY (2012) Genome regulation by long noncoding RNAs. In Annual Review of Biochemistry, Vol 81, pp 145-166Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Rošić S, Erhardt S (2016) No longer a nuisance: long non-coding RNAs join CENP-A in epigenetic centromere regulation. Cellular andMolecular Life Sciences 73: 1387-1398

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ruiz-Orera J, Messeguer X, Subirana JA, Alba MM (2014) Long non-coding RNAs as a source of new peptides. eLife 3: e03523Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC (2004) Mining EST databases to resolve evolutionaryevents in major crop species. Genome 47: 868-876

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y,Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell- www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from

Copyright © 2017 American Society of Plant Biologists. All rights reserved.

Griggs M, Abernathy B, Du J, Tian Z, Zhu L, Gill N, Joshi T, Libault M, Sethuraman A, Zhang X-C, Shinozaki K, Nguyen HT, Wing RA,Cregan P, Specht J, Grimwood J, Rokhsar D, Stacey G, Shoemaker RC, Jackson SA (2010) Genome sequence of the palaeopolyploidsoybean. Nature 463: 178-183

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Shen Y, Zhou Z, Wang Z, Li W, Fang C, Wu M, Ma Y, Liu T, Kong L-A, Peng D-L, Tian Z (2014) Global Dissection of Alternative Splicing inPaleopolyploid Soybean. The Plant Cell Online

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Shoemaker RC, Schlueter J, Doyle JJ (2006) Paleopolyploidy and gene duplication in soybean and other legumes. Current Opinion inPlant Biology 9: 104-109

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG(2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology7: 539-539

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Smith MA, Mattick JS (2017) Structural and Functional Annotation of Long Noncoding RNAs. In JM Keith, ed, Bioinformatics: Volume II:Structure, Function, and Applications. Springer New York, New York, NY, pp 65-85

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Sonah H, O'Donoughue L, Cober E, Rajcan I, Belzile F (2015) Identification of loci governing eight agronomic traits using a GBS-GWASapproach and validation by QTL mapping in soya bean. Plant Biotechnology Journal 13: 211-221

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Suyama M, Torrents D, Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codonalignments. Nucleic Acids Research 34: W609-W612

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Swiezewski S, Liu F, Magusin A, Dean C (2009) Cold-induced silencing by long antisense transcripts of an Arabidopsis Polycombtarget. Nature 462: 799-802

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Szcześniak MW, Rosikiewicz W, Makałowska I (2016) CANTATAdb: A collection of plant long non-coding RNAs. Plant and CellPhysiology 57: e8

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ulitsky I, Bartel DP (2013) LincRNAs: Genomics, evolution, and mechanisms. Cell 154: 26-46Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Ulitsky I, Shkumatava A, Jan CH, Sive H, Bartel DP (2011) Conserved Function of lincRNAs in Vertebrate Embryonic DevelopmentDespite Rapid Sequence Evolution. Cell 147: 1537-1550

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Van Werven FJ, Neuert G, Hendrick N, Lardenois A, Buratowski S, Van Oudenaarden A, Primig M, Amon A (2012) Transcription of twolong noncoding RNAs mediates mating-type control of gametogenesis in budding yeast. Cell 150: 1170-1181

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang H, Chung PJ, Liu J, Jang IC, Kean MJ, Xu J, Chua NH (2014) Genome-wide identification of long noncoding natural antisensetranscripts and their responses to light in Arabidopsis. Genome Research 24: 444-453 www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from

Copyright © 2017 American Society of Plant Biologists. All rights reserved.

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang H, Niu QW, Wu HW, Liu J, Ye J, Yu N, Chua NH (2015) Analysis of non-coding transcriptome in rice and maize uncovers roles ofconserved lncRNAs associated with agriculture traits. Plant Journal 84: 404-416

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang KC, Chang HY (2011) Molecular mechanisms of long noncoding RNAs. Molecular cell 43: 904-914Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang T-Z, Liu M, Zhao M-G, Chen R, Zhang W-H (2015) Identification and characterization of long non-coding RNAs involved in osmoticand salt stress in Medicago truncatula using genome-wide high-throughput sequencing. BMC Plant Biology 15: 131

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang Y, Fan X, Lin F, He G, Terzaghi W, Zhu D, Deng XW (2014) Arabidopsis noncoding RNA mediates control of photomorphogenesisby red light. Proceedings of the National Academy of Sciences of the United States of America 111: 10359-10364

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wang Y, Tang H, DeBarry JD, Tan X, Li J, Wang X, Lee T-h, Jin H, Marler B, Guo H, Kissinger JC, Paterson AH (2012) MCScanX: atoolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research 40: e49-e49

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wong CE, Singh MB, Bhalla PL (2013) The Dynamics of Soybean Leaf and Shoot Apical Meristem Transcriptome Undergoing FloralInitiation Process. PLoS ONE 8

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Wu HJ, Wang ZM, Wang M, Wang XJ (2013) Widespread long noncoding RNAs as endogenous target mimics for microRNAs in plants.Plant Physiology 161: 1875-1884

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, Lancet D,Shmueli O (2005) Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification.Bioinformatics 21: 650-659

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Yang Z (2007) PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution 24: 1586-1591Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Zhang J, Song Q, Cregan PB, Nelson RL, Wang X, Wu J, Jiang G-L (2015) Genome-wide association study for flowering time, maturitydates and plant height in early maturing soybean (Glycine max) germplasm. BMC Genomics 16: 217

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Zhang YC, Liao JY, Li ZY, Yu Y, Zhang JP, Li QF, Qu LH, Shu WS, Chen YQ (2014) Genome-wide screening and functional analysisidentify a large number of long noncoding RNAs involved in the sexual reproduction of rice. Genome biology 15: 512

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Zhou L, Wang S-B, Jian J, Geng Q-C, Wen J, Song Q, Wu Z, Li G-J, Liu Y-Q, Dunwell JM, Zhang J, Feng J-Y, Niu Y, Zhang L, Ren W-L,Zhang Y-M (2015) Identification of domestication-related loci associated with flowering time and seed size in soybean with the RAD-seqgenotyping method. Scientific Reports 5: 9350

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.

Zhou Z, Jiang Y, Wang Z, Gou Z, Lyu J, Li W, Yu Y, Shu L, Zhao Y, Ma Y, Fang C, Shen Y, Liu T, Li C, Li Q, Wu M, Wang M, Wu Y, Dong Y,Wan W, Wang X, Ding Z, Gao Y, Xiang H, Zhu B, Lee S-H, Wang W, Tian Z (2015) Resequencing 302 wild and cultivated accessionsidentifies genes related to domestication and improvement in soybean. Nat Biotech 33: 408-414

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

Zhu YL, Song QJ, Hyten DL, Van Tassell CP, Matukumalli LK, Grimm DR, Hyatt SM, Fickus EW, Young ND, Cregan PB (2003) Single-nucleotide polymorphisms in soybean. Genetics 163: 1123-1134

Pubmed: Author and TitleCrossRef: Author and TitleGoogle Scholar: Author Only Title Only Author and Title

www.plantphysiol.orgon July 8, 2018 - Published by Downloaded from Copyright © 2017 American Society of Plant Biologists. All rights reserved.