exploring genomic diversity in wild and cultivated barley using ......7.3.1 phenotype of bw370 and...

Exploring genomic diversity in wild and

cultivated barley using high throughput DNA

sequencing technology

By Cong Tan

This thesis was presented for the degree of Doctor of Philosophy of

Murdoch University

February 2018

I

Declaration

I declare that this thesis is my own account of my research and contains as its main

content work which has not previously been submitted for a degree at any tertiary

education institution.

26/02/2018

II

Acknowledgements

The completion of this thesis could not have succeeded without the support and help

from several institutions and numerous individuals.

My research was funded by a Murdoch University Strategic Scholarship. I thank

Murdoch University for providing the facilities, supporting my research and giving

me such an opportunity to pursue my research career.

I acknowledge and thank the Grains Research and Development Corporation for their

grant supporting my research work. Without this financial support it would have

been impossible to perform all the work in my thesis.

I wish to express my sincere gratitude to my supervisor Professor Chengdao Li, for

his unfailing encouragement, generosity and patient guidance. Without his support it

would not have been possible, for me, to overcome the challenges of pursuing my

research career in Australia. I am grateful to him for giving me the invaluable

opportunity to join the barley group (Western Barley Genetics Alliance) and

including me in the community of barley research. I will never forget the moments

when he patiently explained the details of my research which I needed to pay

attention to. He changed my life and brought me into the holy place of science.

I appreciate Professor Matthew Bellgard, Director, Centre for Comparative

Genomics for his support and guidance in bioinformatic analysis and providing

computing facilities for my research. I thank him for giving me the opportunity to

join the Centre for Comparative Genomics. I thank Dr Roberto Barrero, Dr Kathryn

III

Napier, Dr Brett Chapman and Dr Mark Black for their kind help and patient

guidance.

I appreciate the support and help from my colleague in the barley group including Dr

Qisen Zhang, Dr Gaofeng Zhou, Mrs Ruth Abbott, Ms Sharon Westcott, Dr Tefera

Angessa, Dr Hao Zhou, Dr Yong Jia, Dr Yong Han, Dr Camilla Hill. I thank Dr

Qisen Zhang for the time he took to correct my spoken English and revise my

manuscripts. To Dr Gaofeng Zhou thank you for your guidance during my research

and giving me a hand whenever I was in difficulties. To Ruth for helping me in

ordering “stuff” needed for my research, preparing documents for a visa application

and graduation. Thank you to Sharon, Tefera, Hao, Yong Jia and Yong Han for your

kindness and patience in teaching me experimental techniques and excellent

suggestion whenever I turned to you for help. Thanks Camilla for your suggestion

throughout my research and the time taken to revise my manuscripts.

Thank you to Professor Guoping Zhang, Professor Fei Dai, Dr Xiaolei Wang from

Zhejiang University for their collaboration and contribution to the RNA sequencing

of wild barley. I thank Professor Eviatar Nevo for providing seeds of wild barleys

collected from the “Evolution Canyon” and revising the manuscript of the chapter 3.

I thank Professor Zujun Yang, Guangrong Li, Bin Li and Tao lang from University

of Electronic Science and Technology of China for their contribution to the

experiment “Fluorescence in situ hybridization” in Chapter 3. Thank you to Beijing

Genomic Institute (Shenzhen) for their excellent service in DNA sequencing.

IV

I thank Murdoch University staff Mr Dale Banks, Ms Julie Blake, Ms Jacqueline

Dyer and Ms Amy Sheppard Moeen for their kind help and support during my

research.

Australia is the first foreign country that my wife and I have travelled to. It was a big

challenge to settle down and adapt to life abroad. It was Dr Xiao-Qi Zhang who put a

lot of time and effort into helping us find accommodation, buy living necessities and

settle in to the Australian way of life. We will never forget her compassionate care

like a mother, especially in the first several months. We were so moved when she

sent us warm dumplings and a pair of pillows once she knew we were having some

difficulties. In our first six months she drove us to buy our food every week and

sometimes taught us how to cook. Her kindness and care allowed us to quickly adapt

to life abroad and enable me to focus on my study and research.

Of course, this thesis could not have been undertaken without the strong support

from my family. I appreciate and am grateful to my parents for their love,

commitment and effort in supporting my studies and research for so many years.

To my wife P ing Zhou thank you for your unstinting support, love, care and being

there for me for during the past four years. For giving me such a lovely daughter

Yutong Tan and carrying almost all the responsibility in taking care of her when

things were difficult. To my daughter Yutong Tan thank you for bringing me such

happiness. Thank you to my sisters and young brother for their love and support

during my studies abroad.

V

Finally, I thank Ms Christine Davies for her for editing and proofreading the

literature review part of this thesis.

VI

Abstract

Barley (Hordeum vulgare L.) was domesticated from wild barley (Hordeum

spontaneum L.) around 10,000 years ago in the Fertile Crescent. It has been widely

used as a model organism to research plant adaptive evolutionary processes as well

as a vital genetic repository to explore plant abiotic and biotic stress tolerance. This

thesis used the recent barley genome sequence as a reference to characterise the

genomic evolution of wild barley, explore genomic diversity, and created a database

of barley genomic variation.

Wild barley accessions from opposite slopes of the ‘Evolution Canyon’ in Israel,

separated by only 250 m but with drastically different microclimates, were

sequenced at the transcriptome and whole-genome levels. Sympatric speciation was

identified between the wild barleys from the two slopes, and the distance between

them was even greater than that between cultivated barley and Tibetan wild barley.

About 243 Mb genomic regions were identified under environmental selection. The

number of upregulated genes from the south-facing slope wild barley was about eight

times that of the north-facing slope under drought stress.

To explore genomic diversity in cultivated barley germplasm from Australia , whole-

genome sequencing was conducted on eleven cultivated barley varieties. In total,

29,550,641 single nucleotide polymorphisms (SNPs) and 1,592,034 short insertions

and deletions (InDels) were identified. These SNPs and InDels can be used to

develop Kompetitive Allele Specific PCR (KASP) and InDel markers for genome-

assisted breeding.

VII

To facilitate access to and use of the genomic resources exploited in this study, a

database of barley genomic variations, BarleyVarDB, was created. This database

provides three main sets of information: (1) SNPs and InDels, (2) barley genome

reference and its annotation, and (3) 522,212 whole genome-wide simple sequence

repeats (SSRs).

Using the resources developed in this study, I characterised the mutation pattern of a

barley mutant derived from gamma radiation. The results indicate that mutations

induced by gamma radiation are not evenly distributed in the whole genome but

occur in several concentrated regions in chromosomes. Besides, I narrowed down the

candidate genes for high-tillering mutant phenotype from 139 to 15 by comparing the

sequence difference in the 10 Mb target region between a pair of near isogenic lines

using transcriptome sequencing.

VIII

List of abbreviations

Abbr. Full name

BAC Bacterial Artificial Chromosome

cDNA Complementary DNA

ChIP-seq Chromatin Immunoprecipitation Sequencing

DArT Diversity Arrays Technology

dNTP Deoxyribonucleoside Triphosphates

EC The Evolution Canyon, in Israel

EST Expressed Sequence Tag

FAO Food and Agriculture Organization

Fst Wright's Fixation Index

InDel Small Fragment Insertion and Deletion

KASP Kompetitive Allele Specific PCR

LD Linkage Disequilibrium

N50 the Shortest Sequence Length At 50% Of the Genome

NCBI National Center for Biotechnology Information

NFS European North-Facing Slope of Evolution Canyon

IX

Abbr. Full name

NGS Next-generation sequencing

NIL Near Isogenic Line

NT Nucleotide-NCBI

PCR Polymerase Chain Reaction

QTL Quantitative Trait Loci

RFLP Restriction Fragment Length Polymorphism

RIL Recombinant Inbred Line

RNA-seq RNA sequencing

SFS African South-Facing Slope of Evolution Canyon

SNP Single Nucleotide Polymorphism

SSR Simple Sequence Repeat

UN The United Nations

X

Contents

Declaration ........................................................................................................... I

Acknowledgements ............................................................................................. II

Abstract ............................................................................................................. VI

List of abbreviations ....................................................................................... VIII

List of figures ................................................................................................... XIV

List of tables .................................................................................................. XVII

Chapter 1 Introduction .................................................................................. 1

Chapter 2 Literature review .......................................................................... 4

2.1 Barley genomic research ..........................................................................5

2.1.1 Barley domestication and evolution ......................................................6

2.1.2 Constructing the barley reference genome ............................................7

2.1.3 Molecular markers and genotyping .......................................................9

2.1.4 Linkage mapping ............................................................................... 12

2.1.5 Association mapping.......................................................................... 14

2.2 DNA sequencing technology and application.......................................... 17

2.2.1 DNA sequencing technology .............................................................. 18

XI

2.2.2 Constructing reference genome .......................................................... 26

2.2.3 Detecting genetic variants .................................................................. 29

2.2.4 RNA-seq for analysing gene expression level ..................................... 30

2.2.5 ChIP-seq determining protein binding sites ......................................... 32

2.2.6 Bisulfite sequencing for detecting methylation .................................... 33

Chapter 3 Sympatric speciation and adaptive evolution of wild barley in

“Evolution Canyon” .......................................................................................... 36

3.1 Introduction........................................................................................... 38

3.2 Methods ................................................................................................ 41

3.3 Result.................................................................................................... 47

3.3.1 Dramatic divergence between wild barleys from SFS and NFS............ 47

3.3.2 Bigger genomic divergence between SFS and NFS wild barleys than that

between Tibetan wild barleys and domesticated barleys.................................. 49

3.3.3 Genomic regions under environmental selection of NFS and SFS ........ 56

3.3.4 Differential drought stress response of SFS and NFS wild barleys ....... 61

3.4 Discussion............................................................................................. 73

Chapter 4 Genomic diversity of cultivated barley ....................................... 81

4.1 Introduction........................................................................................... 82

XII

4.2 Methods ................................................................................................ 84

4.3 Results .................................................................................................. 88

4.3.1 Whole genome sequencing of eleven cultivated barleys ...................... 88

4.3.2 SNP and InDel variants in eleven cultivated barleys ............................ 90

4.3.3 Evolutionary relationship analysis ...................................................... 91

4.3.4 Genetic divergence between Australian cultivated barley and EC wild

barley 92

4.4 Discussion............................................................................................. 96

Chapter 5 BarleyVarDB, a database of barley genomic variation............... 99

5.1 Introduction......................................................................................... 101

5.2 Database construction .......................................................................... 103

5.3 Database description ............................................................................ 108

5.4 Future improvement ............................................................................ 113

Chapter 6 Characterization of genome-wide variations induced by gamma-

ray radiation in barley using RNAseq ............................................................. 114

6.1 Introduction......................................................................................... 116

6.2 Methods .............................................................................................. 119

6.3 Results ................................................................................................ 121

XIII

6.3.1 SNPs and InDels screening through transcriptome sequencing .......... 121

6.3.2 Genetic mutations induced by gamma ray radiation........................... 123

6.3.3 Genes affected by mutations............................................................. 124

6.4 Discussion........................................................................................... 128

Chapter 7 Characterization of a barley mutant with high-tillering and

narrow leaf 132

7.1 Introduction......................................................................................... 133

7.2 Methods .............................................................................................. 135

7.3 Result.................................................................................................. 139

7.3.1 Phenotype of BW370 and Bowman .................................................. 139

7.3.2 Narrow down candidate genes by RNAseq ....................................... 141

7.4 Discussion........................................................................................... 144

Reference ......................................................................................................... 146

Appendix 1 - Bioinformatic Pipeline ............................................................... 179

Appendix 2 - List of publications..................................................................... 187

Appendix 3- One manuscript........................................................................... 189

XIV

List of figures

Figure 1-1 Thesis structure......................................................................................2

Figure 2-1 Main content of literature review. ...........................................................4

Figure 2-2 DNA sequencing technology and application. ....................................... 18

Figure 2-3 Diagrams of chain termination sequencing. ........................................... 19

Figure 2-4 Illumina/Solexa sequencing (Mardis, 2008) .......................................... 21

Figure 2-5 PACBIO SMRT technology. ................................................................ 23

Figure 2-6 Distribution of read length in SMART sequencing (from PACBIO

website) ............................................................................................................... 23

Figure 2-7 Schematic diagram of DNA methylation sequencing (Tollefsbol, 2011). 34

Figure 3-1 Cross-section view and Sample collection sites of the “Evolution Canyon”

I (EC-I), Lower Nahal Oren Mount Carmel. .......................................................... 39

Figure 3-2 Diagram of methods and aims .............................................................. 41

Figure 3-3 Genomic diversity and evolution analysis based on RNA sequencing of

ten samples from two slopes of ‘’Evolution Canyon”. ............................................ 48

Figure 3-4 Genomic diversity and evolution related analysis based on whole genome

sequencing of ten samples from four different groups. ........................................... 53

Figure 3-5 Distribution of FST values across the whole genome. ............................. 57

XV

Figure 3-6 Selection sweep analysis in whole genome. .......................................... 58

Figure 3-7 Statistics of DEGs under drought stress treatment in SFS and NFS. ....... 62

Figure 3-8 Density of DEGs in each 10 Mb region along chromosomes under

drought stress treatment in SFS and NFS. .............................................................. 63

Figure 3-9 Over-represented and under-represented GO terms for differentially

expressed genes in SFS wild barley under drought stress........................................ 64


expressed genes in NFS wild barley under drought stress. ...................................... 65

Figure 3-11 Karyotypical comparison of wild barley NFS6.1 and SFS2.3 and

cultivated barley Golden Promise by FISH. ........................................................... 74

Figure 4-1 Evolutionary relationships of 26 barley accessions. ............................... 92

Figure 4-2 Genomic difference between Australian cultivated barleys and EC wild

barleys. ................................................................................................................ 94

Figure 4-3 GO enrichment of genes under environmental selection. ....................... 95

Figure 5-1 Framework, implementation, and date collection of BarleyVarDB. ...... 104

Figure 5-2 Web interfaces of retrieving SNPs, INDELs, SSRs and gene annotation in

BarleyVarDB. .................................................................................................... 109

Figure 5-3 Examples of software applications in BarleyVarDB. ........................... 112

XVI

Figure 6-1 Phenotype of Vla-WT and Vla-MT and distribution of mutations in

chromosome 5 and chromosome 7....................................................................... 122

Figure 6-2 Genetic effects of mutations and biological function of genes with mutant

genes.................................................................................................................. 126

Figure 7-1 The generation of BW370. ................................................................. 136

Figure 7-2 Phenotype of BW370 and Bowman. ................................................... 140

XVII

List of tables

Table 2-1 Assembly and annotation statistics of the barley genome (Mascher et al. ,

2017) .....................................................................................................................9

Table 2-2 Performance comparison of dominant sequencers................................... 24

Table 2-3 Genome assembly and gene prediction of major crops and model

organisms............................................................................................................. 27

Table 3-1 Output statistics of whole-genome shotgun sequencing of ten barley

accessions. ........................................................................................................... 51

Table 3-2 Assembly statistics of eight barley accessions. ....................................... 52

Table 3-3 Evolutionary distance for ten EC wild barleys using whole genome

shotgun sequencing .............................................................................................. 55

Table 3-4 Plant reproductive related homology genes in barley .............................. 60

Table 3-5 Summary of DEGs in SFS and NFS wild barley under drought treatment 62

Table 3-6 Homology of plant drought related genes in barley ................................. 67

Table 4-1 Sample and data information ................................................................. 85

Table 4-2 Summary of sequencing output and variants calling ............................... 89

Table 4-3 Pairwise comparison of seven Australian barley varieties ....................... 90

Table 6-1 Genetic mutation counts in each chromosome. ..................................... 124

XVIII

Table 7-1 Candidate genes and function description............................................. 141

1

Chapter 1 Introduction

Food security is one of the major global issues. The rapidly expanding demand for

agricultural products mainly comes from the growing population and changing diet.

Increasing food production is facing several challenges including less arable land,

climate change and water scarcity (Godfray et al. , 2010). Improving crop yield is

considered a key measure for ensuring food security. Increases in crop yield in the

last century have relied on fertiliser use, which has caused severe environmental

pollution including land degeneration. This century, the principal approach for

increasing crop yields is to breed competitive crop varieties with improved grain and

environmental stress tolerance. Barley (Hordeum vulgare L.) is one of the founder

crops from the Old World and the fourth most important cereal crop now. It was

domesticated from wild barley (Hordeum spontaneum L.) around 10,000 years ago in

the Fertile Crescent. Barley can grow in diverse environments extreme latitude like

the Tibetan P lateau. It has been widely used as a model organism to research plant

adaptive evolutionary processes as well as a vital genetic repository to explore plant

abiotic and biotic stress tolerance.

This thesis used the recent barley genome sequence as a reference to characterise

genomic evolution of wild barley, explore genomic diversity, and develop tools for

the management and implementation of barley genomic diversity datasets. This

thesis comprises a literature review (Chapter 2) and five experimental chapters

(Chapters 3–7) (Figure 1-1).

2

Figure 1-1 Thesis structure

3

This thesis aimed to:

1. Confirm whether sympatric speciation occurred between wild barleys from

the south-facing and north-facing slopes of the ‘Evolution Canyon’ in Israel

and identify genomic regions and potential genes under environmental

selection.

2. Exploit genomic diversity in eleven cultivated barley and identify

domestication signatures of cultivated barley from wild barley.

3. Create a barley genomic variation database for the barley research

community to access barley genome reference information and genomic

variations identified from wild barley and cultivated barley.

4. Characterise the pattern of mutations induced by gamma-ray radiation.

5. Narrow down the causal genes for the high tiller number mutant.

4

Chapter 2 Literature review

Figure 2-1 Main content of literature review.

5

2.1 Barley genomic research

Barley is the fourth most important crop in the world, both in terms of grain

production and cultivation area according to the Food and Agricultural Organization

(FAO) of the United Nations (www.fao.org/faostat). In Australia, barley is the

second most important crop, where it is widely planted from Western Australia to

southern Queensland. In 2013, about 7.47 million tonnes of barley grains were

harvested from four million hectares in Australia.

Barley provides abundant energy and nutrition for humans and animals. Barley grain

is mostly used as animal fodder, while malting barley is an important raw material

for production of beer and certain distilled beverages. The remainder is used as a

functional food for humans. Barley can grow in diverse environments and even

extreme climates such as the Tibetan Plateau, where barley grain is a staple human

food (Harlan and Zohary, 1966). Barley grains contain the eight essential amino

acids (valine, isoleucine, leucine, threonine, phenylalanine, tryptophan, methionine,

and lysine) that are not produced naturally by human body. Barley also contains

soluble fibre, which plays an essential role in our digestive, heart and skin health, and

may improve blood sugar control and weight management.

Barley is one of the founder crops of the Old World and was domesticated from wild

barley (Hordeum spontaneum L.) around 10,000 years ago in the Fertile Crescent

and can grow in diverse environments extreme latitude like the Tibetan Plateau

(Harlan and Zohary, 1966). As a result, barley has been widely used as a model

organism to research plant adaptive evolutionary processes as well as a vital genetic

6

repository to explore plant abiotic and biotic stress tolerance (International Barley

Genome Sequencing et al., 2012).

2.1.1 Barley domestication and evolution

Barley (Hordeum vulgare L. subsp. vulgare) is a member of the grass family. It is

one of the earliest domesticated cereal crops, which supplied essential energy for the

first human civilisation in the Near East Fertile Crescent about 10,000 years ago

(Harlan and Zohary, 1966). Recently, the study of an old beer recipe indicated that

barley grain was used to brewing beer as early as 5,000 years ago, as determined

from archaeological evidence including brewing tools and chemical residues found at

the Mijiaya site in China (Wang et al., 2016). Beer making was also assumed to play

an important role in promoting the spread of barley cultivation from Western Eurasia

to Eastern Asia including China.

Cultivated barley is derived from its progenitor of wild barley (Hordeum vulgare L.

subsp. spontaneum), which has been found in multiple regions including Turkey,

Israel, Jordan, Iraq and Afghanistan (Harlan and Zohary, 1966). Compared with

cultivated barley, wild barley features brittle rachises that increase the possibility of

propagation in a natural environment. However, the fragile rachis is not an ideal trait

for artificial selection because it reduces grain yields due to seed dispersal. Loss of

seed dispersal is assumed to be a crucial step in the process of crop domestication

(Fuller and Allaby, 2009; Pourkheirandish et al. , 2015). In barley, the transition from

shattering to non-shattering pods occurred in the mutation of two adjacent and

complementary genes Btr1 and Btr2 on chromosome 3H (Pourkheirandish et al. ,

2015; Takahashi and Hayashi, 1964). These genes were involved in the independent

7

evolution of landraces in eastern and western Asia, which strongly suggests the

presence of at least two centres of independent domestication.

The Near East Fertile Crescent is a well-recognised barley evolution centre. It

spreads from the Levant in the west to regions around the Euphrates and Tigris

Rivers in the east. The second domestication of barley occurred 1500–3000 km east

of the Fertile Crescent due to the haplotype frequency difference among geographic

regions at multiple loci (Morrell and Clegg, 2007). The second domestication mainly

contributes diversity from Central Asia to the Far East while the diversity at the first

centre contributes most of the diversity in Europe and America. Recently, Tibet was

identified as another domestication centre of cultivated barley after analysing the

genotypic division of worldwide cultivated barley and wild barley from the Near

East and Tibet (Dai et al. , 2012). A polyphyletic origin was also proposed after

cluster analysis of 186 barley accessions (including wild barley and barley landraces)

from worldwide locations with chloroplast DNA microsatellite markers (Molina-

Cano et al., 2005).

2.1.2 Constructing the barley reference genome

Challenges associated with constructing the barley reference genome include its

large genome size and the complexity of genomic content. Barley is diploid

(2n=2x=14) with an estimated genome size of 5.1 Gb, about 13 times that of rice and

1.5 times that of human. Up to 84% of the barley genome is accounted for repetitive

sequence structures such as retrotransposon (International Barley Genome

Sequencing et al., 2012). Given the above challenges, it is not possible to generate a

high-quality barley genome reference using whole-genome shotgun sequencing

8

strategy, thought sequencing technology and sequence assembling algorism has

made huge progress in the past decades.

In 2012, a barley physical map with a total length of 4.83 Gb was generated using

high-information-content fingerprinting together with contig assembly (International

Barley Genome Sequencing et al., 2012). The physical map represented up to 98% of

the barley genome. It comprised 9,265 contigs generated by assembling 571,000

BAC clones, which can be represented by a minimum tilling path of 67,000 BAC

clones. Sequencing 5,341 gene-containing BACs and 304,523 BAC-ends provided

the fundamental information to integrate the physical map with a high-density

genetic map, whole-genome shotgun sequencing result, and transcript sequences

(Ariyadasa et al., 2014; Mascher et al. , 2013a). Integrating the above information

resulted in a barley framework with 26,159 high-confidence genes. This barley

genome framework provided a limited information for functional gene isolation and

molecular breeding in barley (Nakamura et al., 2016; Pourkheirandish et al., 2015;

Russell et al., 2016; Sato et al., 2016).

A high-quality reference for barley genome was attained in 2017 (Beier et al., 2017;

Mascher et al., 2017) and involved the cooperation of barley genetic researchers

from Australia, China, Germany, UK, US, Denmark and so on. Based on the

previous barley genome framework, 87,075 BACs covering the minimum tilling path

were selected for hierarchical shotgun sequencing and de novo assembly (Munoz-

Amatriain et al. , 2015). Super-scaffolds were generated by overlapping the adjacent

BAC assembly, and assigned to chromosomes based on multifaceted information

including a barley physical map, barley genetic map, optical map, and chromosome

9

conformation capture sequencing (Ariyadasa et al., 2014; Mascher et al., 2013a). The

resulting chromosome-scale assembly of the barley genome is about 4.97 Gb (Table

2-1), which consists of 6,347 super-scaffolds with N50 of 1.9 Mb (Beier et al., 2017;

Mascher et al., 2017). 39,734 high-confidence genes were annotated, based on gene

model prediction and evidence of RNA-seq and sequence homology from other

species.

Table 2-1 Assembly and annotation statistics of the barley genome (Mascher et al.,

2017)

Assembly control argument Value

Number and cumulative length of sequenced BACs 87,075 (11.3 Gb)

Length of non-redundant sequence 4.79 Gb

Number of sequence contigs 466,070

BAC sequence contig N50 79 kb

Number and cumulative length of BAC super-

scaffolds 4,235 (4.58 Gb)

Number and cumulative length of singleton BACs 2,123 (205 Mb)

Super-scaffold N50 1.9 Mb

Sequence anchored to the POPSEQ genetic map 4.63 Gb (97%)

Sequence anchored to the Hi-C map 4.54 Gb (95%)

Number of annotated high-confidence genes 39,734

Annotated coding sequence 65.3 Mb (1.4%)

Annotated transposable elements 3.70 Gb (80.8%)

2.1.3 Molecular markers and genotyping

10

Molecular markers have been widely used to compare the genetic difference between

individuals and it is a fundamental technology of modern genetic research including

linkage analysis and association mapping. Molecular marker resources in barley are

classified into three types: traditional molecular markers, array-based molecular

markers, and sequencing-based molecular markers.

Traditional molecular markers include Restriction Fragment Length Polymorphism

(RFLP) and Simple Sequence Repeats (SSR). RFLP was the first generation of

molecular markers used to construct genetic maps in barley (Sherman et al. , 1995).

Genotyping by RFLP involves hybridisation and isotope labelling, which is labour

intensive and harmful to operators’ health. SSR markers are more sensitive than

RFLP, and its operation is relatively easy because it is PCR-based. Three sets of

barley SSR markers were developed using an original method involving the

construction of a small insert genomic DNA library, hybridisation using tandemly

repeated oligonucleotides, and sequencing of positive clones (Liu et al., 1996;

Ramsay et al., 2000; Struss and Plieske, 1998). With the accumulation of sequences,

especially expressed sequence tags (EST) in public databases, software was designed

to predict SSR motifs for SSR marker development in barley (Thiel et al., 2003). In

2007, a barley SSR consensus map comprising 775 markers was built by integrating

SSR markers from six genetic maps (Varshney et al. , 2007). However, genotyping by

SSR markers was limited due to it being low density, labour intensive and time-

consuming, such that it cannot meet the demand for high density and high throughput

when genotyping large numbers of individuals.

11

To circumvent the limitation of traditional molecular markers, Diversity Array

Technology (DArT) and SNP arrays were developed based on microarray

technology. They detect DNA variation at hundreds to thousands of locations in

parallel. DArT works by detecting whether sample fragments are present in genomic

representation (Jaccoud et al. , 2001). In barley, DArT was first used to genotype an

F2 population derived from a Steptoe × Morex cross and resulted in a genetic map

comprising 384 DArT markers (Wenzl et al. , 2004). In the following two years,

2,085 DArT markers were developed from ten biparental populations and integrated

into a consensus map of medium density with 850 traditional markers (Wenzl et al. ,

2006). While genotyping by DArT is more efficient than traditional molecular

markers, it is unable to meet the high density required for constructing a genetic map

and the high resolution of gene mapping.

Single nucleotide polymorphisms are the most abundant type of molecular marker.

With the accumulation of sequence information from diverse germplasm in barley,

SNP information was increasingly exploited. Close et al. (2009) identified about

22,000 SNPs from publicly available ESTs. The performance of 4,596 selected SNPs

was tested on the Illumina GoldenGate Oligonucleotide platform, and the minor

allele frequency was estimated for each SNP in a collection of ~200 barley

germplasm. 3,072 SNPs with good performance and proper allele frequency were

selected and incorporated into two sets of SNP arrays. One of these arrays, with

1,536 loci, was used to genotype 500 selected UK barley cultivars, and association

mapping was performed for 15 morphology traits based on the genotype dataset

(Cockram et al., 2010). Two years later, another 220 spring barley cultivars collected

worldwide were genotyped using the same SNP array for a genome-wide association

12

study (Pasam et al., 2012). Although SNP arrays are useful for determining

genotypes in populations with large sample numbers, it can only detect variations in

certain genomic regions. Genotyping by SNP and DArT arrays is efficient and

reproducible, but they need to be developed first.

Sequencing-based markers do not need to be developed before they are used for

genotyping. Whole-genome sequencing can detect almost all sequence variants in the

whole genome. However, it is too expensive to adopt a whole-genome sequencing

strategy to genotype a large population in barley. RAD sequencing (Ren et al., 2016;

Zhou et al., 2015) is the most-used approach for genotyping a large number of

individuals. RAD sequencing only sequences the regions around restriction enzyme

recognising sites, which can reduce the complexity of the genome. Zhou et al. (2015)

conducted RAD sequencing for a DH population with 96 individuals from a cv. AC

Metcalfe × cv. Baudin cross. The sequencing library was constructed from genomic

DNA digested with restriction enzyme EcoRI for each sample. Clean reads were

mapped to the assembly of barley cv. Morex for SNP calling of each sample. Finally,

a high-density genetic map was constructed based on the genotype data of 13,547

SNP markers in the population. Based on this genetic map, a major quantitative trait

locus (QTL) controlling plant height was identified on chromosome 3H, which

explained 44.5% of the phenotype variants.

2.1.4 Linkage mapping

Most agricultural traits of cereal crops are controlled by QTLs, such as flowering

time, plant height, grain number and weight. The identification and molecular

dissection of these QTLs will provide fundamental knowledge and genetic resources

13

for rational design and molecular breeding of new crop varieties with high yield,

environmental stress resistance and superior grain quality (Zeng et al., 2017).

Linkage mapping and association mapping are two major approaches widely adopted

to isolate QTLs of agricultural importance.

Linkage mapping is a method for identifying functional QTLs in bi-parent

populations using genetic maps, which can be constructed by calculating the rate of

recombination between each pair of molecular markers. The closer two genes are in

physical distance, the lower the rate of recombination. The advantage of map-based

cloning is that it does not require knowledge of the protein product of genes which

control the traits. In the past decades, linkage mapping has been adopted to identify

QTLs controlling diverse agronomic traits in barley, such as seed dormancy (Han et

al., 1996; Kleinhofs et al., 1993; Nakamura et al., 2016; Sato et al., 2016), nonbrittle

rachis (Komatsuda and Mano, 2002; Komatsuda et al., 2004; Pourkheirandish et al. ,

2015), and powdery mildew resistance (Buschges et al. , 1997; Hinze et al. , 1991). To

dissect the genetic basis of seed dormancy in barley, a genetic map comprising 295

traditional molecular markers was constructed using a double haploid line (DHL)

population with 150 individuals from a cv. Steptoe (high level of dormancy) × cv.

Morex (no dormancy) cross, from which four QTLs regulating seed dormancy (SD1,

SD2, SD3, SD4) were identified (Kleinhofs et al., 1993). SD1 and SD2 were further

validated and fine mapped using three F2 segregation populations whose parents

were selected from previous DHLs (Han et al., 1996). Eventually, SD1 was identified

to encode an alanine aminotransferase (AlaAT) (Sato et al., 2016), while the causal

gene SD2 encodes Mitogen-activated Protein Kinase Kinase 3 (MKK3) (Nakamura

et al., 2016). The nonbrittle rachis trait was formed in the domestication process of

14

cultivated barley from wild barley and two closely linked QTLs (Brt1 and Brt2) were

identified to co-control this phenotype using an F8 recombinant inbred line (RIL)

population with 215 individuals derived from a cv. Azumamugi (AZ) × cv. Kanto

Nakate Gold (KNG) cross genotyped using 272 markers (Komatsuda and Mano,

2002). The two QTLs were targeted within a 0.19 cM region on chromosome 3H

based on an F2 segregation population with 10,084 individuals, which covers a

genomic region of 403 kb (Pourkheirandish et al., 2015). More SNP markers were

developed in the candidate region, and two enlarged F2 segregation populations

(14,058 and 12,257 individuals) were created. According to recombination analysis,

Brt1 was delimited to a 1.2 kb interval with only one ORF (ORF1) and Brt2 in a 4.9

kb with three ORFs (ORF2, ORF3, ORF4). Sequence analysis of the three candidate

genes for brt2 between two parents revealed that an 11 bp deletion in ORF3 causes a

frameshift in cv. Azumamugi (Pourkheirandish et al., 2015).

2.1.5 Association mapping

Association mapping is a method for identifying functional QTLs within a natural

population based on linkage disequilibrium (LD) between functional loci and

molecular markers. LD refers to the non-random association of alleles of two genes

or markers. Compared with linkage mapping, association mapping has several major

advantages. Firstly, more alleles can be investigated in a natural population using

association mapping, compared with only two alleles in a bi-parent population using

linkage mapping. Secondly, recombination events in the natural population are more

abundant than that in biparental population because they have accumulated over

millions of generations. The level of the recombination is a major factor that

15

determines the resolution of gene mapping. Thirdly, it is relatively quick and easy to

collect natural populations than to create a crossing population, which is labour

intensive and time-consuming. Given the above advantages, association mapping is

considered a promising tool for dissecting the genetic basis of complex agronomic

traits in crops. With the advances in high-throughput genotyping by sequencing,

association mapping has been successfully used for functional QTL mapping in

plants over the last decade, including Arabidopsis (Atwell et al., 2010), maize

(Buckler et al., 2009), and rice (Huang et al., 2010).

In barley, association mapping has been used to identify functional QTLs related to

morphological traits (Cockram et al., 2010; Pasam et al., 2012; Wang et al. , 2012),

disease resistance (Richards et al., 2017; Roy et al., 2010; Sallam et al., 2017;

Turuspekov et al., 2016), drought tolerance (Wehner et al., 2015), and soil acid and

salinity tolerance (Fan et al., 2016; Zhou et al., 2016). To identify QTLs underlying

morphological traits, Cockram et al. (2010) conducted a genome-wide association

study with a natural population consisting of 500 UK barley cultivars. The

association panel was genotyped using 1,536 genome-wide SNPs, and 32

morphological traits were scored and used to scan for marker-trait associations.

Eighteen QTLs were identified and had significant associations with 15 traits. To

identify QTLs controlling net blotch resistance in barley, Richards et al. (2017)

collected 957 barley core germplasm genotyped with 6,525 SNPs; 16 unique

genomic loci had significant associations with net blotch disease resistance and five

had novel QTLs. Improving the drought tolerance of crops can reduce grain

production losses due to water shortages. To identify QTLs underlining drought

tolerance, Wehner et al. (2015) conducted a genome-wide association study with 156

16

winter barley germplasm. The association panel was genotyped using the Illumina 9

k iSelect SNP chip. Two major QTLs controlling drought stress and leaf senescence

were delimited on chromosome 2H and 5H. To dissect the genetic basis of barley

salinity stress tolerance, Fan et al. (2016) conducted a genome-wide association

study with 206 barley accessions collected worldwide which were genotyped using

408 DArT markers. One of 24 QTLs was commonly detected with significant

marker-trait association with salinity stress tolerance.

17

2.2 DNA sequencing technology and application

DNA is the main carrier of genetic information, and eventually determines the amino

acids of each protein in a biological system. DNA sequencing is a process to

determine the order of the four bases in one piece of DNA, being adenine (A),

thymine (T), cytosine (C) and guanine (G). The first attempt to characterise a DNA

sequence was conducted in bacteriophage λ, in which the nucleotide composition of

the 5’-terminated strand was determined (Wu and Kaiser, 1968). Wu (1970)

developed a DNA sequencing method based on a primer-extension strategy and

determined the DNA sequence of the cohesive ends of bacteriophage λ. This DNA

sequencing method was improved to become the well-known ‘first-generation DNA

sequencing’ (Maxam and Gilbert, 1977; Sanger and Coulson, 1975; Sanger et al. ,

1977). The past half-century has witnessed dramatic advances in both DNA

sequencing technology and its application to various aspects of life science. In this

part, I will start with comparing the advantages and disadvantages of the existing

major DNA sequencing platforms summarising the achievements of building the

reference genome, detecting genomic variation, and deciphering the gene expression

profile based on DNA sequencing ( 2-2).

18

Figure 2-2 DNA sequencing technology and application.

2.2.1 DNA sequencing technology

First-generation DNA sequencing also refers to chain-terminating sequencing or

capillary sequencing. In this method, the chemical analogues di-deoxynucleo-

tidetriphosphates (ddNTPs) of four types of deoxyribonucleotides (dNTPs) are used

to inhibit the elongation of DNA synthesis. Compared with dNTPs, ddNTPs lack the

3’ hydroxyl group that is required to form a phosphodiester bond between the DNA

terminal and the next dNTP ( 2-3 a). Four types of ddNTPs are labelled with

different fluorescent dyes and mixed into a DNA synthesis reaction at a certain

concentration. ddNTPs are incorporated randomly into the synthesis reaction, and

polynucleotides of different length with ddNTPs in their 3’ terminals are generated.

The resulting polynucleotides are analysed with capillary electrophoresis and a

fluorescent detector ( 2-3 b). ABI3730 is a widely used capillary sequencing machine

19

these years, with a sequencing length up to 1100 bp with base accuracy as high as

99.999%. First-generation sequencing technology is challenged by slow data

generation, being about 3.4 Mb per 24 h for each ABI3730 sequencer. Because only

one DNA template can be added to each reaction and 384 reactions can be run

simultaneously at each term of about 3 hours.

Figure 2-3 Diagrams of chain termination sequencing.

a 3’-5’-phosphodiester bond in the process of DNA synthesis; b primer extension

with ddNTPs and determining DNA sequence (Garrido-Cardenas et al., 2017).

20

The output from DNA sequencing has made a quantitative leap with the introduction

of massive parallel sequencing technology, designated ‘second-generation

sequencing (SGS)’. SGS has made it possible to sequence thousands of millions of

DNA templates simultaneously by spatial separation. While there are several

commercially available SGS platforms in the sequencing market, the Illumina

Genome Analyzer is the most popular one. Illumina sequencing uses a sequencing-

by-synthesis approach and reversible dye-terminator technology. The steps involved

in Illumina sequencing include preparing the sequencing library, attaching DNA

templates to the surface of flow cell channels, generating template clusters by bridge

amplification, incorporating one nucleotide labelled with fluorescence, imaging for

base calling, and removing the 3’ blocking group. At present, Illumina has a range of

sequencers available—including MiSeq, NextSeq, HiSeq and NovaSeq—

(https://www.illumina.com). NovaSeq can generate a maximum output (6000 Gb) in

about 44 h with up to 20 billion reads (paired-end 150 bp) in each run. That equates

to about 3272 Gb per day, or about one million times the speed of the AB370

capillary sequencer. SGS has dramatically increased the speed of DNA sequencing

and decreased the cost. The limitation of SGS is the short length of output reads,

which is challenging when assembling complex genomes with large portions of

repetitive sequences.

https://www.illumina.com/systems/sequencing-platforms.html

21

Figure 2-4 Illumina/Solexa sequencing (Mardis, 2008)

22

To overcome the challenge of read length in SGS, a new concept of single-molecular

real-time (SMRT) sequencing was proposed. Sequencing technologies based on this

concept are classified as ‘third-generation sequencing (TGS)’. In SMRT, DNA

polymerase is immobilised and deoxyribonucleoside triphosphates (dNTPs) are

labelled with four different fluorescence colours. The fluorescence of a dNTP can be

detected when it is recruited by the immobilised DNA polymerase to incorporate into

the end of a growing DNA strand ( 2-5). The challenge of TGS is that the sensitivity

of the signal of the recruited dNTP is weakened by the background signal from

dissociative dNTPs. This challenge has been circumvented by zero-mode waveguide

(ZMWs) technology invented by Pacific BioScience (PACBIO) ( 2-5). ZMWs

provide nanostructures for each DNA synthesis reaction in a separate volume so that

the background signal can be reduced and millions of reactions can be processed

simultaneously (Eid et al. , 2009). Because the DNA templates are PCR-free and the

process of DNA synthesis is uninterrupted, the DNA templates can be sequenced

from start to end. The length of sequences produced by SMRT sequencing can be as

long as 70 kb (Figure 2-6).

23

Figure 2-5 PACBIO SMRT technology.

A, the structure of ZMW; B, the process of signal detection in the process of dNTP

incorporation (Eid et al., 2009)

Figure 2-6 Distribution of read length in SMART sequencing (from PACBIO

website)

24

Table 2-2 Performance comparison of dominant sequencers

Method Read length Accuracy Reads per run Time per run

US dollar per 1

Mb

Comments

Sanger sequencing

()ABI3730

400–900 bp 99.90% 384 20 min to 3 h $2,400 Low throughput,

high quality

Hiseq serials

(Illumina)

50–300 bp 99.90% up to 3 billion 1 to 11 days $0.05–0.15 High throughput,

cheap, short reads

Ion semiconductor

(Ion Torrent) up to 600 bp 98% up to 80 million 2 h $1.00

High throughput, a

bit expensive

25

Method Read length Accuracy Reads per run Time per run

US dollar per 1

Mb

Comments

SMRT (PACBIO) 10–40 kb 87% 50,000 per

SMRT cell 30 min to 4 h $0.13–0.60

Long reads,

relatively cheap

Nanopore (Oxford) up to 500 kb 92–97% NA 1 min to 48 h $500–999 per

Flow Cell

Long reads,

convenient

26

2.2.2 Constructing reference genome

The high quality of the reference genome sequence is fundamental for genomic

research. According to NCBI/genome statistics, by the end of 2017, genomes from

34,605 organisms have been sequenced and assembled at different levels (contigs,

scaffolds or chromosomes), and these include 5,135 eukaryotes, 130,201 prokaryotes,

and 14,001 viruses. Within the eukaryotes, there are 2,777 fungi species, 1,298

animal species and 495 plant species. The genomes of major crops such as wheat

(International Wheat Genome Sequencing, 2014), maize (Jiao et al., 2017; Schnable

et al., 2009), rice (Goff et al. , 2002; International Rice Genome Sequencing, 2005),

barley (Dai et al. , 2018; Mascher et al., 2017), foxtail millet (Zhang et al., 2012),

sorghum (Paterson et al., 2009), rye (Bauer et al., 2017), rapeseed (Chalhoub et al. ,

2014), and soybean (Schmutz et al., 2010), and common model organisms such as

Arabidopsis (Arabidopsis Genome, 2000) and purple false brome (Vogel et al. , 2010),

have all been sequenced (Table 2-3).

27

Table 2-3 Genome assembly and gene prediction of major crops and model organisms

Common

name Scientific name Chromosomes

Genome size (Mb)

Assembly size (Mb)

Predicted gene number

Assemblies Reference

Arabidopsis Arabidopsis

thaliana 5 125 119 25,498 10 (Arabidopsis Genome, 2000)

Rice Oryza sativa 12 389 374 37,544 23 (International Rice Genome

Sequencing, 2005)

Sorghum Sorghum bicolor 10 730 709 27,640 4 (Paterson et al., 2009)

Maize Zea mays 10 2,300 2,135 32,540 10 (Jiao et al., 2017; Schnable et al.,

2009)

Purple false brome

Brachypodium distachyon

5 355 272 25,532 4 (Vogel et al., 2010)

Soybean Glycine max 20 1,100 978 46,430 2 (Schmutz et al., 2010)

Foxtail

millet Setaria italica 9 491 423 38,801 2 (Zhang et al., 2012)

Rapeseed Brassica napus 19 1,130 976 101,040 2 (Chalhoub et al., 2014)

Bread wheat Triticum aestivum 21 17,000 15,000 124,201 15 (International Wheat Genome

Sequencing, 2014)

Barley Hordeum vulgare 7 5,100 4,790 39,734 8 (Mascher et al., 2017)

Rye Secale cereale 7 7,900 2,800 27,784 2 (Bauer et al., 2017)

28

Most draft genomes were completed using one of the three major sequencing

methods—whole-genome shotgun sequencing (Goff et al. , 2002), map-based clone

sequencing (International Rice Genome Sequencing, 2005) or chromosome

conformation capture sequencing (Mascher et al. , 2017). Whole-genome shotgun

sequencing is a fast and efficient method to obtain a draft genome, whereby genomic

DNA is randomly fragmented and ligated with universal sequencing adapters before

being sequenced. Goff et al. (2002) completed the first draft genome of rice using

this method, with 42,109 contigs covering 389,809,244 bp assembled from 5.5

million reads, and about 40,000 genes predicted. Whole-genome shotgun sequencing

is an efficient method for obtaining the genome sequence of regions with low copy,

but it fails to assemble sequences in regions with a high copy of repetitive structures.

The rice reference genome was completed using the map-based clone sequence

method (International Rice Genome Sequencing, 2005). Firstly, a physical map was

established by fingerprinting BACs from nine independent genomic libraries together

with clone-end sequencing. Secondly, 3,453 representative clones selected according

to the physical map were randomly fragmented and sequenced with the shotgun

sequencing method to about 10-fold and assembled. Thirdly, assemblies of adjacent

clones were overlapped to generate super-scaffolds and the sequence gaps filled

using long-range PCR sequencing. Finally, the assembly of 12 chromosomes

generated a pseudomolecule genome with a total length of about 370 Mb, with only

36 estimated gaps.

29

Barley has a large genome size (estimated 5.1 Gb), about 13 times that of the rice

genome. Moreover, up to 80% of barley genome is accounted for by repetitive

structure sequences. Large genome size and complex genome content make it more

difficult to generate a high-quality genome assembly. The dramatic advance of DNA

sequencing and relevant molecular biology technologies, especially three-

dimensional chromatin conformation capture sequencing, offer new insight and

convenience for assembling the barley genome (Jin et al., 2013). A high-quality

barley genome was achieved in 2017 by the International Barley Sequencing

Consortium (Mascher et al., 2017). To overcome the challenge of abundant repetitive

elements in the barley genome, the most advanced technologies were used, including

hierarchical shotgun sequencing, optical mapping, and chromosome conformation

capture sequencing. Eventually, a pseudomolecule genome of barley was assembled

with a total length of 4.79 Gb, with 39,734 high-confidence genes predicted from the

pseudomolecule sequence with the supporting evidences including transcriptome

sequences and homology proteins which have been well-characterised and saved in

the public databases.

2.2.3 Detecting genetic variants

The relationship between genetic variants and phenotype are the major focus of

genetic research. High-throughput DNA sequencing makes it possible to detect

genetic variants in the whole genome. Genetic variants can be detected in spececial

genomic regions with different densities by choosing whole-genome sequencing

(International Barley Genome Sequenc ing et al., 2012; Zeng et al., 2015),

transcriptome sequencing (Dai et al., 2014), exon sequencing (Mascher et al. , 2013b;

30

Russell et al., 2016) or restriction site associated DNA sequencing (Zhou et al. ,

2015). To exploit the natural diversity in barley, four barley cultivars (Bowman, Igri,

Barke and Haruna Nijo) and one wild barley were sequenced at a depth of 5 to 25

folds using the whole-genome shotgun strategy. Clean reads for each accession were

mapped to the cv. Morex assembly, which identified about 15 million SNPs

(International Barley Genome Sequencing et al., 2012). Zeng et al. (2015) conducted

whole-genome sequencing on five Tibetan wild barleys and five Tibetan cultivated

barleys with an average coverage of 16.4-fold. About 36 million SNPs and 2.28

million short InDels were identified. Due to the large size of the barley genome,

detecting genomic diversity by whole-genome sequencing is not affordable for most

researchers with limited funding. Compared with whole-genome sequencing,

uncovering genetic diversity by transcriptome sequencing is more efficient. To study

the origin of cultivated barley, Dai et al. (2014) conducted transcriptome sequencing

of 21 barley accessions—12 wild barleys and nine cultivated barleys—which

generated about 48 Gb of clean data. After mapping these clean reads to the cv.

Morex assembly, 55,973 unique SNPs were identified. Exome sequencing is another

method that focuses on the gene coding region. Mascher et al. (2013b) developed a

barley exome capture array that can be used to enrich fragments in the coding region.

Russell et al. (2016) sequenced 267 barley landraces and wild accessions using the

exome sequencing strategy to study the environmental adaptation of barley, which

identified about 1.7 million SNPs and 143,872 short InDels.

2.2.4 RNA-seq for analysing gene expression level

31

RNA-seq refers to whole transcriptome sequencing, which provides a new strategy

for comparing RNA transcripts, alternative gene splices, changes in gene expression,

co-expression networks, and gene fusion. Prior to RNA-seq, gene expression changes

were usually analysed with gene expression microarrays, which is based on molecule

hybridisation technology. The experimental steps for preparing RNA sequencing

libraries are as follows: (1) total RNAs is extracted from the sample, which is then

treated with DNase I to avoid DNA contamination; (2) mRNA is enriched using

beads with Oligo (dT) complementary with poly-A at the 5’end of mRNA for

eukaryotes; (3) enriched mRNA fragments are fragmented into smaller pieces before

synthesising the first strand and second strands of cDNA; and (4) the double strand

cDNA is end-repaired and the sequencing adapter added to both ends before being

sequenced on next-generation sequencing platforms.

Discovering new transcripts is an important aim of RNA-seq for biologists. For

organisms with a complete genome sequence, transcripts can be assembled using the

genome sequence as a guide. In this approach, the first step is to align short reads of

RNA-seq with the genomic DNA sequence. The biggest challenge for aligning RNA-

seq reads with genomic DNA is spliced alignment. This is because genes in

eukaryotic genomes contain introns and mature mRNA does not have these introns.

To overcome this challenge, Tophat2, the latest RNA-seq alignment software,

divides this process into three steps (Kim et al., 2013). Tophat2 begins by aligning all

reads with the provided transcriptome with mismatch less than three bases. Then the

unmapped reads (from transcriptome) are aligned with the genome sequence and

reads spanning a single exon are kept. For multi-exon spanning reads, the exons are

32

split into 31 bp segments and aligned to the reference sequence. The resulting

alignments are taken as input files for transcript assembly using Cufflinks software.

Alternative splice events can be detected by SOAPsplice (version 1.10) with the

highest call rate (detected true junction number/true junction in total) and the lowest

false positive rate (detected false junction number/the number of junctions all

detected) compared with TopHat, SpliceMap and MapSplice (Huang et al., 2011a).

The general steps in this process are summarised as follows. The intact alignments

are those aligned perfectly to the reference, with the remainder referred to as

unmapped reads (IUM). IUMs are spliced into two exon segments of 30 bp and

realigned to the reference allowing no gaps and less than 5’ mismatch. The distance

between these two mapped positions should range from 50 bp to 50 Mb, and the

boundary of introns should follow the rule of GT-AG, GC-AG, AT-AC (Burset et al.,

2000). Based on the detected splice junctions, alternative splice events are identified

according to the definition.

2.2.5 ChIP-seq determining protein binding sites

The ChIP-seq strategy combines chromatin immunoprecipitation (ChIP) with next-

generation sequencing to determine whether the target protein interact with specific

genomic regions. ChIP is used to investigate the interaction between proteins and

DNA. Studying the way proteins interact with DNA will help to understand how the

target protein affects certain phenotypes. The main aim of ChIP-seq is to identify

specific DNA-binding sites.

33

The key step in ChIP-seq is to enrich the target DNA sequences which are bound to

the specific protein using ChIP. The workflow of ChIP is briefly as follows: (1)

DNA-protein complexes are formed in living cells or tissues and then sheared into

about 500 bp DNA fragments; (2) target DNA fragments cross-linked with proteins

are specially immunoprecipitated using protein-specific antibodies; (3) the bound

DNA fragments are released from DNA-protein complexes and purified; and (4) the

enriched target DNA fragments are used to prepare sequencing libraries that are

sequenced using next-generation sequencing technology.

To identify enhancer sequences which regulate spatial and temporal expression of

genes in humans and mice, the antibodies of an enhancer-associated protein p300

was used to enrich the DNA-p300 complex after cross-linking, chromatin extraction

and sheared by sonication and then sequenced on the Hiseq2000 platform (Visel et

al., 2009). Several thousand genomic binding sites were identified for p300 in mouse

midbrain, embryonic forebrain, and limb tissue. This was confirmed by testing the

sequences of 86 of these binding sites in a transgenic mouse assay (Visel et al.,

2009). Their study confirmed that ChIP-seq technology is an accurate approach for

identifying and isolating regulatory sequences such as enhancers.

2.2.6 Bisulfite sequencing for detecting methylation

DNA methylation is a typical epigenetic phenomenon, which complements the

determination of biological phenotype. Methylation in the promoters or within the

transcribed regions of the gene is highly correlated with gene transcription level

(Zhang et al., 2006b; Zilberman et al., 2007). The aim of DNA methylation

34

sequencing technology is to identify the methylation pattern of genomic DNA, which

is based on sodium bisulphite treatment.

Figure 2-7 Schematic diagram of DNA methylation sequencing (Tollefsbol, 2011).

Bisulphite treatment is the key point of methylation sequencing. Normal cytosine

residues are converted into uracil after DNA fragments are treated with bisulphite.

However, the bisulphite does not affect the cytosine residues with 5-methylation. It is

assumed that all unmethylated cytosines in the whole genome are converted to uracil

35

after being treated with bisulphite. As a result, it is possible to identify the

methylation status of CpG dinucleotides to sequencing bisulphite-treated genomic

DNA. Because methyl-cytosines appear to be cytosines while unmethyl-cytosines

appear to be thymines after PCR amplification in the process of NGS sequencing

(Frommer et al., 1992).

To study the DNA methylation pattern in Arabidopsis , a study was conducted to

generate a map of cytosine methylation at single-base-pair resolution using the

shotgun bisulphite sequencing approach (Zhang et al. , 2006b). The genomic DNA

was treated with sodium bisulphite to convert unmethylated cytosines to uracils and

then sequenced on the Illumina sequencing platform. After sequencing, about 20-fold

coverage of clean reads was produced with 1.7% CHH, 6.7% CHG and 24% CG

methylation in their experimental condition.

36

Chapter 3 Sympatric speciation and adaptive

evolution of wild barley in

“Evolution Canyon”

Abstract

Sympatric speciation is a highly contentious concept and it is under strong debate

whether natural selection can outperform the continuous homogenizing force of free

gene flow in a sympatric ecological condition and generate reproductive

incompatibility. Here, we provide substantial new genomic evidence for adaptation-

driven incipient sympatric ecological speciation of two wild barley populations

within 250 m from the ‘Evolution Canyon’ (EC), consisting of two abutting slopes

with drastically different microclimates. Phylogenetic relationship and population

structure analysis uncovered extraordinary genetic divergence between inter-slope

EC populations, considerably greater than between the Tibetan wild and cultivated

barleys. A genome-wide FST analysis revealed detailed information on the genomic

differentiation with a total of 243 Mb genomic regions under environmental

selection, representing ~5% of the barley reference genome. One prominent

‘chromosome island’, spanning a continuous region of around 200 Mb, was

identified on chromosome 3H under strong natural selection. A homologous gene

involving the indica-japonica rice speciation located in the selection sweep region

with unique amino acid substitution between the two populations. In-situ

hybridization results demonstrated potential chromosome structure changes in this

region. RNA-seq analysis demonstrated that the genotype from one slope has

37

evolved unique mechanisms in responding to heat and drought. Our study provides

comprehensive genomic evidence and new insights to understand the process of

incipient sympatric speciation of wild barleys.

38

3.1 Introduction

Wild barley (Hordeum spontaneum L.) is the ancestor of cultivated barley (Hordeum

vulgare L.) with a wide geographic distribution across highly diverse environments

throughout the Near East, from the Eastern Mediterranean coasts to Tibet (Zohary et

al., 2012). Genomic resources obtained from wild crop relatives have dramaticly

contributed to enhancing crop productivity (Hajjar and Hodgkin, 2007; Nevo, 2012)

and understanding the genetic basis of adaptation and speciation. The evolutionary

processes that drive the extraordinary adaptive genomic divergence in wild crops

hold great potential for crop improvement (Nevo, 2012; Nevo et al. , 1992; Tester and

Langridge, 2010).

The high genetic and environmental heterogeneity across the distribution range of

wild barley suggests local adaptation and genomic differentiation along macro and

micro environmental gradients. The ‘Evolution Canyon’ I (EC), located in Lower

Nahal Oren, Mount Carmel in Israel, is a microsite model for biodiversity evolution,

adaptation and incipient sympatric speciation across life (Nevo, 1995, 2006; Nevo,

2014a). Different biotic and abiotic stress pressures at EC make it well suited to

characterize the causes and driving forces of evolution in action (Nevo, 2012; Nevo,

2013). EC consists of two abutting slopes separated (on average) by a distance of

only 250 m: the temperate, forested, cool and humid, ‘European’ north-facing slope

(NFS) and the tropical, savannoid, hot and dry, ‘African’ south-facing slope (SFS).

The geology and macro climates are very similar on the opposing slopes, but they

differ sharply in their micro climates ( 3-1). The SFS features extreme levels of solar

radiation (200‒800% more than NFS) with higher daily temperatures and drought,

39

whereas NFS is characterized by a cooler climate with a higher relative humidity (1–

7%) than SFS (Nevo, 1995, 2012; Pavlícek et al., 2003). The sharp micro climatic

divergence between the abutting slopes (Pavlícek et al., 2003) has resulted in

contrasting populations of microbial, animal, fungi and plant species (Nevo, 2014a).

Figure 3-1 Cross-section view and Sample collection sites of the “Evolution

Canyon” I (EC-I), Lower Nahal Oren Mount Carmel.

a, the microclimate model of the “Evolution Canyon” indicating sunshine reception

difference of the opposing slopes. b, the North-facing slope (NFS) and the South-

40

facing slope (SFS) in the “Evolution Canyon”, with an average distance of about 250

meters. c, an air overview of the “Evolution Canyon”. d, Wild barley, Hordeum

spontaneum, growing on the Natufian terrace and cemetery of the Oren cave at

“Evolution Canyon”. (from Nevo, 2012 )

We investigated the genomic and transcriptomic signatures of microevolution of wild

barley from EC, collected from mid-stations on the abutting slopes along the North–

South microclimatic gradient of EC. We used high coverage (~40X) whole genome

sequencing, with a comparison to Tibetan wild as well as cultivated barley varieties.

This study highlights genetic variants (SNPs and short insertions/deletions) including

genes with significant mutations (missense, lost or gained start or stop codons)

identified using the map-based reference sequence of the barley genome (Mascher et

al., 2017) as a reference. Genome-wide RNA profiling identified differentially

expressed genes (DEG) between NFS and SFS. This study provides tangible

evidence of ecological stress as a force that generates and maintains local adaptive

patterns, leading to incipient sympatric adaptive ecological speciat ion, and linking

micro and macro-evolutionary processes in wild barley populations.

41

3.2 Methods

Figure 3-2 Diagram of methods and aims

Sample collection.

The two slopes of ‘Evolution Canyon’ (EC) have seven sampling stations ( 3-1). Ten

barley accessions used in this study for transcriptome sequencing were independent

42

accessions collected from SFS station #2 and NFS station #6. The ten barley

accessions used in this study for whole-genome sequencing were from four different

groups: Tibetan wild barley (W1 and X1), Australian and Canadian malting barley

cultivars (Baudin and AC Metcalfe), and wild barley from EC, Mount Carmel, Israel

(South Face Slope station SFS2.1, SFS2.2, SFS2.3 and North Face Slope station

NFS6.1, NFS7.1 and NFS7.2). The Tibetan wild barleys were collected on the

Tibetan Plateau by T.W. Xu (Xu, 1982). The two commercial barley cultivars

Baudin (Australia) and AC Metcalfe (Canada) are premium malting quality varieties.

RNA sequencing and differential expression gene analysis

Wild barley seeds were sown in pots filled with peat under well-watered conditions

in glasshouse with the normal grow condition. When the plants reached the third leaf

stage, watering was ceased for the drought treatment but continued f or the controls

for a further 80 days. Fully expanded leaves on both the control and drought plants

were sampled with three replicates and flash frozen in liquid nitrogen for RNA

extraction. Total RNA was extracted using TRIzol Reagent (Invitrogen, Carlsbad,

CA, USA) and purified using the RNeasy Mini Kit (Qiagen, Germantown, MD,

USA). 2×150 bp paired-end sequencing libraries were constructed and sequenced on

an Illumina HiSeq2000 platform (Illumina, San Diego, CA, USA) as described

previously (Dai et al., 2014).

From an initial 368,900,936 paired-end raw reads, 295,697,023 clean reads were

retained after filtering. Retained pair-ended reads were aligned against the up-to-date

version of the Morex barley pseudomolecule sequences using the open source Kmer

adaptive aligner Biokanga (version 3.8.7) using default parameters. A count matrix

43

was generated and loaded into R (version 3.3.1) for downstream statistical analysis.

Read counts were normalized using the model-based Trimmed Mean of M-values

(TMM) method of Robinson and Oshlack (Robinson and Oshlack, 2010), followed

by fitting of a negative binomial generalized linear model (GLR) accommodating the

genotype differences and effects induced by the drought treatment. The edgeR

package (Robinson et al. , 2010) (version 3.14.0) was used for transcript differential

expression analysis. A total of 21,122 transcripts were included in the differentially

expressed (DE) analysis with more than ten reads for at least three samples. A

likelihood ratio test for comparing the genotypes was conducted to obtain DE

transcripts, and the resulting P-values were adjusted for multiple testing using

Hochberg’s FDR adjustment approach (Benjamini and Hochberg, 1995). Transcripts

were considered highly differentially expressed when the adjusted P-value was

<0.05. Meanwhile, GO enrichment analysis was performed on the differentially

expressed genes. The gene subsets were compared to the high confidence gene set of

barley genome reference (Mascher et al., 2017) to determine enriched or depleted

gene ontology terms. For the enrichment analysis, the free and open-source R

package GOstats (Gentleman et al. , 2004) (hypergeometric test) was used with p-

value cutoff of 0.05.

Whole-genome sequencing.

Genomic DNA was extracted from the leaves of ten barley plants, one plant per

accession. The DNA was analyzed for quality and concentration using agarose gels

and an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, Santa Clara,

CA, USA). Genomic DNA was fragmented and then used to prepare Illumina paired-

44

end libraries, following the manufacturer’s instructions (Paired-End Sample

Preparation Guide, Illumina, 1005063). Insert fragment lengths of 500 bp, 700 bp

and 1 kb were generated. A final set of 48 libraries was then sequenced on the HiSeq

2000 platform in BGI-Shenzhen (Beijing Genomics Institute-Shenzhen, Shenzhen,

China).

Genome assembly

All reads were filtered for quality using Trimmomatic (Bolger et al., 2014), with a

window of 4 bp requiring an average >Q20 and minimum read length of 50 bp.

Genome assemblies for each of the eight accessions were performed at the Pawsey

Supercomputing Centre (Murdoch University, Australia) using 32 GB memory and

six cores, using SOAPdenovo2 (Luo et al., 2012) with k=51. The assembly was

assessed by the figure of N50 and the percentage of the contigs which could be

mapped to the barley reference genome using BLASTN with the similarity no less

than 85%.

Variant calling and validation.

The barley pseudo-molecular genome (Mascher et al., 2017) was used as the

reference to map the clean reads using BWA-MEM (Li and Durbin, 2010) with

default parameters. PCR Duplications were marked and removed using Picard

(version 1.129). Only those reads having unique mapping positions in reference

genome were remained and proceeded to detect genomic variation (SNPs and

InDels). Sequence variations (insertions, deletions and SNPs) were detected by

running three rounds of SAMTools (version 1.7) plus BCFTools (version 1.7) and

45

the Genome Analysis ToolKit (GATK,version 3.8) variants calling pipeline. Briefly,

the first round was performed with SAMTool plus BCFTools and filtered based on

the mapping quality and variants calling quality. The result of the first round was

used as the guide of realignment around potential InDel and variants calling for the

second round in GATK. The common variants detected by both SAMTools plus

BCFTools pipeline and GATK pipeline were used to guide variants calling in the

third round using GATK.

Population structure analysis.

The phylogenetic tree was constructed from the neighbour-joining method using

PHYLIP (version 3) based on evolutionary distance, which was computed with the

F84 model using the DNADIST utility in PHYLIP package. The tree file was

constructed and annotated using MEGA (Tamura et al., 2013) (version 6) and

TreeView (version 1.6.6). Principle component analysis was performed with the

EIGENSOFT program (version 6). Subpopulation components were estimated using

the ADMIXTURE program (version 1.23) with the K cluster ranging from 1 to 6.

Selective sweep analysis.

Selective sweep analysis is one of the main methods for detecting environmental or

natural selection signatures. Genomic regions under selective sweeps due to

environmental stresses on opposing slopes were measured using the fixation index

FST. Selective sweep analysis was conducted by measuring the patterns of allele

frequencies in each 1 Mb fragment along chromosomes using the whole-genome-

46

wide SNP variation data. Genomic regions with FST values above 0.89 were

considered under strong selective sweeps.

Function and pathway annotation.

Function and pathway annotation based on protein homology was performed using

an automatic functional annotation and classification tool (AutoFACT). Briefly,

coding sequences were aligned to NT database using program blastn and to protein

databases including NR, Swiss-Prot, KEGG using program blastx with the same

threshold. And then the function and pathway annotations of the best hits with a

biological informative description were assigned to query sequences.

47

3.3 Result

3.3.1 Dramatic divergence between wild barleys from SFS

and NFS

To investigate the genetic diversity of wild barleys in the ‘Evolution Canyon’ I (EC)

in Israel, transcriptome sequencing was performed with ten samples from SFS and

NFS ( 3-1) in this study. About 3.70 Gb clean sequencing data was obtained for each

wild barley accession on average. A total of 218,709 SNPs and 8,640 InDels detected

for these ten barley accessions in transcriptome level. SNP number for individuals

ranges from 92,556 to 119,675 and InDel from 4,174 to 5,407. The SNP frequency

distribution of ten wild barleys in each 1 Mb region along seven chromosomes were

showed in a circus plot ( 3-3 a). Overall, the SNP frequenc ies within distal regions

are higher than those in centromeric regions. The distribution patterns are similar for

wild barley accessions from each slope but different for accessions from opposing

slopes.

48

Figure 3-3 Genomic diversity and evolution analysis based on RNA sequencing of

ten samples from two slopes of ‘’Evolution Canyon”.

a, SNP frequency distribution is displayed as ten circular histograms representing ten

barley accessions from two slopes. From inner to outer, circle 1-10 indicates barley

accessions of NFS148, NFS146, NFS145, NFS144, NFS143, SFS125, SFS124,

SFS123, SFS122, SFS120. The height of bars indicates the SNP number (logarithm

with base of 10) in each 1 Mb genomic region. b, Neighbour-Joining tree

construction. c, Principle component analysis. This analysis was conducted based on

49

218,709 SNPs without miss ing data using EIGENSOFT. d, Population structure

analysis was conducted using admixture.

To characterize the genetic divergence of wild barleys from the opposing slopes in

the “Evolution Canyon”, neighbour-joining tree, principle component and population

structure analysis are conducted based on the SNP variants. From the neighbour-

joining tree ( 3-3b), five accessions from the SFS slopes were clearly separated from

the other five accessions from the NFS group. The evolutionary distances was

calculated using F84 model (Felsenstein, 1981) based on SNP variants of these ten

wild barley accessions (Table 3-1). The evolution distance among five accessions in

SFS are 0.14~0.23 and in NFS 0.18~0.16. However, the distance between SFS group

and NFS group is up to 1.03, which indicates the dramatic genetic divergence

between SFS and NFS wild barleys. Wild barleys from the opposing slopes were also

apparently split into two clusters according to the first eigenvector, which explains

up to 56% of genetic difference between SFS and NFS wild barleys ( 3-3 c). The

obvious genetic divergent of wild barley between the two slopes was also confirmed

by population structure analysis ( 3-3 d). When K value was set to two, five wild

barleys from SFS were inferred to originate from one ancestor while five NFS wild

barleys were from another ancestor.

3.3.2 Bigger genomic divergence between SFS and NFS wild

barleys than that between Tibetan wild barleys and

domesticated barleys.

50

To further investigate the genomic divergence of wild barleys from the opposing

slopes, we performed high coverage whole genome sequencing with another ten

barley accessions. They represent four barley groups: six wild barley accessions

collected from the opposing slopes of the ‘Evolution Canyon’ (SFS and NFS), two

wild barley accessions from the Tibetan Plateau (TB1, TB2) and two cultivated

barleys (DB1 and DB2) (Table 3-1). About 190 Gb of clean data was obtained for

each barley accession on average, which corresponded to about 39.4× genome

coverage of the recently published barley genome reference (Mascher et al., 2017).

De novo assembly was performed for eight out of the ten accessions with the

assembly length ranging from 3.63 to 4.54 Gb and N50 ranging from 7.18 to 10.89

Kb (Table 3-2). More than 99% of clean reads were mapped to the barley reference

genome (Mascher et al., 2017). Up to 84% of the reference sequence positions were

sequenced at least five times. Based solely on reads with unique mapping position in

the reference genome, 63,565,829 single nucleotide variants (SNPs) and 2,673,996

short insertions/deletions (InDels) were identified for ten barley accessions in total.

The distribution of SNP in each 1 Mb region along chromosomes was plotted in

3-4a. In contrast to the RNA sequencing data, the SNPs were evenly distributed in

the wild barley genomes from EC. However, loss of genetic diversity in some

chromosome regions were clearly evident in the cultivated barleys and even in

Tibetan wild barleys.

51

Table 3-1 Output statistics of whole-genome shotgun sequencing of ten barley accessions.

NCBI-ID Sample Describe Reads Read length (bp) Clean Data (nt) Coverage

SAMN05207298 DB1 Canadian cultivar barley 2,025,830,279 95 192,453,876,505 39.81

SAMN05207297 DB2 Australia cultivar barley 2,083,773,587 95 197,958,490,765 40.95

SAMN05207291 TB1 Tibetan wild barley 2,107,917,374 95 200,252,150,530 41.42

SAMN05207292 TB2 Tibetan wild barley 2,003,555,938 95 190,337,814,110 39.38

SAMN05207293 SFS2.1 SFS wild barley 2,025,527,964 95 192,425,156,580 39.81

SAMN05207294 SFS2.2 SFS wild barley 1,902,396,508 100 190,239,650,800 39.36

SAMN08038525 SFS2.3 SFS wild barley 721,898,538 150 108,284,780,700 22.36

SAMN08038526 NFS6.1 NFS wild barley 787,729,618 150 118,189,116,000 24.43

SAMN05207295 NFS7.1 NFS wild barley 1,668,485,369 95 158,506,110,055 32.79

SAMN05207296 NFS7.2 NFS wild barley 2,016,099,086 100 201,609,908,600 41.71

52

Table 3-2 Assembly statistics of eight barley accessions.

Variety

Genome

length (Gbps)

GC (%)

Total number

scaffolds (>= 100 bp)

Number scaffolds

Number

scaffolds in N50 set

Minimum

scaffold length

N50

Maximum

scaffold length

DB1 4.53 43.8 7,237,863 353,225 63,409 1,000 7,184 85,370

DB2 4.54 44.3 4,714,325 366,546 64,118 1,000 8,288 141,105

SFS2.1 4.49 44.4 4,818,935 352,645 60,957 1,000 8,764 113,245

SFS2.2 3.63 44.2 5,063,088 351,782 58,896 1,000 9,499 113,496

NFS7.1 4.47 44.3 5,138,819 354,303 62,415 1,000 8,374 132,282

NFS7.2 3.66 44.3 4,353,671 357,168 56,429 1,000 10,885 149,690

TB1 4.49 44.2 5,042,536 362,319 64,848 1,000 8,020 94,844

TB2 4.42 44.2 4,950,063 320,524 52,623 1000 10,199 145,095

53

Figure 3-4 Genomic diversity and evolution analysis based on whole genome

sequencing of ten samples from four different groups.

a, SNP frequency distribution is displayed as ten circular histograms representing

ten barley accessions. From inner to outer, circle 1-10 indicates barley accessions of

DB2, DB1, TB2, TB1, SFS2.2, SFS2.1, SFS2.3, NFS6.1, NFS7.2, NFS7.1. The

height of bars indicates the SNP number (logarithm to the base 10) in each 1 Mb

genomic region; b, Neighbour-Joining tree construction. It was conducted using

Phylip based on the sequence distance, which was calculated based on 29429 SNPs

evenly distributed in each chromosome. c, Principle component analysis. This

analysis was conducted based on 54,255,907 SNPs without missing data using

EIGENSOFT. d, Population structure analysis was conducted using admixture

54

Based on the genome-wide SNPs, neighbour-joining tree, principle component and

population structure were conducted to compare the evolution distance of wild barley

between the apposing slopes and that between Tibetan wild barley and cultivated

barley. In the neighbour-joining tree ( 3-4 b), six wild barley from the “Evolution

Canyon” were separated from both cultivated barley and Tibetan wild barley. The

evolutionary distance between SFS and NFS wild barleys is about 0.60, which is

about twice that between Tibetan wild barley and cultivated barley (~ 0.30) (Table

3-3). In the plot of principle component analysis ( 3-4c), six EC wild barleys were

separated from Tibetan wild barley and cultivated barley in the first eigenvector (~

25%). Three SFS wild barleys were separated from three NFS wild barleys in the

second eigenvector (~ 21%). Both neighbour-joining tree and the PCAs revealed a

distinct separation of the NFS and SFS wild barley populations, in line with the

sharply divergent microclimates of the two slopes at EC. To estimate the individual

evolutionary ancestry and admixture proportions, a population genetic structure

analysis with a K-value from 2 to 4 was conducted ( 3-4d). When K was set to 3,

NFS and SFS wild barleys are indicated to be derived from two different ancestors

while domesticated barley and Tibetan wild barleys are from the third ancestor.

When K was set to 4, four groups of barleys are assumed to be derived from four

different ancestors. This result also shows that the evolutionary relationship between

SFS and NFS wild barleys are much farther than that between Tibetan wild barley

and cultivated barley.

55

Table 3-3 Evolutionary distance for ten EC wild barleys using whole genome shotgun sequencing

sample DB1 DB2 TB1 TB2 SFS2.1 SFS2.2 SFS2.3 NFS6.1 NFS7.1 NFS7.2

DB1 0.00

DB2 0.06 0.00

TB1 0.28 0.30 0.00

TB2 0.29 0.31 0.39 0.00

SFS2.1 0.58 0.60 0.63 0.65 0.00

SFS2.2 0.57 0.59 0.63 0.60 0.40 0.00

SFS2.3 0.55 0.57 0.61 0.63 0.39 0.36 0.00

NFS6.1 0.57 0.59 0.57 0.60 0.61 0.55 0.50 0.00

NFS7.1 0.57 0.59 0.56 0.57 0.56 0.52 0.64 0.43 0.00

NFS7.2 0.58 0.60 0.57 0.58 0.57 0.52 0.65 0.43 0.02 0.00

56

3.3.3 Genomic regions under environmental selection of NFS

and SFS

To uncover the genetic basis of the remarkable genetic divergence and to infer

genomic regions subjected to environmental selection, Wright’s F-statistic FST was

calculated between SFS and NFS wild barley groups. The frequency distribution of

FST value between SFS and NFS wild barleys was featured by a multimodal

distribution. The distribution of the windowed FST value between NFS and SFS wild

barley along each chromosome is shown in 3-6. The total length of genomic regions

with FST values above the threshold of 0.89 was about 243 Mb, which contains 919

high confidence genes. A continuous genomic region of about 200 Mb in the central

part of chromosome 3H showed strong environmental selection.

57

Figure 3-5 Distribution of FST values across the whole genome.

FST value was calculated in each 1 Mb region with steps of 250 Kbs. The x-axis

indicates the value of FST, and y-axis shows the FST value frequency.

58

Figure 3-6 Selection sweep analysis in whole genome.

It was conducted based on on 54,255,907 SNPs from whole genome sequencing

without missing data using VCFtools. Red horizontal line indicates the threshold of

95%, which is the Fst value of 0.89.

To further investigating whether the environmental selection resulted in speciation

between the SFS and NFS wild barleys, we searched the candidate genes responsible

for reproductive isolation in several plant species including crops (Bikard et al. ,

59

2009; Chen et al., 2008; Long et al., 2008; Mizuta et al., 2010; Yamagata et al.,

2010; Yang et al., 2012). Using BLASTP, with a threshold E-value ≤2.00E-56 and

aligned amino acids length >100, we identified 22 homologous genes for

reproductive isolation in the barley genome. 88 unique SNPs or InDels were found

between the NFS and SFS genotypes in twelve genes related to reproductive

isolation (Table 3-4). Five of these genes were in the above selective sweep regions

(HORVU4Hr1G084220, HORVU3Hr1G052250, HORVU1Hr1G060040,

HORVU3Hr1G107970, and HORVU3Hr1G107990). Notably, two genes

(HORVU3Hr1G052250, HORVU1Hr1G060040) mapped in the selection region

showed unique variants between SFS and NFS populations. A gene homologous to

the rice hybrid male sterility locus SaF (HORVU3Hr1G052250), resulted in indica-

japonica speciation through a two-gene/three-components interaction model (Long et

al., 2008), was located in the selection sweep signature region of chromosome 3H (

3-6). A substitution of Phe287Ser between SaF+ and SaF- impairs biological

function and results in male sterility in the indica-japonica hybrid. It is interesting to

observe a substitution of Thr179Ala between the SFS and NFS genotypes. Whether

this substitution impacts reproduction between interslope wild barley populations

remains to be validated.

60

Table 3-4 Plant reproductive related homology genes in barley

Speciation Related Genes AA-len Target Target Length Aligned-Length Identity E-value Mismatch Gaps

HBD2 463 HORVU6Hr1G058230.9 418 431 88.17 0 35 16

L27 145 HORVU2Hr1G066830.8 146 145 84.83 9.00E-90 22 0

HPA1 417 HORVU6Hr1G069390.6 263 260 74.62 5.00E-145 66 0

S5-3 470 HORVU7Hr1G107190.2 455 326 50.61 7.00E-77 131 30

S5 357 HORVU4Hr1G084220.1 574 364 47.8 2.00E-106 168 22

CF2 344 HORVU7Hr1G044230.3 341 313 46.96 1.00E-95 156 10

CF2 344 HORVU6Hr1G084520.5 340 342 45.91 2.00E-98 172 13

CF2 344 HORVU1Hr1G060040.1 342 342 45.91 2.00E-102 172 13

TTG2 429 HORVU5Hr1G028340.3 356 256 45.7 1.00E-61 90 49

CF2 344 HORVU2Hr1G009150.1 342 339 45.13 5.00E-99 172 14

CF2 344 HORVU4Hr1G006310.1 346 344 43.6 9.00E-97 184 10

TTG2 429 HORVU2Hr1G036320.4 553 261 43.3 2.00E-56 116 32

TTG2 429 HORVU3Hr1G088200.6 405 319 41.38 3.00E-57 123 64

S5-3 470 HORVU5Hr1G078400.1 672 391 39.64 1.00E-75 213 23

S5-3 470 HORVU5Hr1G014490.1 430 421 39.43 7.00E-76 193 62

S5-3 470 HORVU6Hr1G008880.6 557 402 38.81 4.00E-76 221 25

SaF+ 476 HORVU4Hr1G021110.1 512 397 38.04 2.00E-71 235 11

SaF+ 476 HORVU6Hr1G025790.1 544 487 37.37 1.00E-79 260 45

SaF+ 476 HORVU3Hr1G052250.6 503 472 37.29 2.00E-84 266 30

Tcr:Smoke1 504 HORVU1Hr1G081310.3 359 319 36.05 3.00E-59 196 8



61

3.3.4 Differential drought stress response of SFS and NFS

wild barleys

Drought is a key climatic stress that distinguishes SFS and NFS. To compare the

drought stress response of wild barleys from the two slopes, transcriptome

sequencing was performed for SFS and NFS wild barleys with treatment to simulate

drought stress in the SFS. About 44.40 Gb clean data was generated for twelve

samples including SFS and NFS wild barleys with two conditions and three

replications. Differential gene expression was analysed for SFS and NFS wild barley

under drought stress and the overview information is given in Table 3-5. 274 genes

were commonly detected to be differentially expressed in both SFS and NFS wild

barley under drought. 2,791 genes were differentially expressed only in SFS wild

barley under drought stress whereas 1,030 differentially expressed genes only in NFS

wild barley (Figure 3-7). The density of uniquely differential expression genes in

each 10 Mb region along seven chromosomes were showed in Figure 3-8. We have

tried to plot the normal distribution and found that the gene density distribution of

DEGs is more obvious than the normal distribution. GO enrichment analysis was

conducted for DEGs in SFS and NFS wild barleys under drought stress (Figure 3-9).

Over-represented GO terms DEGs in SFS wild barley includes translational

elongation, ribosome biogenesis, organic acid metabolic process, lipid metabolic

process and so on (Figure 3-9). The over-represented GO terms in NFS wild barley

are different with that in SFS. For example, lipid metabolic process is over-

represented in the DEGs of SFS wild barley, while it is under-represented in the

62

DEGs of NFS wild barley. The over-represented GO terms in NFS wild barley

includes amine metabolic process, tryptophan metabolic process and so on.

Table 3-5 Summary of DEGs in SFS and NFS wild barley under drought treatment

Treatments Total DEG Up-regulation Down-regulation

SFS 2791 1607 1184

NFS 1030 212 818

overlap 274 129 145

Figure 3-7 Statistics of DEGs under drought stress treatment in SFS and NFS.

Yellow circle indicates numbers of DEGs in SFS, blue for DEGs in NFS and overlap

between them is the DEGs detected in both SFS and NFS.

63

Figure 3-8 Density of DEGs in each 10 Mb region along chromosomes under

drought stress treatment in SFS and NFS.

Y-axis indicates the differential gene number in each 10 Mb region and X-axis

indicates physical position. Blue dots represent DEGs detected only in SFS under

drought stress treatment and red represent DEGs in NFS and green for DEGs

detected in both SFS and NFS

64


expressed genes in SFS wild barley under drought stress.

65


expressed genes in NFS wild barley under drought stress.

66

In theory, DEGs response to drought treatment from the SFS is more likely to play

important roles in enhancing the drought tolerance. Thus, we compared the protein

similarity between genes proved to be responsible for drought tolerance in nine

species (Alter et al. , 2015) and drought-treated DEGs in SFS wild barley but not

NFS. Particularly, we have found 46 of them having homologous genes with

experimental evidences to control plant drought resistance (Table 3-6). For example,

the protein sequence identity between HORVU2Hr1G060350 and CIPK3 is up to

64% with the coverage of 95% (425/446 peptides). CIPK3 was reported to regulate

abscisic acid and cold signal transduction as a calcium sensor-associated protein

kinase in Arabidopsis (Kim et al., 2003). It was also demonstrated that the expression

of CIPK3 is responsive to ABA and stress conditions and the expression pattern of a

number of stress gene markers were altered by the disruption of CIPK3. Another

drought responsive differential expression gene HORVU2Hr1G075470 in SFS was

detected to have 72% protein sequence identity with 92% coverage with SRK2C

gene. The SRK2C was reported to improve drought tolerance by controlling the

expression level of stress-responsive genes in Arabidopsis (Umezawa et al., 2004).

67

Table 3-6 Homology of plant drought related genes in barley

Index DEGs-SFS drought-genes gene / ID / organism function

1 HORVU1Hr1G004150 LOC_Os01g42860 OCPI1 / LOC_Os01g42860 / Oryza

sativa Oryza sativa chymotrypsin

inhibitor-like 1

2 HORVU1Hr1G020690 Solyc11g018800 PO2 / Solyc11g018800 / Capsicum

annuum extracellular peroxidase 2

3 HORVU1Hr1G049210 AT1G10370 GSTU17 / AT1G10370 / Arabidopsis

thaliana glutathion s-transferase U17



5 HORVU1Hr1G063560 AT2G18960 OST2 / AT2G18960 / Arabidopsis

thaliana

plasma membrane proton

ATPase

6 HORVU2Hr1G009580 AT2G47800 MRP4 / AT2G47800 / Arabidopsis

thaliana multidrug resistance-associated

protein, ABC transporter

7 HORVU2Hr1G009940 Solyc01g111510 APX / Solyc01g111510 / Populus

tomentosa ascorbate peroxidase

68


8 HORVU2Hr1G010990 AT3G53420 BnPIP1 / AT3G53420 / Brassica

napus PIP

9 HORVU2Hr1G018260 LOC_Os11g02240 CIPK15 / LOC_Os11g02240 / Oryza

sativa calcineurin B-like protein-interacting protein kinase






sativa

calcineurin B-like protein-

interacting protein kinase

13 HORVU2Hr1G075470 AT1G78290 SRK2C / AT1G78290 / Arabidopsis

thaliana

SNF1-related protein kinase 2,

osmotic-stress-activated protein kinase

14 HORVU2Hr1G078570 Solyc00g015750 ALR / Solyc00g015750 / Medicago

sativa aldose/aldehyde reductase

15 HORVU2Hr1G079920 AT1G15820 LHCB6 / AT1G15820 / Arabidopsis

thaliana light harvesting chlorophyll a/b

binding protein

69



thaliana multidrug resistance-associated


17 HORVU2Hr1G111760 LOC_Os04g56130 GbRLK / LOC_Os04g56130 /

Gossypium barbadense Receptor-like kinase

18 HORVU3Hr1G007410 Solyc00g015750 ALR / Solyc00g015750 / Medicago

sativa aldose/aldehyde reductase

19 HORVU3Hr1G023220 AT5G49890 CLCc / AT5G49890 / Arabidopsis

thaliana chloride channel


sativa

calcineurin B-like protein-

interacting protein kinase







70






26 HORVU4Hr1G055220 AT1G01360 PYL9/RCAR1 / AT1G01360 /

Arabidopsis thaliana

soluble ABA receptor interacts with and regulates PP2Cs ABI1

and ABI2

27 HORVU4Hr1G080820 MLOC_77777 HVA1 / MLOC_77777 / Hordeum

vulgare group 3 late embryogenesis

abundant protein

28 HORVU5Hr1G000780 AT2G18960 OST2 / AT2G18960 / Arabidopsis

thaliana

plasma membrane proton

ATPase

29 HORVU5Hr1G030810 LOC_Os03g03660 CDPK7 / LOC_Os03g03660 / Oryza

sativa calcium-dependent protein

kinase



31 HORVU5Hr1G048240 AT2G29980 FAD3 / AT2G29980 / Brassica napus fatty acid desaturase

71


32 HORVU5Hr1G054510 LOC_Os09g13570 OsbZIP71 / LOC_Os09g13570 /

Oryza sativa bZIP transcription factor

33 HORVU5Hr1G077910 AT1G52400 AtBG1 / AT1G52400 / Arabidopsis

thaliana

beta-glucosidase; hydrolyzes glucose-conjugated, inactive

ABA to active ABA

34 HORVU5Hr1G080300 MLOC_54227 CBF4 / MLOC_54227 / Hordeum

vulgare CBF/DREB TF



36 HORVU5Hr1G097500 Solyc04g077980 DgZFP3 / Solyc04g077980 /

Chrysanthemum x morifolium

Cys2/His2-type zinc finger

protein gene



38 HORVU5Hr1G103890 AT2G27150 AAO3 / AT2G27150 / Arabidopsis

thaliana

Arabidopsis aldehyde oxidase, catalyzes final step in ABA

biosynthesis

39 HORVU6Hr1G019540 AT4G31120 CAU1 / AT4G31120 / Arabidopsis

thaliana

H4R3sme2 (for histone H4 Arg

3 with symmetric dimethylation)-type histone

72


methylase protein arginine methytransferase5/Shk1

binding protein1

40 HORVU6Hr1G035470 AT5G67300 MYB44 / AT5G67300 / Arabidopsis

thaliana MYB type TF


thaliana

multidrug resistance-associated


42 HORVU7Hr1G028910 AT1G15690 AVP1 / AT1G15690 / Arabidopsis

thaliana vacuolar membrane H+-

Pyrophosphatase

43 HORVU7Hr1G038770 AT1G75270 DHAR2 / AT1G75270 / Arabidopsis

thaliana dehydroascorbate reductase

44 HORVU7Hr1G078440 MLOC_62301 Srg6 / MLOC_62301 / Triticum

aestivum stress-responsive gene, TF





73

3.4 Discussion

Cultivated barley were domesticated around 10,000 years ago at the Fertile Crescent

where their wild relatives still thrive today. However, a large proportion of suitable

habitats for wild barley have been lost due to climate changes since the last glacial

ages when the climate was much wetter and cooler (Russell et al., 2014). Located at

Mount Carmel in Israel, EC is the natural habitat of wild barley and often referred to

as the ‘Israeli Galapagos’(Nevo, 2014a). The SFS in EC features extreme levels of

solar radiation (200‒800% higher than the NFS) and the associated higher daily

temperatures and drought (Nevo, 1995, 2012; Pavlícek et al. , 2003). NFS is

characterized by a cooler climate with higher relative humidity (1–7%) than SFS.

Thus, EC provides an ideal model to understand the impact of global warming on

genome evolution. Despite evidence of substantial gene flow between the abutting

slopes, significant interslope genetic variation and diversity were detected In this

study, whole genome-wide diversity was investigated for both SFS and NFS wild

barleys. And dramatic genetic divergence was proved to exist between SFS and NFS

wild barleys both in transcriptome level and whole genome level. It was also

suggested that the genetic divergence between SFS and NFS wild barleys are much

larger than that between Tibetan wild barley and cultivated barley. All these evidence

indicates that sympatric speciation exists in wild barley in EC microclimate

environment, which coincides with that reported in other species (Hadid et al., 2014;

Li et al., 2016b; Parnas, 2006).

Considering that the limited gene flow may be a major factor causing the sympatric

speciation of wild barley in the opposing slopes, allele frequency was analysed in

74

whole genome based on the Wright’s F-statistic FST. Up to 243 Mb genomic regions

containing 919 high confidence genes were showed to be under the selection of

differential microclimates between SFS and NFS. 52 out of the 919 genes

differentially expressed in SFS wild barley under drought stress and 18 were

differentially expressed in NFS. There is a large genomic region with le ngth of about

200 Mb in central part of chromosome 3H showing dramatic environmental

selection. We also performed Fluorescence in situ hybridization (FISH) to compare

the overall karyotypical difference of each chromosome between SFS and NFS wild

barleys (Figure 3-11). The dramatic karyotypical difference was detected in the

chromosome 3H between SFS and NFS wild barley.

Figure 3-11 Karyotypical comparison of wild barley NFS6.1 and SFS2.3 and

cultivated barley Golden Promise by FISH.

75

The most contrasting environmental difference between the opposing slopes is

drought and heat due to different levels of solar radiation. Drought alters the gene

expression of compounds that protect cells from dehydration, including genes

belonging to the Dehydrin gene group and heat shock protein 17. Suprunova et al

(Suprunova et al., 2004) measured the expression levels of Dhn genes in tolerant and

sensitive wild barley genotypes, and found that Dhn levels increased earlier and with

higher intensity in tolerant genotypes than in sensitive genotypes. Confirming these

findings, Yang et al. (2009) detected higher levels of Dhn expression in the resistant

SFS compared to sensitive NFS wild barley after dehydration. To investigate the

influence of drought stress on the transcriptional level of wild barley from the

opposing slopes, we performed transcriptome sequencing and detected DEGs

responding to the drought treatment. The number of DEGs in SFS barley is about

three times that in NFS barley after dehydration stress, further supporting interslope

genotypic divergence between tolerant genotypes on the SFS and sensitive genotypes

on the NFS. Lipid metabolic process is over-represented in DEGs of SFS wild barley

but under-represented in DEGs of NFS wild barleys. It was previously reported that

the adjustment of lipid content in cell membrane plays vital roles in response to

drought stress in Arabidopsis (Gigon et al., 2004). Thus, adjustment of lipid

metabolism may be an important mechanism for the wild barley in SFS to adapt to

the extreme drought and heat environment.

Diverse stress factors such as drought can either accelerate or delay flowering in

many plant species. Alteration of flowering in response to stress is known as stress-

induced flowering, which has recently been considered a third category of flowering

response in addition to photoperiodic flowering and vernalisation (Takeno, 2016).

76

Wild barley from SFS flowered one to two weeks earlier than NFS wild barley, even

under well-watered conditions (Parnas, 2006). Different flowering date was also

assumed to be one factor limiting the gene flow between SFS and NFS wild barley.

Recently, Nevo et al (Nevo, 2012) reported that wild barley and wild emmer wheat

(Triticum dicoccoides) populations across Israel flowered about ten days earlier in

2009 compared to 1980 as a result of rising average temperatures due to global

warming. In this study, seven genes with roles in flowering time were detected

within the selective sweep regions , including PHYB and GI. PHYB modulates gene

activity to influence plant growth and development and increases drought tolerance

by enhancing abscisic acid (ABA) sensitivity in Arabidopsis thaliana (Gonzalez-

Guzman et al., 2012). Here, PHYB transcript was present at higher levels in SFS and

lower levels in NFS wild barley after drought, suggesting that higher PHYB

expression alleviates the symptoms of stress more effectively in SFS wild barley.

Drought also alters the circadian clock, impacting the expression of circadian clock

genes such as GIGANTEA (GI) to mediate different signaling pathways that

modulate drought stress perception and responses (Riboni et al. , 2013). In

Arabidopsis, GI enabled the drought escape response via ABA-dependent activation

of the flowering time genes FLOWERING LOCUS T (FT), TWIN SISTER OF FT

(TSF) and SUPPRESSOR OF OVEREXPRESSION OF CONSTANS (SOC1).

Drought is a stress that impacts both SFS and NFS, therefore, both had increased GI

levels after drought. The genotype from NFS was particularly vulnerable to drought,

such that GI was more up-regulated than down-regulated in SFS to potentially induce

earlier flowering. The wild barleys in EC have evolved a different mechanism to

control flowering time through light signalling pathway comparing to the cultivated

77

barley in which photoperiodic response and vernalisation are the major factors for

flowering.

Sympatric speciation—the evolution of reproductive isolation without geographic

barriers—is a highly contentious concept in evolutionary biology (Bird et al. , 2012;

Bolnick and Fitzpatrick, 2007). Whether natural selection can outperform the

continuous homogenizing force of free gene flow in a sympatric ecological condition

and generate reproductive incompatibility is under strong debate. Our study is the

first one to use whole-genome sequencing and transcriptome sequencing to

investigate the genomic differentiation pattern of sympatric speciation in a plant

species. A genome-wide FST analysis revealed detailed information on the genomic

differentiation between SFS and NFS barley. Notably, the total genomic region

identified as a divergence outlier (FST > 0.89) was about 243 Mb ( 3-6), representing

~5% of the barley reference genome, and falls within the 0.4–35.5% range reported

for other plants under divergent selection (Strasburg et al., 2012). This value is

higher than the 3.6% reported for Oenanthe conioides and Oenanthe aquatic, two

Apiaceae plant species under sympatric speciation (Westberg and Kadereit, 2014). In

another study, on the sympatric speciation of a palm tree in an oceanic island, only

four of the 274 AFLP loci differed strongly between the two diverging populations

(Savolainen et al., 2006). In our study, the wild barley on the two opposing slopes are

under diverse and strong environmental selection involving heat, drought,

illuminance and pathogen stresses (Nevo, 1995; Nevo, 2014a), which may explain

the different result from the other two sympatric speciation cases, where ecological

selection in the corresponding plant populations was driven by a single

environmental factor such as soil pH or fresh water tides (Savolainen et al., 2006),

78

respectively. Moreover, the proportion of genomes under divergence in both studies

was estimated from AFLP analyses which may be biased and unable to uncover all

the genomic regions under natural selection. Interestingly, genomic divergence

analyses of the SFS and NFS barley revealed one prominent ‘chromosome island’ on

chromosome 3H under strong natural selection, spanning a continuous region of

around 200 Mb ( 3-6). This ‘unusual’ divergence signature represents a typical

genomic differentiation pattern expected for sympatric speciation, where only those

genetic regions directly related to the specific environmental selection pressure (most

likely heat, drought, illuminance stresses) will be differentiated, while the remainder

of the genome is subjected to continuous homogenizing forces due to free gene flow

in the sympatric setting (Strasburg et al., 2012). This is consistent with the genetic

differentiation pattern reported in the other two sympatric speciation cases of plants

(Savolainen et al., 2006). In contrast, genetic differences are expected to accumulate

throughout the whole genome for allopatric speciation with geographic isolation (Via

and West, 2008). Our results lend further support to the sympatric speciation of wild

barley at EC at the genetic divergence level. Detailed examination of the genomic

function of this chromosome island may shed light on the speciation process of

plants and allow us to understand better how wild barley evolves to adapt to highly

diverse environmental conditions across the world. Our results are also supported by

the previous studies based on F1 hybrids from 40 intraslope and 50 interslope crosses

tested for a range of vegetative and reproductive traits , supports incipient sympatric

speciation in wild barley from opposing slopes at EC (Nevo, 2006; Parnas, 2006).

For most traits, interslope crossbred plants were significantly inferior to intraslope

hybrids.

79

In summary, we observed strong genomic differentiation of wild barley across a

short distance of 250 m due to the sharp microclimatic divergence between the

abutting slope populations at Evolution Canyon I, Israel. To put the local patterns of

genetic variation into a broader geographic context, we also analyzed two Tibetan

wild barley accessions and two Canadian and Australian cultivated barley varieties.

Sample accessions from both sides of the slopes at EC confirmed the presence of

strong genomic differentiation underlying adaptive incipient sympatric ecological

speciation. The process of adaptive incipient sympatric speciation in wild barley at

EC, due to sharp interslope microclimatic divergence is supported by molecular

markers, genomic and transcriptomic evidence, and is shared with organisms across

phylogeny from bacteria to mammals identified in previous studies (Nevo, 2014a).

The genetic base of cultivated barley has rapidly narrowed resulting in loss of genetic

diversity, increasing the risk that cultivated barley will become increasingly

susceptible to diseases and environmental stresses. Wild and cultivated barley are

both diploid and inter-fertile opening up the possibility of using wild barley for

genetic variation in the improvement of cultivated barley against abiotic and biotic

environmental stresses.

Contribution statement

Eviatar Nevo (EN), Chengdao Li(CL) collected the samples. Xiaoqi Zhang (XZ), Fei

Dai (FD), CL performed DNA sequencing. FD, Xiaolei Wang (XW), Guoping Zhang

(GZ) performed RNA sequencing and the drought experiment. Cong Tan (CT),

Penghao Wang (PW), Brett Chapman (BC), Brett Chapman (RB), Matthew Bellgard

(MB) processed the data and performed computational analyses. CT, CBH (Camilla

80

Beate Hill), PW, BC, RB, Qisen Zhang (QZ), CL analysed the data and interpreted

the results. CL, EN and GZ conceived and designed the study.

81

Chapter 4 Genomic diversity of cultivated

barley

Abstract

Barley (Hordeum vulgare L.) is one of the earliest domesticated crops and ranks the

fourth in the world. It provides abundance of nutrition and energy for humans and

animals. And it has been widely used as model plant for Triticeae research. Recently,

a high quality of pseudomolecule genome sequence was generated for barley cv.

Morex, which paved the way to investigate the genomic diversity and isolate

functional genes by DNA sequencing in barley. In this study, we selected eleven

cultivated barleys from Australian cultivated barley germplasm to investigate the

genomic diversity. They were sequenced with whole genome shotgun sequencing

strategy with high-coverage from 25-fold to 42-fold. In total, 29,550641 SNPs and

1,592,034 InDels were detected for the eleven cultivated barleys. After comparing

the divergence between Australian cultivated barelys and wild barleys from ‘the

Evolution Canyon’, a total of 612 Mb genomic regions containing 2,649 high-

confidence genes were demonstrated to be under the selection of domestication. The

genomic variants and candidate genes under selection of domestication identified in

this study will benefit the isolation of QTLs controlling important agronomic traits

and the development of genome-wide molecular markers for molecular breeding of

new varieties in barley.

82

4.1 Introduction

Barley is the second most important crop in Australian and the fourth in the world

both in terms of grain production and harvested area (www.fao.org/faostat). Barley

provides abundance of energy and nutrition for human and animal. Meanwhile,

malting barley is an important raw material for brewing beer and distilled beverage.

Cultivated barley (Hordeum vulgare L.) is domesticated from wild barley (Hordeum

spontaneum L.) around 10,000 years ago at the Fertile Crescent (Harlan and Zohary,

1966). Barley can grow in highly diverse environments from the Eastern

Mediterranean coasts to Tibet (Dai et al. , 2014; Zeng et al. , 2015). It has been widely

used as model organism to research plant adaptive evolutionary processes as well as

vital genetic repository to explore plant abiotic and biotic stress tolerance.

Barley is diploid (2n=2x=14) with an estimated haploid genome of 5.1 Gb. Barely’s

genome is featured by extreme abundance of repetitive elements , which hindered the

assembly of barley genome. In 2012, a barley physical map with total length of 4.83

Gb was assembly based on high-information-content fingerprinting, BAC-end

sequencing and a high density genetic map (International Barley Genome

Sequencing et al., 2012; Mascher et al., 2013a). The physical map could be

represented by a minimum tilling path of 67,000 BAC clones. A gene space

assembly was generated by sequencing and assembling of 5,341 gene-containing

BACs, which had benefited functional QTL identification by linkage mapping (Zhou

et al., 2015) and association mapping (Russell et al., 2016), and genes isolation such

as MKK3 and AlaAT regulating seed dormancy (Nakamura et al., 2016; Sato et al. ,

2016). The past several years has witnessed the dramatic development of DNA

83

sequencing technology and molecular biology such as hierarchical shotgun

sequencing, optical mapping, and chromosome conformation capture sequencing,

which provide new approaches to resolve the challenges of barley genome assembly.

A high-quality pseudomolecule genome of barley was eventually achieved by the

International Barley Genome Sequencing Consortium in 2017 (Beier et al., 2017;

Mascher et al., 2017). The pseudomolecule (4.79 Gb) covers 94% of barley genome,

within which 39,734 high confidence genes were annotated. The completion of

barley reference genome is an important milestone of barley genetic research, which

marks the full entry of the barley genetic research into a genomic era represented by

high throughput DNA sequencing.

In this study, we use the achieved barley genome sequence as a reference to

investigate the genomic diversity in Australian cultivated barleys. Eleven accessions

were selected from Australian cultivated barley germplasm and sequenced with

whole-genome shotgun sequencing approach. Firstly, we detected the whole

genome-wide SNPs and short InDels in the eleven Australian cultivated barleys.

Secondly, we investigated the evolutionary relationship of Australian cultivated

barley with 15 other barley accessions and the genomic divergence of Australian

cultivated barley between wild barley from ‘the Evolution Canyon’.

84

4.2 Methods

Materials and datasets

In this study, eleven Australia cultivated barleys (Table 4-1, No.1-11) were selected

and sequenced with whole-genome shotgun strategy. They were grown in Murdoch

University glasshouse under standard management. Total genomic DNA was

extracted from leaf at three leave stage for each sample and randomly fragmented

using ultrasound method for sequencing libraries preparation. Sequencing libraries

were prepared according to the manual of Illumina sequenc ing library kit (Paired-

End Sample Preparation Guide, Illumina, 1005063). Then, they were sequenced on

HiSeq 2500 platform in BGI-Shenzhen (Beijing Genomics Institute-Shenzhen,

Shenzhen, China).

Besides, whole genome sequencing data of 10 barley accessions from our previous

project (Table 4-1, No.12-21). Five samples (Bowman, Barke, Haruna_Nijo, Igri,

B1k-04-12) sequenced by the Leibniz Institute of Plant Genetics and Crop Plant

Research (IPK) are downloaded from the NCBI-SRA database. We only

incorporated the sequencing data of these samples into our analysis to compare the

difference between genomic difference between Australian barleys and European

barleys.The detail information of these samples and data are given in Table 4-1.

85

Table 4-1 Sample and data information

No. Sample Type Data_origin Institute

1 Scope AUS cultivar sequencing WBGA

2 La_Trobe AUS cultivar sequencing WBGA

3 WI4304 AUS breeding line sequencing WBGA

4 Commander AUS cultivar sequencing WBGA

5 Fleet

AUS cultivar sequencing WBGA

6 Hindmarsh AUS cultivar sequencing WBGA

7 Vlamingh AUS cultivar sequencing WBGA

8 Franklin AUS cultivar sequencing WBGA

9 YYXT China cultivar sequencing WBGA

10 XYH0 China cultivar sequencing WBGA

11 XYH2 China cultivar sequencing WBGA

12 AC_metcarlfe AUS cultivar previous project WBGA

13 Baudin AUS cultivar previous project WBGA

14 SFS2.1 EC wild barley previous project WBGA



17 NFS6.1 EC wild barley previous project WBGA



20 TB1 Tibetan wild barley previous project WBGA

21 TB2 Tibetan wild barley previous project WBGA

22 Igri Japan cultivated barley EMBL/ENA IPK

23 Haruna Nijo Japan cultivated barley EMBL/ENA IPK

24 Bowman US cultivated barley EMBL/ENA IPK

25 Barke German cultivated

barley EMBL/ENA IPK

26 B1k-04-12 Israel wild barley EMBL/ENA IPK

86

Mapping reads to reference and detecting genomic variant

The detailed procedures for mapping reads against reference and genomic variation

detection for the collected sequencing data can refer to Chapter 3, and the summary

is given as follows. Firstly, a strict filtration was performed to remove reads with

adapter contamination and trim the poor-quality bases for each accession using

Cutadapt (Martin, 2011). The quality of clean data was assessed using Fastqc

(Andrews, 2014) before further analysis. Then, the barley pseudo molecular genome

was used as the reference to map the clean reads using BWA-MEN (Li, 2013). Next,

only those having unique mapping positions remained and proceeded to detect

genomic variation (SNPs and InDels). The final SNPs and InDels were generated by

running two rounds of variants calling analysis using SAMTools/BCFTools

(Hofmann, 2014; Ibarra et al. , 2016) and GATK (Graeber et al., 2012) pipelines. The

first round was performed with SAMTOOLS and BCFTOOLS for all accessions and

the result was controlled based on the variants quality. Then, the calling result of the

first round was used as the input file to guide the realignment around potential InDels

and variants calling for the second round in GATK. The SNPs and InDels were

detected by both SAMTools/BCFTools pipeline and GATK were retained and then

strictly filtered mainly based on base quality, mapping quality and variants

supporting depth (Li, 2014).

Evolutionary relationship and selective sweep analysis

The evolutionary distances were calculated using the Maximum Composite

Likelihood method, based on SNP variants evenly distributed in whole-genome

without missing data. Phylogenetic tree was constructed using the Neighbour-Joining

87

method in MEGA software and validated with bootstrap test (1000 replicates).

Selective sweep analysis was conducted by VCFtools, which measures the patterns

of allele frequencies in each 1 Mb fragment along chromosomes using the whole-

genome-wide SNP variation data. Genomic regions with FST values above 0.89 were

considered under strong selective sweeps.

Genetic effect prediction and gene function annotation

Genetic effect of each variant was predicted using SnpEff (Cingolani et al., 2012)

based on its position in gene models and the changing of coding product. Function

annotation of genes with high effect were predicted based on the sequence homology

with well researched genes in databases of NR, Swiss-Prot, KEGG, COG and GO.

This step was achieved using AutoFACT (Koski et al., 2005).

88

4.3 Results

4.3.1 Whole genome sequencing of eleven cultivated barleys

To investigate the genomic diversity in cultivated barleys, eleven barley cultivars

were selected as representatives for whole genome shotgun sequencing. The identity

of the Australian barleys have been examined according to their DNA fingerprint.

For each sample, genomic DNA was fragmented and fragments with the size of 300

to 500 bp were selected for preparing sequencing libraries. For eleven cultivated

barleys, paired-end sequencing with reads length of 100 to 150 bp produced a total of

16,052,460,585 clean reads after filtration. Detail information of sequencing output

for each sample was given in Table 4-2. Sequencing coverage of individual sample

ranges from 25-fold to 42-fold. Clean reads of each accession were mapped to the

barley reference genome using BWA-MEM and mapped reads with gaps were

realigned using GATK. Mapping rate is about 95% on average for these eleven

barley accessions. Only those reads have unique mapping positions with high quality

scores were further used to detect SNPs and short InDels

89

Table 4-2 Summary of sequencing output and variants calling

No. Sample Read lengh

(bp) Clean reads Clean bases Coverage (X) SNP number

InDel number

1 Scope PE-100 2,003,430,854 200,343,085,400 41.48 11,004,995 873,208

2 La_Trobe PE-100 1,867,321,773 186,732,177,300 38.66 11,462,699 892,956

3 WI4304 PE-100 2,198,504,849 219,850,484,900 45.52 12,345,774 935,979

4 Commander PE-100 2,164,074,376 216,407,437,600 44.80 13,125,947 987,294

5 Fleet PE-100 2,138,529,486 213,852,948,600 44.28 12,082,089 952,594

6 Hindmarsh PE-100 2,029,068,470 202,906,847,000 42.01 11,486,896 896,416

7 Vlamingh PE-100 1,997,091,617 199,709,161,700 41.35 12,511,246 943,524

8 Franklin PE-150 821,367,392 123,205,108,800 25.51 12,003,604 731,997

9 YYXT PE-150 853,749,138 128,062,370,700 26.68 13,342,817 766,413

10 XYH0 PE-150 668,729,618 100,309,442,700 20.90 9,896,567 552,573

11 XYH2 PE-150 810,200,780 121,530,117,000 25.32 10,244,910 614,319

90

4.3.2 SNP and InDel variants in eleven cultivated barleys

In total, 29,550641 SNPs and 1,592,034 InDels were detected for the eleven

Australian cultivated barleys. Detailed variant information of each sample was in

Table 4-2. On average, about 11 million SNPs and 867 thousand InDels were

detected for each cultivated barley. Barley cultivar Commander has the most SNPs

(13,125,947) and InDels (987,294) compared to the barley reference cv Morex.

Effects of all detected SNPs and InDels were predicted based on their positions in the

gene models of barley reference and change of the coding product. Variants located

in the intergenic regions accounted for about 88.35% of variants , while 0.49% of

variants were in exon regions. For those in exon regions, 194,656 of them caused

missense mutation and 25,235 caused frameshift mutation.

Table 4-3 Pairwise comparison of seven Australian barley varieties

Difference Command

er Fleet

Hindmash

La_Trobe

Scope Vlaming

h WI4304

Commander

0 1,825,40

8 1,905,59

8 1,861,09

8 2,361,19

1 2,188,03

3 2,509,81

4

Fleet 1,825,408 0 1,987,58

8 1,940,13

9 2,435,43

7 2,282,32

6 2,445,76

3

Hindmash 1,905,598 1,987,58

8 0 765,570

2,369,199

2,112,996

2,194,204

La_Trobe 1,861,098 1,940,13

9 765,570 0

2,294,582

2,071,247

2,154,517

Scope 2,361,191 2,435,43

7 2,369,19

9 2,294,58

2 0

2,580,188

2,560,647

Vlamingh 2,188,033 2,282,32

6 2,112,99

6 2,071,24

7 2,580,18

8 0

2,469,176

WI4304 2,509,814 2,445,76

3 2,194,20

4 2,154,51

7 2,560,64

7 2,469,17

6 0

91

4.3.3 Evolutionary relationship analysis

To investigate the evolutionary relationship of Australian cultivated barley with

cultivated barley from other continents and wild barleys, we incorporated the whole

genome sequencing data of ten samples from our previous project and five samples

sequenced by Leibniz Institute of Plant Genetics and Crop Plant Research.

Evolutionary distance was measured using the Maximum Composite Likelihood

method. The final phylogenetic tree was showed in Figure 4-1. Most Australian

cultivated barleys were clustered in one group, while seven Israel wild barleys were

clustered into another group.

92

Figure 4-1 Evolutionary relationships of 26 barley accessions.

The phylogenetic tree was inferred using the Neighbour-Joining method with

bootstrap replication of 1000. The evolutionary distances were computed using the

Maximum Composite Likelihood method. The analysis was based on 1,864 SNP

variants evenly distributed in whole genome of 26 barley accessions.

4.3.4 Genetic divergence between Australian cultivated

barley and EC wild barley

93

The genomic difference between the Australian and wild barleys were calculated

with the statistic value of Fst, which tests whether the allele frequency of each

screening window is significant different between the two groups. The Fst was

calculate using VCFtools with the screening window size of 1 Mb and moving step

of 250 Kbs’. Only SNPs with allele frequency over than 5% were kept for the sweep

selection analysis. The FST between these two barley groups was plotted in the

Figure 4-2. Large continuous genomic regions with environmental selection can be

seen around the centomere regions of chromosome 1H, 4H and 6H (Figure 4-2). In

total, 2,447 out of 19,341 windows have FST value bigger than 0.89, which covers

612 Mb genomic region. There are 2,649 high-confidence genes within the genomic

regions under environmental selection. GO enrichment for 933 out of 2,649 genes

was conducted and the result is showed in Figure 4-3. Most of genes under

environmental selection are classified into binding and catalytic activity in molecular

function as well as metabolic process in biological process annoation.

94

Figure 4-2 Genomic difference between Australian cultivated barleys and EC wild

barleys.

It was calculated based on whole-genome wide SNPs between Australian cultivated

barleys and ‘the Evolution Canyon’ wild barleys. Red horizontal line indicates the

threshold of 95%, which is the threshold value of 0.89.

95

Figure 4-3 GO enrichment of genes under environmental selection.

96

4.4 Discussion

The completion of barley reference genome with a high qua lity is an important

milestone in barley genetic research, which provides a blueprint to investigate the

genomic evolution of barley and to isolate the genes controlling important agronomic

traits (Mascher et al. , 2017). In this study, we sequenced eleven Australian cultivated

barleys using whole genome shotgun sequencing strategy to investigate the genomic

diversity in Australian cultivated barley and its domestication. A total of 29,550641

SNPs and 1,592,034 InDels were detected for the eleven Australian cultivated

barleys. Based on genomic variants, we investigated the evolution relationship

between Australian cultivated barleys with other fifteen barley accessions including

wild barleys and cultivated barleys from other continents. Most Australian cultivated

barleys are clustered into a group and apart from cultivated barleys from Asia and

America. This phenomenon was assumed to be due to the selection of Australian

specific environment and climate. LaTrobe is the sister line of Hindmarsh because

Latrobe is selected from Hindmarsh with a spontaneous mutation. Similarly, XYH2

is a mutant line from XYH0. Moreover, we investigated the genetic divergence

between Australian cultivated barley and wild barley from ‘the Evolution Canyon’.

612 Mb genomic regions containing 2,649 high-confidence genes were demonstrated

to be under the selection of domestication.

Traditional molecular markers such as RFLP (Sherman et al., 1995), SSR (Varshney

et al., 2007) and array-based markers such as DArT array (Wenzl et al., 2004), SNP

array (Close et al., 2009) have played important roles in primary mapping of many

important agronomic QTLs. But the limited number of these molecular markers

97

cannot meet the requirement of QTL fine mapping. Lack of polymorphic molecular

marker is one of the two major challenges in the process of QTLs fine mapping in

barley. The genomic variants discovered in this study can be used to develop

molecular markers with the highest density. For example, Jia et al. (2016) narrowed

down the target region of the semi-dwarf gene ari-e from 6.8 cM to 0.58 Mb region

using the InDel markers developed based on the Indel information between cv.

Hindmash and Tibetan wild barley W1. Wang et al. (2017) developed Indel markers

based on genetic variants between cv Scope and cv. Vlamingh in a target region of

9.85 Mb. After genotyping with the new InDel markers, the candidate region was

further narrowed down to a 0.482 Mb region only containing 11 high-confidence

genes.

Discovering the genomic diversity is the fundamental for further explore the

invaluable genetic resource in each specie. To discover the genomic diversity in rice,

an international cooperation project was performed to whole genome shotgun

sequencing about 3000 rice accessions (Li et al. , 2014a). This resource has been used

to research the evolution and domestication of rice. Discovering the genomic

diversity of barley germplasm is limited by the large genome size and cost. In this

study, we performed high coverage whole genome sequencing for several major

Austrian barleys. For example, ‘Baudin’ and ‘La_Trope’ are two of the popular

marting barleys widely grown in Australia. This partly uncovered the genomic

diversity in Australian barley germplasms. These genomic diversity information of

Australian core varieties are very important to find out which functional genes are

critical for barley to better adapt to the special environment/climate in Australia. It

remains expensive to perform whole genome sequencing for a large number of

98

barley accessions given the large size of barley genome. Compared with whole

genome sequencing, exon sequencing is another option to discover the genomic

diversity in gene coding regions for species with large genome size like barley

(Mascher et al., 2013b; Warr et al. , 2015). A whole exome capture platform of barley

has been developed, which provides the possibility to selectively capture the coding

regions (about 61.6 Mbs) in barley genome (Mascher et al., 2013b). Using this barley

exome capture platform, a collection of 267 landrace barleys and wild barleys were

sequenced to understand the environmental adaptation of barley (Russell et al. ,

2016). Based on the genomic diversity discovered in this barley germplasm, it was

indicated that seasonal temperature and the amount of rainfall are two of the major

drivers for barley’s environmental adaptation. With the advancing of the DNA

sequencing technology and the reducing cost, it will be possible to investigate the

whole genome wide diversity for a large number of barley germplasms.

99

Chapter 5 BarleyVarDB, a database of barley

genomic variation

Abstract

Barley (Hordeum vulgare L.) is one of the first domesticated grain crops and has

been the fourth most important cereal source for human and animal consumption.

BarleyVarDB is a database of barley genomic variation. It can be publicly accessible

through the website at http://146.118.64.11/BarleyVar. This database mainly

provides three sets of information. Firstly, there are 57,754,224 SNPs and 3,600,663

InDels included in BarleyVarDB, which were identified from high-coverage whole

genome sequencing of twenty barley germplasm, including seven wild barley

accessions from three barley evolutionary original centres and 13 barley landraces

from different continents. Secondly, it makes the latest barley genome reference and

its annotation information publicly accessible, which has been achieved by the

International Barley Genome Sequencing Consortium (IBSC). Thirdly, 522,212

whole genome-wide microsatellites/SSRs, were also included in this database, which

was identified from the barley pseudo molecular genome sequence. In addition,

several useful web-based applicants were incorporated in BarleyVarDB, such as

Jbrowse, BLAST and Primer3. BarleyVarDB is featured by automatically designing

PCR primers for detecting any variants deposited in this database and a friendly

interface for accessing barley reference. In short, BarleyVarDB will benefit the

barley genetic research community by giving them access and making full use of all

100

publicly available barley genomic variation information and barley reference genome

as well as providing them ultra-high density of SNP and InDel markers for molecular

breeding and identification of functional genes with important agronomic traits in

barley.

101

5.1 Introduction

Single Nuclear Polymorphisms (SNPs) and Insertions or Deletions (InDels) are the

two most common types of genetic variations among living organisms. They have

played important roles in examining genetic diversity (Seberg and Petersen, 2009),

positional cloning (Li et al. , 2011; Xue et al. , 2008; Yan et al., 2014), association

mapping (Huang et al. , 2010; Huang et al., 2011b; Huang et al. , 2012b; Yano et al. ,

2016), and evolutionary biology (Huang et al., 2012a; Lin et al., 2014) in the past

decades. Barley (Hordeum vulgare L.) is one of the first domesticated grain crops

(Badr et al., 2000; Zohary et al., 2012) and has been the fourth most important cereal

source for human and animal consumption. It is widely used as a model organism to

research the genetic basis of plant adaptive evolutionary processes and as a vital

genetic repository to explore plant abiotic and biotic stress tolerance, as it can be

grown in diverse and extreme environments such as warm-dry Near East and in cold-

dry Tibet (Nevo, 2014b). The availability of barley reference genome achieved this

year (Mascher et al., 2017) and the dramatic advantage of next-generation

sequencing technology will make it more efficient to explore genomic variation in

barley germplasm, which will accelerate barley genomic and genetic research

progress. There are several genomic variation databases such as RiceVarMap (Zhao

et al., 2015), SNP-Seek (Alexandrov et al. , 2015) and HapRice (Yonemaru et al. ,

2014) conducted in rice and dbSNP in the National Center for Biotechnology

Information (NCBI) (Sherry et al., 2001). There however, hasn’t been any public

database available of genomic variation information in barley until recently, except

several databases for molecular markers such as GrainGenes (Carollo et al., 2005)

102

and barley physical map information of Barlex (Colmsee et al., 2015). At this stage,

a comprehensive database of barley genomic variation is in an urgent need for the

barley genomic research community to share and make good use of the genomic

variation information exploited by different research groups. Here, we constructed a

comprehensive database specializing in barley genomic variation and designated it as

BarleyVarDB. In summary, BarleyVarDB database mainly includes three sets of

data. Firstly, there are 57,754,224 SNPs and 3,600,663 InDels provided in

BarleyVarDB, which were identified from high coverage whole genome sequencing

(~40X) of twenty diverse barley germplasms representing wild barley accessions

from three widely accepted barley evolutionary origins and cultivar barley accessions

from different continents. Secondly, it makes the latest barley genome reference and

corresponding gene annotation information accessible to the research public, which

has been achieved by the International Barley Genome Sequencing Consortium

(IBSC) this year (Mascher et al., 2017). Thirdly, 522,212 microsatellites, also called

simple sequence repeats (SSRs), were included in this database and identified from

the barley pseudo molecular genome sequences. In addition, several friendly and

useful web-based tools were incorporated into BarleyVarDB, such as the genetic

genome browse (JBrowse) (Holton, 2002) for displaying the barley genomic data in

an overall, basic local alignment search tool (BLAST) (Altschul et al., 1990) for

mapping sequences against barley genome and annotation sequences and primer3 (Li

et al., 2015) for primer design. BarleyVarDB is publicly accessible through the

website interface at http://146.118.64.11/BarleyVar.

103

5.2 Database construction

The framework of BarleyVarDB is shown as the Figure 5-1a. The first aim of this

database is to share and make full use of the genomic variants (SNPs, InDels) mined

from increasingly accumulated genome sequencing data in barely research

community. These genomic variants can be retrieved by different query keywords

such as variants identifier, target region, candidate gene and so on. The second aim is

to distribute the updated barley genome reference with high quality to the public and

also make it easily accessible by barley genetic researcher without programming

experience through web-based applicants like JBrowse and BLAST. Both the

genomic variants and barley genome reference pave the way for gene mapping of

important agronomic traits, new competitive varieties breeding using molecular

marker assistant selection and research into the adaptational evolution and

domestication of barley.

104

Figure 5-1 Framework, implementation, and date collection of BarleyVarDB.

a, the framework and aims for building BarleyVarDB; b, the overview of the

implementation of BarleyVarDB; c, the geographic distribution map of barley

accessions collected in BarleyVarDB.

BarleyVarDB was mainly implemented using the open-source HTTP server of

Apache2 and the relational database management system of MySQL (Figure 5-1b).

Apache server provide the web interface for database clients to visit and interact with

the database server using internet browsers. All the genomic variations and barley

105

reference genome annotation information are stored and maintained in a MySQL

database server. Clients can retrieve these information through internet browser and

Apache server. Genomic variants database can be maintained and updated using

MySQL server commands in Linux or using Phpmyadmin through web browser.

Besides, some perl, python and PHP scripts were used to make the display of the

retrieved result friendlier as well as carry out some basic functions such as keywords

searching in batch.

At present, we have collected about 3.70 Tbs sequencing data of whole genome

sequencing of twenty-one barley accessions in high-coverage, which represent wild

barley from three evolution centres and cultivar barley from different continents

(Figure 5-1c). The first dataset comprises of six wild barley accessions from the

Tibetan plateau and opposing slopes of “Evolution Canyon” as well as nine cultivar

barley accessions from Australia and Canada. The detailed description of sample

collection and whole genome sequencing for this dataset were given in Chapter 3 and

Chapter 4. The main content of this chapter is to create a barley genomic variation

database to make the genomic information explored in chapter 3 and chapter 4

accessible to barley research community. In short, these barley germplasms were

planted in the glass house and genomic DNA was extracted from leaves of the plants

at three leaf stage using the CTAB (cetyl trimethylammonium bromide) method. The

purity and concentration of the genomic DNA. Sequencing library for each sample

was prepared following the manufacturer’s instruction (Paired-End Sample

Preparation Guide, Illumina, 1005063) and then sequenced on the Illumina HiSeq

2000 platform in BGI-Shenzhen (Beijing Genomics Institute-Shenzhen, Shenzhen,

China). Eventually, about 2.96 Tb high quality reads were produced from these

106

fifteen barley accessions and the raw data submitted to NCBI/SRA with BioProject

accession number PRJNA324520. The second dataset (about 736 Gbs) comprises of

five barley cultivars and one wild barley accession (International Barley Genome

Sequencing et al., 2012), which was performed by Leibniz Institute of Plant Genetics

and Crop Plant Research (IPK). It was downloaded from European Nucleotide

Archive (ENA, EMBL-EBI) with the accession number PRJEB628.

The detailed methodology for mapping reads against reference and genomic

variation detection for the collected sequencing data can refer to Chapter 3 and

Chapter 4, and the summary is given as follows. Firstly, a strict filtration was

performed to remove reads with contamination and trim the poor-quality bases for

each accession. Then, the barley pseudo molecular genome was used as the reference

to map the clean reads using BWA-MEN (Li, 2013). On average, about 98% of clean

reads were mapped to the reference sequence and about 49% of them have unique

mapping positions in the reference with high mapping quality. Next, only those

having unique mapping positions remained and proceeded to detect genomic

variation (SNPs and InDels). The final SNPs and InDels were generated by running

two rounds of variants calling analysis using SAMTools/BCFTools (Hofmann, 2014;

Ibarra et al., 2016) and GATK (Graeber et al., 2012) pipelines. The first round was

performed with SAMTOOLS and BCFTOOLS for all accessions and the result was

controlled based on the variants quality. Then, the calling result of the first round

was used as the input file to guide the realignment around potential InDels and

variants calling for the second round in GATK. The SNPs and InDels were detected

by both SAMTools/BCFTools pipeline and GATK were retained and then strictly

107

filtered mainly based on base quality, mapping quality and variants supporting depth

(Li, 2014).

Currently, we don’t have extra data for the validation. But our analysis pipeline has

been tested using the benchmark datasets, the accuracy is up to 95%. The variants

information in this chapter has been widely used for molecular marker development

in our own group and other groups.

108

5.3 Database description

Search SNP/InDel/SSR by Id: In this interface, web users can retrieve certain

SNP/InDel/SSR by entering their identifiers (such as “SNP|chr1H|000000749”,

“IND|chr1H|000023712”, “SSR|chr1H|000068174”), which is unique in the

database. Two pieces of information are included in the identifier: variation type

(“SNP”, “IND”, “SSR”) and chromosome coordinate (eg. chr1H|000000749). For

example, “SNP|chr1H|000000749” means a SNP at 749 bp in chromosome 1H of

barley reference.

Search SNP/InDel/SSR in target region: In this interface, a target region together

with barley accessions can be given and be used to retrieve SNP/InDel information.

Then the SNP and InDel Information in the target region among the given accessions

can be obtained. For SSR search, a certain region can also be given to acquire SSR

records in the target regions.

109

Figure 5-2 Web interfaces of retrieving SNPs, INDELs, SSRs and gene annotation

in BarleyVarDB.

a, main interface of SNP search input and result output; b, main interface of INDEL

search input and result output; c, main interface of SSRs search input and result

output; d, main interface of gene annotation search input and result output.

Search SNP/InDel/SSR within gene: genomic variants in gene region are quite

useful, because different alleles may result in the change of phenotype and they can

be used as a functional marker to track target genes of agronomic importance. The

genetic effects of each SNP or InDel were assessed based on their locations in gene

110

structure (5’UTR, exon, intron, 3’UTR) and the change of gene coding products

(nonsense mutation, missense mutation, open reading frame shift and so on) using

snpEff (Cingolani et al., 2012). Because the high coverage of whole genome

sequencing, the information of these SNP/InDels location and genetic effects will

also be useful to predict the candidate genes in target region of map-based cloning.

Search polymorphic SNP/InDel between two Samples: Comparison of genetic

background between two parent lines of a mapped population is a common task for

genetic research. In this interface, the user can directly get SNP/InDel sites with

different alleles for two given samples. This information is very useful in developing

polymorphic molecular markers for agronomic QTL mapping or marker-assistance

selection breeding. PCR Primers to examine polymorphic InDels can be

automatically designed after selecting the hyperlinked InDel names. For polymorphic

SNP sites, they can be converted to CAPS markers easily if they are located in the

recognizing sequence of restriction enzymes. Primer automatically design is

performed by a background processing script.

Web-based tools for barley genetic researchers: JavaScript-based Genome

browser (JBrowse) (Skinner et al., 2009) is a user-friendly and interactive web

interface for displaying and manipulating whole genome-wide datasets such as

reference genome sequence, reference annotation information, genomic variations,

gene expression levels and so on. All these large sizes of dataset are stored in

mySQL database which will accelerate the process of retrieving and displaying.

Meanwhile, a BLAST server for the latest version of barley genome reference are set

up in this website. The pre-build BLAST databases including whole genome

111

sequence, coding sequences, transcript sequences and predicted protein sequences of

barley reference. Besides, a web-based PCR primer design system is also

implemented with primer3plus as the core program. Except for the normal primer

design function, some special features for barley genome are added. For example,

users can directly pick up primers by entering the chromosome coordinate of the

target region in barley. It will save barley genetic researchers a lot of time and

increase the efficiency of their research.

112

Figure 5-3 Examples of software applications in BarleyVarDB.

a, example of gene annotations of barley genome reference displayed in Jbrowse; b

search interface of barley reference genome blast server; c example of primer designs

in barley using primer3plus.

113

5.4 Future improvement

We are applying for a Domain Name System (DNS) ID (web site) for this database.

Before the DNS applying has been approved, this database could be accessed

through the current IP address. Later on, we will link the current IP address to the

web site so users could access to the database by either current IP address or a web

site.

As more barley germplasms are sequenced, we will continuously exploit genomic

variation information from more sequenced data and add their genomic variation

information into BarleyVarDB to keep it the most comprehensive database for barley

genomic variation. Meanwhile, more web-based software resources for the barley

genetic research community to explore barley genomic data will be developed and

incorporated in BarleyVarDB. Feedback from users will be taken into consideration

when making BarleyVarDB more user-friendly.

114

Chapter 6 Characterization of genome-wide

variations induced by gamma-ray

radiation in barley using RNAseq

Abstract

Induced mutants not only provide a new approach to increase the diversity of

desirable traits for breeding new varieties, but also benefit characterization of the

genetic basis of functional genes. In the past decades, many mutant genes have been

identified to be responsible for the phenotype changes in mutants in various species

including Arabidopsis, rice and so on. However, the entire features of mutations in

induced mutants and the underlying mechanism of various types of artificial

mutagenesis remain unclear. In this study, we adopt transcriptome sequencing

strategy to characterize all mutations in coding regions in a barley dwarf mutant

induced by gamma radiation. In total, 1,193 genetic mutations were detected in gene

transcription regions introduced by gamma ray radiation. Interestingly, most of them

are concentrated in certain regions in two major chromosomes. Eventually, 140 out

of 26,745 expressed genes were affected due to ga mma radiation and their biological

functions include cellular process and metabolic process and so on. Our results

indicated that mutations induced by gamma radiation are not evenly distributed in

whole genome, but prone to occur in several concentrated regions in chromosomes.

Our study has provided an overview of the feature of the genetic mutations and genes

115

caused by the gamma ray radiation, which will benefit the deeper understanding the

mechanism of radiation mutation and utilizing of them for gene function analysis.

116

6.1 Introduction

In the past century, thousands of new crop varieties have been released from induced

mutants and widely grown (Ahloowalia and Maluszynski, 2001). They have played

an important role in creating desirable agronomic traits including higher yield,

improved quality, and abiotic stress tolerance for crop breeding. It was reported that

new varieties derived from gamma radiation mutagenesis have accounted for up to

64% of all radiation induced varieties from 1930 to 2004 (Ahloowalia et al., 2004).

For instance, ‘Calrose 76’, generated by gamma radiation, had been the prominent

rice variety in American California state until the late 1970. ‘Calrose 76’ is about 25

cm shorter and has higher potential yield than its parent line ‘Calrose’ (Monna, 2002;

Rutger et al., 1977). Meanwhile, artificial mutants have also been widely used to

characterize the genetic basis of functional genes controlling complex agronomic

traits with great importance in crops (Abe et al., 2012; Arite et al., 2007; Arite et al.,

2009; Ishikawa et al., 2005; Yan et al., 2007). For example, it was demonstrated that

the phenotype changing from ‘Calrose’ to ‘Calrose 76’ is due to a 1-bp substitution

induced by gamma radiation in the exon 2 of the rice ‘green revolution gene’ sd-1,

which encodes a gibberellin 20-oxidase (Monna, 2002). Although mutants are widely

used both in breeding and genetic research, the molecular features and underlying

mechanism of artificial mutations remain unclear.

The most widely used mutant developing techniques include insertion mutagenesis,

chemical mutagenesis and physical mutagenesis (Wang et al., 2013). Different

mutagenesis technologies are featured by diverse mechanisms and mutation patterns.

Comprehensive analysis of all flanking sequence tags isolated from a T-DNA

117

insertion library including 31,443 independent transformants showed that T-DNA

insertions were non-random distribution in chromosomes and happened more

frequently in the distal ends and less in the centromeric regions (Wu et al. , 2003;

Zhang et al. , 2007; Zhang et al. , 2006a). Compared with insertion mutagenesis,

chemical mutagenesis such as Ethyl Methane-Sulfonate (EMS) can induce high-

density mutations with random distribution (Wang et al., 2013). It was suggested that

EMS cause mispairing by introducing chemical modification of nucleotides and

mostly results in transition of C/G to T/A via C-to-T changing (Greene et al., 2003;

Kim et al., 2006; Koornneeff et al., 1982). It was also proved that EMS is able to

result in base-pair deletion or insertion as well as chromosome fracture (Sega, 1984).

Ionizing radiation was reported to be prone to cause chromosome alterations such as

deletions and inversions, which is the consequence of repair of ionizing radiation-

induced damage (Sankaranarayanan, 1991; Shirley et al. , 1992). Recently, one

research was conducted for six rice mutants induced by gamma radiation using

whole genome resequencing and no specific feature except mutations were almost

evenly distributed in each chromosome (Li et al., 2016c). However, the features of

mutations induced by gamma radiation in barley have not been characterized yet.

With the rapid advance of technologies in DNA sequencing and molecular biology, it

was possible to investigate all the types of mutations within whole genome level (Li

et al., 2016a; Li et al., 2017; Nordstrom et al., 2013a). It has been proved to be one of

the most efficient approaches to identify mutant genes by integrating bulk

segregation analysis and Restriction site Associated DNA Sequencing (RAD-Seq)

(Nakata et al., 2018; Takagi et al. , 2013a; Takagi et al., 2013b). Whole genome

sequencing and comparative analysis was also attempted to identify the casual genes

118

in mutants (Nordstrom et al., 2013b) or characterize all the mutations in a large

mutant library (Li et al., 2016a; Li et al., 2017). For species with large genome size

like barley (4.83 Gb) (Beier et al. , 2017; Mascher et al., 2017), performing whole

genome sequencing is still beyond the affordability of most researchers. In contrast,

RNAseq is more effective than whole genome sequencing to investigate genetic

variants in transcription regions, which are more likely to be causal for the change of

phenotype.

In this study, transcriptome sequencing was adopted to identify all the genetic

mutations in the transcription regions induced by gamma ray radiation in a mutant

Vla-MT derived from a barley cultivar Vlamingh. We then anchored these gamma

radiation mutations into annotated gene models and evaluated the genetic effects of

them on the gene coding products. We also investigated the pattern of the

distribution along chromosomes and base changes of these genetic mutations.

Finally, the potential functions of genes with coding product changes were predicted

based on protein sequence similarity. Overall, this study will provide insight for

comprehensive understanding the mechanism of mutation induced by gamma

radiation in plants and further accelerate effective application of radiation mutagen in

exploiting important genetic resource and improving crop breeding.

119

6.2 Methods

Plant materials and transcriptome sequencing

Vlamingh (Vla-WT) is a malt barely variety with high yield released from

Department of Primary Industry and Regional Development Western Australia. Two

kilogram of Vla-WT seeds were treated with 200 Gry 60

Co gamma radiation. Vla-

MT was one mutant selected from about 10,000 M2 individuals. Compared with its

wild type, Vla-MT is dwarfed and has more tillers. Vla-MT and the wild type

Vlamingh were grown in the Murdoch university Glasshouse. Total RNA of two

replicates of Vla-MT and Vlamingh were extracted from leaf at the six-week-old

seeding stage. RNAseq library was prepared following the protocol of TruSeq Pre-

Kit (Illumina) and fragments with length of 250-500 bp were selected and sequenced

on the HISeq 2000 platform in Beijing Genomics institute.

Reads mapping and SNP/InDel calling

Raw reads (90 bp paired-end) produced by the sequencer were strictly filtered to

remove reads with low quality using Sickle (NA and JN, 2011) (version 1.33). The

quality of cleaned data was assessed using FASTQC toolkit (Andrews). Clean reads

of each sample were mapped to the latest barley reference genome (Mascher et al. ,

2017) using the aligner HISAT (Kim et al. , 2015) (version 2.0.5). PCR duplications

were filtered using SAMtools (Li et al., 2009) (version 1.4.1). Only reads with

unique mapping position in reference were proceeded to calling SNPs and InDels for

each sample. This step was performed using the pipeline consisting of SAMtools and

BCFtools in SAMtools package.

120

Eventually, SNPs and InDels were filtered with following criterial: ⑴ all variants

within the low-complexity regions (LCRs) were removed; ⑵ all variants with calling

quality less than 50 and mapping quality less than 40 were all removed; ⑶ if the read

number supporting non-reference allele is bigger than five and its percentage larger

than 0.2, these variant sites will be retained. Otherwise, the variant site will be

filtered out. ⑷ any SNPs within 15 bp flanking region of an InDel will be filtered

out.

Genetic effects assessment and function annotation for mutations

Genetic effect of each mutation was predicted based on their position in gene models

and the changing of coding product using the program of SnpEff. Then those genes

with significant genetic effect were chosen to perform function and pathway

annotation. This step was achieved using AutoFACT program based on the

homology of their encoding protein with well researched genes and proteins stored in

databases of NR, Swiss-Prot, KEGG, COG and GO.

121

6.3 Results

6.3.1 SNPs and InDels screening through transcriptome

sequencing

To understand the feature of genetic mutations induced by gamma ray radiation, a

barley cultivar Vlamingh was irradiated with 200 Gy gamma rays. One dwarf mutant

(Vla-MT) was selected (Figure 6-1a). In this study, our focus are genetic mutations

in coding regions, because these mutations are more likely to cause phenotype

changes. Transcriptome sequencing strategy was adopted and performed for both

wild type Vlamingh (Vla-WT) and a dwarf mutant Vla-MT to identify genetic

mutations in coding regions caused by gamma ray radiation. Two biological

replicates were conducted for both Vla-MT and Vla-WT to increase the accuracy of

transcriptome sequencing and variant detecting. Raw data out from the sequencer of

each sample was strictly filtered and qualified to make it reliable for further

bioinformatic processing. As a result, about 4.67 Gb (on average) clean data with

high quality was generated for each sample. Then, the clean reads of each sample

were aligned to the barley reference genome with the annotated high confidence

genes as guidance. Eventually, about 95% of clean reads of each sample were

mapped to the barley genome and around 89% out of them had unique alignment

positions in barley chromosomes.

122

Figure 6-1 Phenotype of Vla-WT and Vla-MT and distribution of mutations in

chromosome 5 and chromosome 7.

a, overview of the phenotype of wild type Vla-WT (right) and mutant type Vla-MT

(left). b, distribution of detected mutations in chromosome 5 and 7. X-axis indicates

the physical coordination in chromosome 5 and Y-axis indicates the mutation

numbers for each 10 Mb regions.

In the process of SNPs and short InDels calling, only those uniquely aligned reads

were used to increase accuracy. In theory, the variants relative to the barley reference

in two repeats of each sample should be the same if the sequencing errors and

negative result could be avoided. To our surprise, the concordance rate of SNP and

InDel calling is 99.98% (45,637 out of 45,645) for two repeats of Vla-MT and

99.99% (46,582 out of 46585) for two repeats of Vla-WT. This means that the

potential errors induced in the process of sequencing and variants calling have been

properly controlled (to 0.02% - 0.01%), and the quality of detected variants are

123

reliable to further investigate the genetic mutations induced by ga mma ray radiation

in the mutant.

6.3.2 Genetic mutations induced by gamma ray radiation

Vla-MT is derived from Vla-WT by only exposing it to gamma ray radiation. So, all

the sequence difference between Vla-MT and Vla-WT are assumed to be genetic

mutations induced by gamma ray radiation. We aligned the transcriptomic sequences

from both Vla-MT and Vla-WT to the Morex reference genome sequence. It was

found that 97 % (42,686 out of 43,879) variants were identical between Vla-MT and

Vla-WT, which reflected the difference between Morex and Vlamingh. There are

1,193 variants consistently detected in two repeats of Vla-MT and Vla-WT, which

are assumed to be the genetic mutations induced by gamma ray radiation.

It was found that the 1,193 genetic mutations induced by gamma ray radiation are not

evenly distributed in each chromosome (Table 6-1). Interestingly, they are mainly

located in chromosomes 5H and 7H, except three mutations were detected in

chromosome 1H and only one in each of 2H, 4H and 6H. Chromosome 3H is the

only one where no mutation was identified. We have also found that genetic

mutations in chromosome 5H and 7H are only concentrated in several certain regions

(Figure 6-1b). This phenomenon is particularly obvious on chromosome 5H, in

which 550 genetic mutations are in 40 Mb genome regions (chr5H:590 Mb-630 Mb).

On chromosome 7H, they are not as concentrated as in 5H, but they are still in a

relative continuous region (Chr7H: 230 Mb-520 Mb).

124

Table 6-1 Genetic mutation counts in each chromosome.

Chromosome Length (Mb) Variants

1H 558.54 3

2H 768.08 1

3H 699.71 0

4H 647.06 1

5H 670.03 550

6H 583.38 1

7H 657.22 612

Un 249.77 25

Total 4,833.79 1,193

Among the 1,193 mutations induced by gamma ray radiation, 96.65% of them are

SNPs, except for 26 short deletion and 14 short insertions. The exact number of base

changes are given in table 2. There were four dominant changes: AG, CT,

GA, TC. Both Adenine (A) and Guanine (G) are pure bases with only two

different chemical groups in the side chain.

6.3.3 Genes affected by mutations

We also investigated which genes these genetic mutations are located in and which

regions they fell into by comparing their physical positions with the predicted gene

models. In theory, all of them should be anchored to transcription regions because

only mRNA fragments were selected to conduct sequencing in this study. It is

125

interesting to notice that there were still 14% of them mapped to the intergenic

regions and 6% of them to intron regions (Figure 6-2a). We speculated that there

should be some potential gene regions having not been included into the gene

annotations or alternative splicing events exist between different certain germplasm

accessions. Apart from these part mutations, 33% of all the detected genetic

mutations are anchored into gene exon regions and others are fell into upstream, 5’-

UTR, 3’-UTR and downstream regions, respectively (Figure 6-2a). Meanwhile, we

also predicted the genetic effects caused by genetic mutations anchored in gene

regions based on changing of coding product. As is shown in right pie chart of the

Figure 6-2a, there are 209 missense mutations, three nonsense mutations, two

frameshift mutations and one disruptive frameshift mutations. The others are

synonymous mutation, which are assumed to cause less effects to the coding

products.

126

Figure 6-2 Genetic effects of mutations and biological function of genes with mutant

genes.

a, genetic effect prediction of all detected mutations: left pie shows the percent of

mutations anchored into different types of genomic regions and right shows the

percent of genetic effects of mutations anchored in exon regions. b, biological

function classification of effected genes in cellular component, molecular function

127

and biological function levels. X-axis indicates the different types of functions and

Y-axis indicates the number and percent of genes fell into the classification of each

biological function description.

The 215 genetic mutations including missense, nonsense, frameshift, and disruptive

frameshift are assumed to be mutations with high level of risks, which are within

exon region and change the coding products. These 215 mutations are harboured in

140 genes. According to the protein sequence similarity with other well researched

proteins, these 140 genes are annotated with gene ontology. Their functions are

classified into three levels including biological process, molecular function, and

cellular component (Figure 6-2b). In biological process level, cellular process and

metabolic process are the two dominating parts of the high-level mutation genes,

which are followed by the cellular localization and biological regulation. In

molecular function level, binding and catalytic are the two main types for the

mutation genes.

128

6.4 Discussion

Most of previous reported mutants have been reported to impact qualitative traits

such as single tiller mutant (Li et al., 2003) and never flowering mutant (Wu et al. ,

2008) in rice and six-rowed mutant in barley (Komatsuda et al., 2007). In this

chapter, we analysed the mutations in coding regions in one mutant induced by

gamma radiation. Vla-MT, featured by increased tillers and reduced plant height

(Figure 6-1a), was initially selected as qualitative mutant. In this study, we found

that there are up to 1,193 genetic mutations detected merely in gene transcription

regions of the barley mutant generated by gamma ray radiation. And they are

concentrated in two major chromosomes rather than evenly distributed in whole

genome. Our result also suggested that even the genetic mutations in the two major

chromosomes are only within several small fragments of genomic regions. It was

assumed that the two regions with concentrated mutations may be due to the

segregation of the heterozygous regions in the wild type of Vlamingh. But this

possibility has been excluded by investigating the variants result in the wild type

Vlamingh using whole genome shotgun sequencing data with high coverage.

Another limitation is that we only surveyed the mutation pattern in one gamma

radiation mutant. The current result needs to further validate by an enlarged mutant

population. The causal gene responsible for the increased tiller number and reduced

plant height has been isolated by map-based approach and confirmed in the present

study (data not shown). The result of this study suggests that a mutant selected from

gamma radiation may also exist wider genetic variat ions for other genes and traits. In

other words, it is possible to improve quantitative traits from gamma-radiation, as up

129

to 140 genes were modified from a single mutant in the present study.Except for 26

short deletion and 14 short insertions, SNP mutation accounted for about 96.65% of

all the identified mutations. For SNP mutations, AG, CT, GA and TC are

the four major types. It was assumed that the chemical groups in the side chain prone

to be damaged by gamma ray radiation. Then, they are going to be repaired by the

DNA repair mechanism (Collins, 2004; Friedberg, 2003; Rastogi et al., 2010). Some

of them can be repaired and recovered to its original chemical structure, and the

others prone to be repaired and changed to their similar chemical structures

(Friedberg, 2003). The situation is assumed to be the same for Cytosine (C) and

Thymine (T), because they share the same monocycle and differ in two chemical

groups of the side chain.

Among the 1,193 genetic mutations, 450 out of them are anchored in gene exon

regions and 215 of them are predicted to be genetic mutations with high-level of

genetic effect, causing disruptive in-frameshift, frameshift, nonsense, and missense

mutations. Eventually, change of coding products was predicted for mutations in 140

genes due to the gamma radiation. This result is contrast with traditional views that

only a few of genes are affected and responsible for the mutant phenotype (Arite et

al., 2009; Monna, 2002; Yan et al., 2007).

Although whole genome resequencing has been used to identify mutations and

isolate mutant genes in species with simple complexity of genomes in recent years

(Li et al. , 2016a; Li et al. , 2017), it is not affordable for most researchers without

abundant funding. To achieve reliable results, the minimum depth of genome

sequence coverage was suggested as 30 times of the genome size (Sims et al. , 2014).

130

This means that it would need about 150 Gb data for each sample to investigate

mutation in whole genome for barley (Beier et al., 2017; Mascher et al. , 2017). As

most variants related to gene function-loss were reported to be in gene coding

sequence, it should be more efficient to identify the functional mutants by RNA

sequencing. In this study, an average size of 4.6 Gb RNAseq data was obtained for

each sample. This equals to about 78.41 folds of the total length of barley high

confident gene transcripts. Thus, the RNAseq is 33 times more efficient than the

whole genome sequencing. Meanwhile, RNAseq data can also be used to analysis

differential expression genes between mutant and wild plant, which was critical to

understand the genetic and biological basis of mutant genes. Besides, novel

transcripts can be detected to improve the gene annotation of genome. However,

detecting variants by RNA sequencing largely depends on the expression of potential

genes, which means variants in the genes which do not express at the sample

collection stage will be ignored. Exome capture sequencing is another option

targeting coding regions (Warr et al., 2015). There is one barley exome capture

platform developed to enrich exon regions in barley genome (Mascher et al., 2013b).

This exome capture platform is based on in-solution hybridization technology, which

could be used to enrich 61.6 Mb coding regions of barley genome (Mascher et al.,

2013b). A collection of 267 georeferenced landrace barleys and wild barleys were

sequenced using this exome capture platform (Russell et al., 2016). Both RNA

sequencing and exome capture sequencing largely reduced the genomic complexity

and sequencing cost. The disadvantage of transcriptome sequencing and exome

sequencing is that some variants located in intron region and noncoding region may

be missed.

131

Over the past several decades, map-based cloning has played a critical role in

identifying genes related to important agricultural traits in most crop species and

understanding the genetic basis of plant development (Fan et al., 2006; Li et al.,

2011; Li et al., 2014b; Xue et al., 2008). As molecular marker density has increased

dramatically with the application of genotyping-by-sequencing, the largest challenge

for map-based cloning is population size, which need to be large enough to produce

sufficient recombination events (Wang et al., 2011). However, developing a large

population to produce sufficient recombination events is laborious and expensive and

it is a challenge for species with large genome size such as barley. The present study

through combination of induced mutation and RNA sequencing provided one

important complementary strategy to classical biparental cross-mapping identifying

function genes.

132

Chapter 7 Characterization of a barley mutant

with high-tillering and narrow leaf

Abstract

Tiller number is one of most important agronomic traits and determines the grain

yield in cereal crops. Several key genes controlling the tiller development have been

isolated in rice and Arabidopsis in the past decade. However, the genetic basis of

tiller development is still not clear. In this study, we attempt to narrow down the

candidate genes responsible for increased tiller number by comparing genomic

variants in the target region between a pair of near-isogenic lines. The target region

delimited by two SNP markers covers about 10 Mb and contains 139 high-

confidence genes (Druka et al., 2011). In total, 225 variants were detected in the

target region between the near-isogenic lines after transcriptome sequencing. Four

genes in the target region contain frameshift mutations and eleven genes contain

missense mutations. These fifteen genes were assumed to be candidate genes

responsible for the mutant phenotype. This study provides new insights to accelerate

the isolation of functional QTLs by combining traditional linkage mapping and

transcriptome sequencing.

133

7.1 Introduction

Tiller number is one of the three major components of grain yield in cereal crops.

The development of tiller involves two stages: formation of dormant axillary buds

and outgrowth. The final number of tillers is co-defined by those two processes.

Cytokinin, auxin and strigolactone are three plant hormones participating in the

regulation of bud outgrowth (Domagalska and Leyser, 2011). Cytokinin is an

activator of dormant axillary buds, which is mostly synthesized in root and

transported upwards in the xylem (Tanaka et al., 2006). Auxin is assumed to

indirectly inhibit the outgrowth of axillary, because it does not enter the buds

(Bonner, 1933; Zhao, 2010). Auxin is mostly synthesized at the shoot apex and

transported downward through polar auxin transport stream. Strigolactone supresses

the activation of dormant axillary buds, which can be produced by both the shoots

and the roots (Gomez-Roldan et al., 2008; Lin et al., 2009).

In past decade, several genes have been identified in the pathway of Strigolactone

(SL) regulating tiller development in rice and Arabidopsis. For example, D27 (Lin et

al., 2009), HTD1 (Zou et al., 2006), and D10 (Arite et al., 2007) are required for

synthesizing SLs. D27 mainly expresses in vascular cells of roots and shoot and

encodes an iron-containing protein. Mutant d27 exhibits increased tillers and dwarf

phenotype in rice, which could be rescued by exogenous SLs (Lin et al., 2009).

HTD1 is a rice orthologue of MAX3 in Arabidopsis, which encodes a carotenoid

cleavage dioxygenase, which is required for synthesizing SLs (Booker et al., 2004;

Zou et al. , 2006). Mutant htd1 exhibits high number of tillers and reduced plant

height, which could be rescued by the introduction of Pro35S:HTD1. D10 is a rice

134

orthologue of MAX4 in Arabidopsis encoding a carotenoid cleavage dioxygenase 8.

D14 (Nakamura et al. , 2013), D3 (Ishikawa et al., 2005), D53 (Jiang et al. , 2013;

Zhou et al. , 2013), and IPA1(Jiao et al. , 2010; Miura et al. , 2010) participate in the

process of SLs signalling. D14 encodes a α/β-hydrolase protein, which participates in

the process of SLs perception (Chevalier et al. , 2014; Nakamura et al., 2013). D3 is a

rice orthologue of MAX2 in Arabidopsis, which encodes an F-box leucine-trich

repeat (Ishikawa et al., 2005). Mutant d53/e9 exhibits increased tiller and dwarf

phenotype and is resistant to the treatment of exogenous SLs (Zhou et al. , 2013).

Knocking down the expression of D53 could rescue the mutant phenotype in d14 and

d3, while overexpression has no effect on the phenotype. Meanwhile, degeneration

of D53 could be detected in the mutant d27 with the treatment of exogenous SLs, but

not in d3 or d14. So Zhou et al. (2013) and Jiang et al. (2013) concluded that D53

should be the downstream of D14 and D3. IPA1 reduces rice tiller number and

encodes SOUAMOSA PROMOTER BINDING PROTEIN-LIKE 14 (OsSPL14)

(Jiao et al., 2010; Miura et al., 2010). Song et al. (2017) demonstrated that IPA1 is a

direct downstream target of D53 and D53 negatively regulates the expression of

IPA1. However, little is known about the genetic basis and molecular mechanism of

tiller development in barley.

In this study, we aim to narrow down candidate genes responsible for the mutant

phenotype in a high-tillering mutant using transcriptome sequencing.

135

7.2 Methods

Materials

BW370 was obtained from NordGen genebank (http://www.nordgen.org/, stock

number BW370 or BGS-73) (Druka et al., 2011). BW370 is a near isogenic line

(NIL) of Bowman with a fragment in chromosome 2H from an X-ray induced mutant

of cv. Proctor. The generation of material BW370 was shown in Figure 7-1. Briefly,

BW370 was generated by six times repeated backcross using Bowman as recurrent

parent with the Proctor mutant as donor parent. BW370 is featured by narrower leaf,

higher number of tiller and grass

http://www.nordgen.org/

136

Figure 7-1 The generation of BW370.

137

Transcriptome sequencing

Total RNA of two replicates of BW370 were extracted from leaf four weeks after

sowing. RNAseq library was prepared following the protocol of TruSeq Pre-Kit

(Illumina) and fragments with length of 250-500 bp were selected and sequenced on

the HISeq 2000 platform in Beijing Genomics Institute-Shenzhen. RNA sequencing

generated 35,173,858 and 34,832,902 clean reads with read length of 90 bp for two

replicates of BW370. Whole genome sequencing data of Bowman was downloaded

from EMBL/ENA with the accession ID of ERP001449, which was produced by

International Barley Genome sequence consortium in 2012 (International Barley

Genome Sequencing et al., 2012).

Reads mapping and variants detection

Raw reads (90 bp paired-end) produced by the sequencer were proceeded to filter

and trim reads with poor base quality and sequencing adapter contamination using

cutadapt. The quality of cleaned data was assessed using FASTQC toolkit

(Andrews). Clean reads of each sample were mapped to the latest barley reference

genome (Mascher et al., 2017) using STAR (version 020201) with default arguments

(Dobin et al. , 2013). PCR duplications were filtered using SAMtools (Li et al. , 2009)

(version 1.4.1). Only reads with unique position in reference were preceded to calling

SNPs and InDels for each sample. This step was performed using the pipeline

consisting of SAMtools and BCFtools in SAMtools package.

Genetic effects prediction and gene function annotation

138

Genetic effect of detected variants were assessed using SnpEff (Cingolani et al. ,

2012) based on the relative position of the variants in the annotated gene model and

how they change the coding products. Gene function annotation was performed using

AutoFACT based on their sequence homology with well characterized genes stored

in database of NR, Swiss-Prot, KEGG, COG and GO.

139

7.3 Result

7.3.1 Phenotype of BW370 and Bowman

BW370 is a Near Isogenic Line (NIL) of barley cultivar Bowman. It was developed

by crossing an X-ray induced mutant of Proctor with bowman and the offspring with

mutant phenotype was backcrossed five times with cv Bowman as recurrent parent

(Druka et al., 2011). As shown in the Figure 7-2a, BW370 is featured by narrower

leaf and more tillers than Bowman. The photo was taken four weeks after sowing. At

that stage, the leaf length was about 18 cm in BW370 and about 25 cm in Bowman

(Figure 7-2b). The blade width ranged from 0.40 to 0.60 cm in BW370, while it

ranged from 0.8 to 1.0 cm in Bowman (Figure 7-2c). The average length of leaves

was about 19 cm in BW370 and 21 cm in Bowman. The morphological feature of

Vla-MT and its wild type cv. Vlamingh was shown in Figure 6-1a in previous

chapter. Briefly, Vla-MT is also featured by dwarf habit with narrow leaf and high

number of tiller.

140

Figure 7-2 Phenotype of BW370 and Bowman.

a, Photos of mutant BW370 and wild plant Bowman; b, photos of the flag leaf of

wild plant Bowman and BW370; c, box plot of leaf width of BW370 and Bowman;

d, box plot of leaf length of BW370 and Bowman.

141

7.3.2 Narrow down candidate genes by RNAseq

According to previous study, the target region of the mutant gene in BW370 was

reported to be delimited by SNP marker 1_0214 and 1_1250 (Druka et al. , 2011). It

was based on genotype comparison of 3,072 SNPs between BW370 and its recurrent

parent Bowman. After anchoring the sequences of 1_0214 and 1_1250 to barley

genome reference, their physical positions were chr2H:672,009,517 and

chr2H:682,243,690. The candidate region covers 10,234,173 bp. In this candidate

region, there are 139 annotated high-confidence genes.

In the target region (about 10 Mb), 225 variants were commonly detected in the two

replicates of BW370 compared with cv. Bowman. The variants include 196 SNPs, 15

insertions and 14 deletions. Effects of the variants were predicted based on their

position in the gene models and change of the coding products. Frameshift and

missense are two types of variants which change the coding product. Four genes

were with frameshift mutations and eleven genes with missense mutations (Table

7-1).

Table 7-1 Candidate genes and function description

No. Gene Chr Start End

Function

Description

frame

shift

missens

e

1

HORVU2Hr1

G096070.9

chr2H

6728951

70

6728989

32

CYCLIN B2;4 2 0

142


Function

Description

frame

shift

missens

e

2

HORVU2Hr1

G096080.12

chr2H

6729295

72

6729344

77

Transmembrane

protein 184C

1 0

3 HORVU2Hr1

G096250.4

chr2H 6735291

37

6735312

88

L-ascorbate

oxidase homolog

1 0

4

HORVU2Hr1

G096300.1

chr2H

6738486

12

6738567

34

Lysine-specific

histone

demethylase 1

homolog 3

1 0

5

HORVU2Hr1

G096360.13 chr2H

6741562

25

6741581

28

Aquaporin-like

superfamily protein

0 2

6 HORVU2Hr1

G096820.1

chr2H 6769408

21

6769417

24

Cactin 0 2

7

HORVU2Hr1

G096960.3

chr2H

6771914

75

6771991

59

glutathione

peroxidase 6

0 1

8

HORVU2Hr1

G097250.5

chr2H

6782345

39

6782538

90

serine

carboxypeptidase-

like 19

0 1

143


Function

Description

frame

shift

missens

e

9

HORVU2Hr1

G097270.14 chr2H

6782695

89

6782965

30

serine

carboxypeptidase-

like 19

0 1

10

HORVU2Hr1

G097380.3

chr2H

6786213

49

6786240

87

pectinesterase 11 0 1

11

HORVU2Hr1

G097800.8 chr2H

6804574

46

6804615

02

Cysteine and

histidine-rich

domain-containing

prot

0 1

12 HORVU2Hr1

G098010.1

chr2H 6816470

64

6816488

28

nudix hydrolase

homolog 8

0 1

13

HORVU2Hr1

G098020.29 chr2H

6817785

57

6817878

64

U3 small nucleolar

RNA-associated

protein 10

0 1

14

HORVU2Hr1

G098100.2

chr2H

6819241

54

6819270

69

Disease resistance

protein

0 1

144

7.4 Discussion

In this study, we adopted transcriptome sequencing to compare the sequence

difference of a target region with about 10 Mb between a pair of NILs. There are 109

high-confidence genes annotated in the target region. Transcriptome sequencing

reveal that four genes in the target region have frameshift mutations and eleven genes

have missense mutations. So, the eleven genes are more likely to be responsible for

the mutant phenotype in BW370. But more precise experiments are needed to

validate which exact gene causes the mutant phenotype and which pathway it plays

role in to regulate the tiller development in barley.

Our study demonstrated that combining the traditional gene mapping method

featured by biparental population and high throughput sequencing technology can

accelerate the identification of functional QTLs. The dramatic advance of DNA

sequencing makes it possible to detect all the sequence variants in whole genome in

single nucleotide resolution. Barley is diploid (2n=2x=14) with an estimated haploid

genome of 5.1 Gb, which is 1.5 times of human genome size and 13.4 times of rice

genome size. As a result, comparing sequence difference between individuals by

whole genome sequencing is quite expensive in barley. In contrast, RNA sequencing

is a good alternative to investigate sequence difference in barley. RNAseq focuses on

the interested region rather than whole genome, so it is an efficient strategy to

investigate genetic diversity only in gene coding regions.

Map-based cloning and genome-wide association studies are the two popular

approaches to identify functional genes in crops such rice (Huang et al. , 2010; Huang

et al., 2011b; Wang et al., 2011). Up to 2000 genes controlling important agronomic

145

traits have been cloning in rice, which provides vital gene resources for rational

breeding of competitive varieties in rice (Yao et al., 2018; Zeng et al., 2017).

However, the functional gene isolation in barley is challenged by the large genome

size. In species with large genome size like barley, gene mapping using mutant

approved to be a relatively efficient approach. Bowman mutant-containing isogenic

lines are the major mutant resource in barley (Druka et al., 2011). In recent years,

dozens of functional genes related to important agronomic traits have been isolated,

like VRS2 (Youssef et al., 2017), VRS3 (Bull et al. , 2017), Zeo (Houston et al., 2013).

Combining mutant resource with DNA high throughput sequencing technology

provide great convenience for functional gene mapping. Jiao et al. (2018) performed

whole genome sequencing to one mutant bulk and one wild type bulk and identified

MSD1 regulating pedicellate spikelet fertility sorghum. Given the 5.1 Gbs of barley

genome size, it remains expensive to perform whole genome sequencing for gene

mapping in barley. (Mascher et al., 2014) adopted exome capture sequencing

strategy to isolate a mutant gene responsible for the many-noded dwarf (mnd)

phenotype. This study set a good example to isolate mutant gene by high throughput

sequencing. In this chapter, we have detected all the mutations in gene coding

regions of the target fragment using RNA sequencing. However, we still cannot

pinpoint the causal gene for the mutant phenotype because the background mutation

cannot be excluded. Mutant gene mapping by RNA sequencing could only be used to

isolated genes which express in the sample collection stage. In future, an F2

segregation population is still required and used to exclude the background

mutations.

146

Reference

Abe, A., Kosugi, S., Yoshida, K., Natsume, S., Takagi, H., Kanzaki, H., Matsumura,

H., Yoshida, K., Mitsuoka, C., Tamiru, M., et al. (2012). Genome sequencing reveals

agronomically important loci in rice using MutMap. Nat Biotechnol 30, 174-178.

Ahloowalia, B.S., and Maluszynski, M. (2001). Induced mutations - A new paradigm

in plant breeding. Euphytica 118, 167-173.

Ahloowalia, B.S., Maluszynski, M., and Nichterlein, K. (2004). Global impact of

mutation-derived varieties. Euphytica 135, 187-204.

Alexandrov, N., Tai, S., Wang, W., Mansueto, L., Palis, K., Fuentes, R.R., Ulat, V.J.,

Chebotarov, D., Zhang, G., Li, Z., et al. (2015). SNP-Seek database of SNPs derived

from 3000 rice genomes. Nucleic Acids Res 43, D1023-1027.

Alter, S., Bader, K.C., Spannagl, M., Wang, Y., Bauer, E., Schon, C.C., and Mayer,

K.F. (2015). DroughtDB: an expert-curated compilation of plant drought stress genes

and their homologs in nine species. Database (Oxford) 2015, bav046.

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990). Basic

local alignment search tool. J Mol Biol 215, 403-410.

Andrews, S. FastQC: a quality control tool for high throughput sequence data.

Andrews, S. (2014). FastQC: a quality control tool for high throughput sequence

data. Version 0.11. 2. Babraham Institute, Cambridge, UK http://www bioinformatics

babraham ac uk/projects/fastqc.

http://www/

147

Arabidopsis Genome, I. (2000). Analysis of the genome sequence of the flowering

plant Arabidopsis thaliana. Nature 408, 796-815.

Arite, T., Iwata, H., Ohshima, K., Maekawa, M., Nakajima, M., Kojima, M.,

Sakakibara, H., and Kyozuka, J. (2007). DWARF10, an RMS1/MAX4/DAD1

ortholog, controls lateral bud outgrowth in rice. Plant J 51, 1019-1029.

Arite, T., Umehara, M., Ishikawa, S., Hanada, A., Maekawa, M., Yamaguchi, S., and

Kyozuka, J. (2009). d14, a strigolactone-insensitive mutant of rice, shows an

accelerated outgrowth of tillers. Plant Cell Physiol 50, 1416-1424.

Ariyadasa, R., Mascher, M., Nussbaumer, T., Schulte, D., Frenkel, Z., Poursarebani,

N., Zhou, R., Steuernagel, B., Gundlach, H., Taudien, S. , et al. (2014). A sequence-

ready physical map of barley anchored genetically by two million single-nucleotide

polymorphisms. Plant Physiol 164, 412-423.

Atwell, S., Huang, Y.S., Vilhjalmsson, B.J., Willems, G., Horton, M., Li, Y., Meng,

D., Platt, A., Tarone, A.M., Hu, T.T., et al. (2010). Genome-wide association study

of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627-631.

Badr, A., Muller, K., Schafer-Pregl, R., El Rabey, H., Effgen, S., Ibrahim, H.H.,

Pozzi, C., Rohde, W., and Salamini, F. (2000). On the origin and domestication

history of Barley (Hordeum vulgare). Mol Biol Evol 17, 499-510.

Bauer, E., Schmutzer, T., Barilar, I. , Mascher, M., Gundlach, H., Martis, M.M.,

Twardziok, S.O., Hackauf, B., Gordillo, A., Wilde, P., et al. (2017). Towards a

whole-genome sequence for rye (Secale cereale L.). Plant J 89, 853-869.

148

Beier, S., Himmelbach, A., Colmsee, C., Zhang, X.Q., Barrero, R.A., Zhang, Q., Li,

L., Bayer, M., Bolser, D., Taudien, S., et al. (2017). Construction of a map-based

reference genome sequence for barley, Hordeum vulgare L. Sci Data 4, 170044.

Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate - a

Practical and Powerful Approach to Multiple Testing. J R Stat Soc Series B Stat

Methodol 57, 289-300.

Bikard, D., Patel, D., Le Mette, C., Giorgi, V., Camilleri, C., Bennett, M.J., and

Loudet, O. (2009). Divergent evolution of duplicate genes leads to genetic

incompatibilities within A. thaliana. Science 323, 623-626.

Bird, C.E., Fernandez-Silva, I., Skillings, D.J., and Toonen, R.J. (2012). Sympatric

Speciation in the Post “Modern Synthesis” Era of Evolutionary Biology. Evol Biol

39, 158-180.

Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer

for Illumina sequence data. Bioinformatics 30, 2114-2120.

Bolnick, D.I., and Fitzpatrick, B.M. (2007). Sympatric Speciation: Models and

Empirical Evidence. Annu Rev Ecol Evol Syst 38, 459-487.

Bonner, J. (1933). Studies on the Growth Hormone of P lants: IV. On the Mechanism

of the Action. Proc Natl Acad Sci U S A 19, 717-719.

Booker, J., Auldridge, M., Wills, S., McCarty, D., Klee, H., and Leyser, O. (2004).

MAX3/CCD7 is a carotenoid cleavage dioxygenase required for the synthesis of a

novel plant signaling molecule. Curr Biol 14, 1232-1238.

149

Buckler, E.S., Holland, J.B., Bradbury, P.J., Acharya, C.B., Brown, P.J., Browne, C.,

Ersoz, E., Flint-Garcia, S., Garcia, A., Glaubitz, J.C., et al. (2009). The genetic

architecture of maize flowering time. Science 325, 714-718.

Bull, H., Casao, M.C., Zwirek, M., Flavell, A.J., Thomas, W.T.B., Guo, W., Zhang,

R., Rapazote-Flores, P., Kyriakidis, S., Russell, J., et al. (2017). Barley SIX-

ROWED SPIKE3 encodes a putative Jumonji C-type H3K9me2/me3 demethylase

that represses lateral spikelet fertility. Nat Commun 8, 936.

Burset, M., Seledtsov, I.A., and Solovyev, V.V. (2000). Analysis of canonical and

non-canonical splice sites in mammalian genomes. Nucleic Acids Res 28, 4364-

4375.

Buschges, R., Hollricher, K., Panstruga, R., Simons, G., Wolter, M., Frijters, A., van

Daelen, R., van der Lee, T., Diergaarde, P., Groenendijk, J., et al. (1997). The barley

Mlo gene: a novel control element of plant pathogen resistance. Cell 88, 695-705.

Carollo, V., Matthews, D.E., Lazo, G.R., Blake, T.K., Hummel, D.D., Lui, N., Hane,

D.L., and Anderson, O.D. (2005). GrainGenes 2.0. an improved resource for the

small-grains community. Plant Physiol 139, 643-651.

Chalhoub, B., Denoeud, F., Liu, S., Parkin, I.A., Tang, H., Wang, X., Chiquet, J.,

Belcram, H., Tong, C., Samans, B., et al. (2014). Plant genetics. Early allopolyploid

evolution in the post-Neolithic Brassica napus oilseed genome. Science 345, 950-

953.

Chen, J., Ding, J., Ouyang, Y., Du, H., Yang, J., Cheng, K., Zhao, J., Qiu, S., Zhang,

X., Yao, J., et al. (2008). A triallelic system of S5 is a major regulator of the

150

reproductive barrier and compatibility of indica-japonica hybrids in rice. Proc Natl

Acad Sci U S A 105, 11436-11441.

Chevalier, F., Nieminen, K., Sanchez-Ferrero, J.C., Rodriguez, M.L., Chagoyen, M.,

Hardtke, C.S., and Cubas, P. (2014). Strigolactone promotes degradation of

DWARF14, an alpha/beta hydrolase essential for strigolactone signaling in

Arabidopsis. Plant Cell 26, 1134-1150.

Cingolani, P., Platts, A., Wang le, L., Coon, M., Nguyen, T., Wang, L., Land, S.J.,

Lu, X., and Ruden, D.M. (2012). A program for annotating and predicting the effects

of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila

melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92.

Close, T.J., Bhat, P.R., Lonardi, S., Wu, Y., Rostoks, N., Ramsay, L., Druka, A.,

Stein, N., Svensson, J.T., Wanamaker, S., et al. (2009). Development and

implementation of high-throughput SNP genotyping in barley. BMC Genomics 10,

582.

Cockram, J., White, J., Zuluaga, D.L., Smith, D., Comadran, J., Macaulay, M., Luo,

Z., Kearsey, M.J., Werner, P., Harrap, D., et al. (2010). Genome-wide association

mapping to candidate polymorphism resolution in the unsequenced barley genome.

Proc Natl Acad Sci U S A 107, 21611-21616.

Collins, A.R. (2004). The comet assay for DNA damage and repair: principles,

applications, and limitations. Mol Biotechnol 26, 249-261.

151

Colmsee, C., Beier, S., Himmelbach, A., Schmutzer, T., Stein, N., Scholz, U., and

Mascher, M. (2015). BARLEX - the Barley Draft Genome Explorer. Mol Plant 8,

964-966.

Dai, F., Chen, Z.H., Wang, X., Li, Z., Jin, G., Wu, D., Cai, S., Wang, N., Wu, F.,

Nevo, E., et al. (2014). Transcriptome profiling reveals mosaic genomic origins of

modern cultivated barley. Proc Natl Acad Sci U S A 111, 13403-13408.

Dai, F., Nevo, E., Wu, D., Comadran, J., Zhou, M., Qiu, L., Chen, Z., Beiles, A.,

Chen, G., and Zhang, G. (2012). Tibet is one of the centers of domestication of

cultivated barley. Proc Natl Acad Sci U S A 109, 16969-16973.

Dai, F., Wang, X., Zhang, X.Q., Chen, Z., Nevo, E., Jin, G., Wu, D., Li, C., and

Zhang, G. (2018). Assembly and analysis of a qingke reference genome demonstrate

its close genetic relation to modern cultivated barley. Plant Biotechnol J 16, 760-770.

Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P.,

Chaisson, M., and Gingeras, T.R. (2013). STAR: ultrafast universal RNA-seq

aligner. Bioinformatics 29, 15-21.

Domagalska, M.A., and Leyser, O. (2011). Signal integration in the control of shoot

branching. Nat Rev Mol Cell Biol 12, 211-221.

Druka, A., Franckowiak, J., Lundqvist, U., Bonar, N., Alexander, J., Houston, K.,

Radovic, S., Shahinnia, F., Vendramin, V., Morgante, M., et al. (2011). Genetic

dissection of barley morphology and development. Plant Physiol 155, 617-627.

152

Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso, P ., Rank, D.,

Baybayan, P., Bettman, B., et al. (2009). Real-time DNA sequencing from single

polymerase molecules. Science 323, 133-138.

Fan, C., Xing, Y., Mao, H., Lu, T., Han, B., Xu, C., Li, X., and Zhang, Q. (2006).

GS3, a major QTL for grain length and weight and minor QTL for grain width and

thickness in rice, encodes a putative transmembrane protein. Theor Appl Genet 112,

1164-1171.

Fan, Y., Zhou, G., Shabala, S., Chen, Z.H., Cai, S., Li, C., and Zhou, M. (2016).

Genome-Wide Association Study Reveals a New QTL for Salinity Tolerance in

Barley (Hordeum vulgare L.). Front Plant Sci 7, 946.

Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum

likelihood approach. J Mol Evol 17, 368-376.

Friedberg, E.C. (2003). DNA damage and repair. Nature 421, 436-440.

Frommer, M., Mcdonald, L.E., Millar, D.S., Collis, C.M., Watt, F., Grigg, G.W.,

Molloy, P.L., and Paul, C.L. (1992). A Genomic Sequencing Protocol That Yields a

Positive Display of 5-Methylcytosine Residues in Individual DNA Strands. Proc Natl

Acad Sci U S A 89, 1827-1831.

Fuller, D.Q., and Allaby, R. (2009). Seed Dispersal and Crop Domestication:

Shattering, Germination and Seasonality in Evolution under Cultivation. In Fruit

Development and Seed Dispersal (Wiley-Blackwell), pp. 238-295.

153

Garrido-Cardenas, J.A., Garcia-Maroto, F., Alvarez-Bermejo, J.A., and Manzano-

Agugliaro, F. (2017). DNA Sequencing Sensors: An Overview. Sensors (Basel) 17.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S.,

Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software

development for computational biology and bioinformatics. Genome Biol 5, R80.

Gigon, A., Matos, A.R., Laffray, D., Zuily-Fodil, Y., and Pham-Thi, A.T. (2004).

Effect of drought stress on lipid metabolism in the leaves of Arabidopsis thaliana

(ecotype Columbia). Ann Bot 94, 345-351.

Godfray, H.C., Beddington, J.R., Crute, I.R., Haddad, L., Lawrence, D., Muir, J.F.,

Pretty, J., Robinson, S., Thomas, S.M., and Toulmin, C. (2010). Food security: the

challenge of feeding 9 billion people. Science 327, 812-818.

Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J.,

Sessions, A., Oeller, P., Varma, H., et al. (2002). A draft sequence of the rice

genome (Oryza sativa L. ssp. japonica). Science 296, 92-100.

Gomez-Roldan, V., Fermas, S., Brewer, P.B., Puech-Pages, V., Dun, E.A., Pillot,

J.P., Letisse, F., Matusova, R., Danoun, S., Portais, J.C., et al. (2008). Strigolactone

inhibition of shoot branching. Nature 455, 189-194.

Gonzalez-Guzman, M., Pizzio, G.A., Antoni, R., Vera-Sirera, F., Merilo, E., Bassel,

G.W., Fernandez, M.A., Holdsworth, M.J., Perez-Amador, M.A., Kollist, H., et al.

(2012). Arabidopsis PYR/PYL/RCAR receptors play a major role in quantitative

regulation of stomatal aperture and transcriptional response to abscisic acid. Plant

Cell 24, 2483-2496.

154

Graeber, K., Nakabayashi, K., Miatton, E., Leubner-Metzger, G., and Soppe, W.J.

(2012). Molecular mechanisms of seed dormancy. Plant Cell Environ 35, 1769-1786.

Greene, E.A., Codomo, C.A., Taylor, N.E., Henikoff, J.G., Till, B.J., Reynolds, S.H.,

Enns, L.C., Burtner, C., Johnson, J.E., Odden, A.R., et al. (2003). Spectrum of

chemically induced mutations from a large-scale reverse-genetic screen in

Arabidopsis. Genetics 164, 731-740.

Hadid, Y., Pavlíček, T., Beiles, A., Ianovici, R., Raz, S., and Nevo, E. (2014).

Sympatric incipient speciation of spiny mice Acomys at “Evolution Canyon,” Israel.


Hajjar, R., and Hodgkin, T. (2007). The use of wild relatives in crop improvement: a

survey of developments over the last 20 years. Euphytica 156, 1-13.

Han, F., Ullrich, S.E., Clancy, J.A., Jitkov, V., Kilian, A., and Romagosa, I. (1996).

Verification of barley seed dormancy loci via linked molecular markers. Theor Appl

Genet 92, 87-91.

Harlan, J.R., and Zohary, D. (1966). Distribution of wild wheats and barley. Science

153, 1074-1080.

Hinze, K., Thompson, R.D., Ritter, E., Salamini, F., and Schulze-Lefert, P. (1991).

Restriction fragment length polymorphism-mediated targeting of the ml-o resistance

locus in barley (Hordeum vulgare). Proc Natl Acad Sci U S A 88, 3691-3695.

Hofmann, N. (2014). Cryptochromes and seed dormancy: the molecular mechanism

of blue light inhibition of grain germination. Plant Cell 26, 846.

155

Holton, T.A. (2002). Identification and mapping of polymorphic SSR markers from

expressed gene sequences of barley and wheat. Mol Breed 9, 63-71.

Houston, K., McKim, S.M., Comadran, J., Bonar, N., Druka, I., Uzrek, N., Cirillo,

E., Guzy-Wrobelska, J., Collins, N.C., Halpin, C., et al. (2013). Variation in the

interaction between alleles of HvAPETALA2 and microRNA172 determines the

density of grains on the barley inflorescence. Proc Natl Acad Sci U S A 110, 16675-

16680.

Huang, S., Zhang, J., Li, R., Zhang, W., He, Z., Lam, T.W., Peng, Z., and Yiu, S.M.

(2011a). SOAPsplice: Genome-Wide ab initio Detection of Splice Junctions from

RNA-Seq Data. Front Genet 2, 46.

Huang, X., Kurata, N., Wei, X., Wang, Z.X., Wang, A., Zhao, Q., Zhao, Y., Liu, K.,

Lu, H., Li, W., et al. (2012a). A map of rice genome variation reveals the origin of

cultivated rice. Nature 490, 497-501.

Huang, X., Wei, X., Sang, T., Zhao, Q., Feng, Q., Zhao, Y., Li, C., Zhu, C., Lu, T.,

Zhang, Z., et al. (2010). Genome-wide association studies of 14 agronomic traits in

rice landraces. Nat Genet 42, 961-967.

Huang, X., Zhao, Y., Wei, X., Li, C., Wang, A., Zhao, Q., Li, W., Guo, Y., Deng, L.,

Zhu, C., et al. (2011b). Genome-wide association study of flowering time and grain

yield traits in a worldwide collection of rice germplasm. Nat Genet 44, 32-39.

Huang, X.H., Zhao, Y., Wei, X.H., Li, C.Y., Wang, A., Zhao, Q., Li, W.J., Guo,

Y.L., Deng, L.W., Zhu, C.R., et al. (2012b). Genome-wide association study of

156

flowering time and grain yield traits in a worldwide collection of rice germplasm.

Nat Genet 44, 32-U53.

Ibarra, S.E., Tognacca, R.S., Dave, A., Graham, I.A., Sanchez, R.A., and Botto, J.F.

(2016). Molecular mechanisms underlying the entrance in secondary dormancy of

Arabidopsis seeds. Plant Cell Environ 39, 213-221.

International Barley Genome Sequencing, C., Mayer, K.F., Waugh, R., Brown, J.W.,

Schulman, A., Langridge, P., Platzer, M., Fincher, G.B., Muehlbauer, G.J., Sato, K. ,

et al. (2012). A physical, genetic and functional sequence assembly of the barley

genome. Nature 491, 711-716.

International Rice Genome Sequencing, P. (2005). The map-based sequence of the

rice genome. Nature 436, 793-800.

International Wheat Genome Sequencing, C. (2014). A chromosome-based draft

sequence of the hexaploid bread wheat (Triticum aestivum) genome. Science 345,

1251788.

Ishikawa, S., Maekawa, M., Arite, T., Onishi, K., Takamure, I., and Kyozuka, J.

(2005). Suppression of tiller bud activity in tillering dwarf mutants of rice. Plant Cell

Physiol 46, 79-86.

Jaccoud, D., Peng, K., Feinstein, D., and Kilian, A. (2001). Diversity arrays: a solid

state technology for sequence information independent genotyping. Nucleic Acids

Res 29, E25.

157

Jia, Q.J., Tan, C., Wang, J.M., Zhang, X.Q., Zhu, J.H., Luo, H., Yang, J.M.,

Westcott, S., Broughton, S., Moody, D., et al. (2016). Marker development using

SLAF-seq and whole-genome shotgun strategy to fine-map the semi-dwarf gene ari-e

in barley. BMC Genomics 17, 911.

Jiang, L., Liu, X., Xiong, G., Liu, H., Chen, F., Wang, L., Meng, X., Liu, G., Yu, H.,

Yuan, Y., et al. (2013). DWARF 53 acts as a repressor of strigolactone signalling in

rice. Nature 504, 401-405.

Jiao, Y., Lee, Y.K., Gladman, N., Chopra, R., Christensen, S.A., Regulski, M.,

Burow, G., Hayes, C., Burke, J., Ware, D., et al. (2018). MSD1 regulates pedicellate

spikelet fertility in sorghum through the jasmonic acid pathway. Nat Commun 9,

822.

Jiao, Y., Peluso, P., Shi, J., Liang, T., Stitzer, M.C., Wang, B., Campbell, M.S.,

Stein, J.C., Wei, X., Chin, C.S., et al. (2017). Improved maize reference genome

with single-molecule technologies. Nature 546, 524-527.

Jiao, Y., Wang, Y., Xue, D., Wang, J., Yan, M., Liu, G., Dong, G., Zeng, D., Lu, Z.,

Zhu, X., et al. (2010). Regulation of OsSPL14 by OsmiR156 defines ideal plant

architecture in rice. Nat Genet 42, 541-544.

Jin, F., Li, Y., Dixon, J.R., Selvaraj, S., Ye, Z., Lee, A.Y., Yen, C.A., Schmitt, A.D.,

Espinoza, C.A., and Ren, B. (2013). A high-resolution map of the three-dimensional

chromatin interactome in human cells. Nature 503, 290-294.

Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced aligner

with low memory requirements. Nat Methods 12, 357-360.

158

Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., and Salzberg, S.L. (2013).

TopHat2: accurate alignment of transcriptomes in the presence of insertions,

deletions and gene fusions. Genome Biol 14, R36.

Kim, K.N., Cheong, Y.H., Grant, J.J., Pandey, G.K., and Luan, S. (2003). CIPK3, a

calcium sensor-associated protein kinase that regulates abscisic acid and cold signal

transduction in Arabidopsis. Plant Cell 15, 411-423.

Kim, Y., Schumaker, K.S., and Zhu, J.-K. (2006). EMS mutagenesis of Arabidopsis.

In Arabidopsis Protocols (Springer), pp. 101-103.

Kleinhofs, A., Kilian, A., Saghai Maroof, M.A., Biyashev, R.M., Hayes, P., Chen,

F.Q., Lapitan, N., Fenwick, A., Blake, T.K., Kanazin, V. , et al. (1993). A molecular,

isozyme and morphological map of the barley (Hordeum vulgare) genome. Theor

Appl Genet 86, 705-712.

Komatsuda, T., and Mano, Y. (2002). Molecular mapping of the intermedium spike-c

( int-c) and non-brittle rachis 1 ( btr1) loci in barley ( Hordeum vulgare L.). Theor

Appl Genet 105, 85-90.

Komatsuda, T., Maxim, P., Senthil, N., and Mano, Y. (2004). High-density AFLP

map of nonbrittle rachis 1 (btr1) and 2 (btr2) genes in barley (Hordeum vulgare L.).

Theor Appl Genet 109, 986-995.

Komatsuda, T., Pourkheirandish, M., He, C., Azhaguvel, P., Kanamori, H., Perovic,

D., Stein, N., Graner, A., Wicker, T., Tagiri, A., et al. (2007). Six-rowed barley

originated from a mutation in a homeodomain-leucine zipper I-class homeobox gene.


159

Koornneeff, M., Dellaert, L., and Van der Veen, J. (1982). EMS-and relation-

induced mutation frequencies at individual loci in Arabidopsis thaliana (L.) Heynh.

Mutat Res 93, 109-123.

Koski, L.B., Gray, M.W., Lang, B.F., and Burger, G. (2005). AutoFACT: An Auto

matic F unctional A nnotation and C lassification T ool. BMC Bioinformatics 6, 151.

Li, G., Chern, M., Jain, R., Martin, J.A., Schackwitz, W.S., Jiang, L., Vega-Sanchez,

M.E., Lipzen, A.M., Barry, K.W., Schmutz, J., et al. (2016a). Genome-Wide

Sequencing of 41 Rice (Oryza sativa L.) Mutated Lines Reveals Diverse Mutations

Induced by Fast-Neutron Irradiation. Mol Plant 9, 1078-1081.

Li, G., Jain, R., Chern, M., Pham, N.T., Martin, J.A., Wei, T., Schackwitz, W.S.,

Lipzen, A.M., Duong, P.Q., Jones, K.C., et al. (2017). The Sequences of 1504

Mutants in the Model Rice Variety Kitaake Facilitate Rapid Functional Genomic

Studies. Plant Cell 29, 1218-1231.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with

BWA-MEM. arXiv preprint arXiv:13033997.

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-

coverage samples. Bioinformatics 30, 2843-2851.

Li, H., and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-

Wheeler transform. Bioinformatics 26, 589-595.

160

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G.,

Abecasis, G., Durbin, R., and Genome Project Data Processing, S. (2009). The

Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079.

Li, J.Y., Wang, J., and Zeigler, R.S. (2014a). The 3,000 rice genomes project: new

opportunities and challenges for future rice research. Gigascience 3, 8.

Li, K., Wang, H., Cai, Z., Wang, L., Xu, Q., Lovy, M., Wang, Z., and Nevo, E.

(2016b). Sympatric speciation of spiny mice, Acomys, unfolded transcriptomically at

Evolution Canyon, Israel. Proc Natl Acad Sci U S A 113, 8254-8259.

Li, Q., Shao, J., Tang, S., Shen, Q., Wang, T., Chen, W., and Hong, Y. (2015).

Corrigendum: Wrinkled1 Accelerates Flowering and Regulates Lipid Homeostasis

between Oil Accumulation and Membrane Lipid Anabolism in Brassica napus. Front

Plant Sci 6, 1270.

Li, S., Zheng, Y.C., Cui, H.R., Fu, H.W., Shu, Q.Y., and Huang, J.Z. (2016c).

Frequency and type of inheritable mutations induced by gamma rays in rice as

revealed by whole genome sequencing. J Zhejiang Univ Sci B 17, 905-915.

Li, X., Qian, Q., Fu, Z., Wang, Y., Xiong, G., Zeng, D., Wang, X., Liu, X., Teng, S.,

Hiroshi, F., et al. (2003). Control of tillering in rice. Nature 422, 618-621.

Li, Y., Fan, C., Xing, Y., Jiang, Y., Luo, L., Sun, L., Shao, D., Xu, C., Li, X., Xiao,

J., et al. (2011). Natural variation in GS5 plays an important role in regulating grain

size and yield in rice. Nat Genet 43, 1266-1269.

161

Li, Y., Fan, C., Xing, Y., Yun, P., Luo, L., Yan, B., Peng, B., Xie, W., Wang, G., Li,

X., et al. (2014b). Chalk5 encodes a vacuolar H(+)-translocating pyrophosphatase

influencing grain chalkiness in rice. Nat Genet 46, 398-404.

Lin, H., Wang, R., Qian, Q., Yan, M., Meng, X., Fu, Z., Yan, C., Jiang, B., Su, Z.,

Li, J., et al. (2009). DWARF27, an iron-containing protein required for the

biosynthesis of strigolactones, regulates rice tiller bud outgrowth. Plant Cell 21,

1512-1525.

Lin, T., Zhu, G., Zhang, J., Xu, X., Yu, Q., Zheng, Z., Zhang, Z., Lun, Y., Li, S.,

Wang, X., et al. (2014). Genomic analyses provide insights into the history of tomato

breeding. Nat Genet 46, 1220-1226.

Liu, Z.W., Biyashev, R.M., and Maroof, M.A. (1996). Development of simple

sequence repeat DNA markers and their integration into a barley linkage map. Theor

Appl Genet 93, 869-876.

Long, Y., Zhao, L., Niu, B., Su, J., Wu, H., Chen, Y., Zhang, Q., Guo, J., Zhuang, C.,

Mei, M., et al. (2008). Hybrid male sterility in rice controlled by interaction between

divergent alleles of two adjacent genes. Proc Natl Acad Sci U S A 105, 18871-

18876.

Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu,

Y., et al. (2012). SOAPdenovo2: an empirically improved memory-efficient short-

read de novo assembler. Gigascience 1, 18.

Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annu Rev

Genomics Hum Genet 9, 387-402.

162

Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput

sequencing reads. EMBnet J 17, 10-12.

Mascher, M., Gundlach, H., Himmelbach, A., Beier, S., Twardziok, S.O., Wicker, T.,

Radchuk, V., Dockter, C., Hedley, P.E., Russell, J., et al. (2017). A chromosome

conformation capture ordered sequence of the barley genome. Nature 544, 427-433.

Mascher, M., Jost, M., Kuon, J.E., Himmelbach, A., Assfalg, A., Beier, S., Scholz,

U., Graner, A., and Stein, N. (2014). Mapping-by-sequencing accelerates forward

genetics in barley. Genome Biol 15, R78.

Mascher, M., Muehlbauer, G.J., Rokhsar, D.S., Chapman, J., Schmutz, J., Barry, K.,

Munoz-Amatriain, M., Close, T.J., Wise, R.P., Schulman, A.H. , et al. (2013a).

Anchoring and ordering NGS contig assemblies by population sequencing

(POPSEQ). Plant J 76, 718-727.

Mascher, M., Richmond, T.A., Gerhardt, D.J., Himmelbach, A., Clissold, L.,

Sampath, D., Ayling, S., Steuernagel, B., Pfeifer, M., D'Ascenzo, M. , et al. (2013b).

Barley whole exome capture: a tool for genomic research in the genus Hordeum and

beyond. Plant J 76, 494-505.

Maxam, A.M., and Gilbert, W. (1977). A new method for sequencing DNA. Proc

Natl Acad Sci U S A 74, 560-564.

Miura, K., Ikeda, M., Matsubara, A., Song, X.J., Ito, M., Asano, K., Matsuoka, M.,

Kitano, H., and Ashikari, M. (2010). OsSPL14 promotes panicle branching and

higher grain productivity in rice. Nat Genet 42, 545-549.

163

Mizuta, Y., Harushima, Y., and Kurata, N. (2010). Rice pollen hybrid

incompatibility caused by reciprocal gene loss of duplicated genes. Proc Natl Acad

Sci U S A 107, 20417-20422.

Molina-Cano, J.L., Russell, J.R., Moralejo, M.A., Escacena, J.L., Arias, G., and

Powell, W. (2005). Chloroplast DNA microsatellite analysis supports a polyphyletic

origin for barley. Theor Appl Genet 110, 613-619.

Monna, L. (2002). Positional Cloning of Rice Semidwarfing Gene, sd-1: Rice "Green

Revolution Gene" Encodes a Mutant Enzyme Involved in Gibberellin Synthesis.

DNA Res 9, 11-17.

Morrell, P.L., and Clegg, M.T. (2007). Genetic evidence for a second domestication

of barley (Hordeum vulgare) east of the Fertile Crescent. Proc Natl Acad Sci U S A

104, 3289-3294.

Munoz-Amatriain, M., Lonardi, S., Luo, M., Madishetty, K., Svensson, J.T.,

Moscou, M.J., Wanamaker, S., Jiang, T., Kleinhofs, A., Muehlbauer, G.J., et al.

(2015). Sequencing of 15 622 gene-bearing BACs clarifies the gene-dense regions of

the barley genome. Plant J 84, 216-227.

NA, J., and JN, F. (2011). Sickle: A sliding-window, adaptive, quality-based

trimming tool for FastQ files.

Nakamura, H., Xue, Y.L., Miyakawa, T., Hou, F., Qin, H.M., Fukui, K., Shi, X., Ito,

E., Ito, S., Park, S.H., et al. (2013). Molecular mechanism of strigolactone perception

by DWARF14. Nat Commun 4, 2613.

164

Nakamura, S., Pourkheirandish, M., Morishige, H., Kubo, Y., Nakamura, M.,

Ichimura, K., Seo, S., Kanamori, H., Wu, J., Ando, T., et al. (2016). Mitogen-

Activated Protein Kinase Kinase 3 Regulates Seed Dormancy in Barley. Curr Biol

26, 775-781.

Nakata, M., Miyashita, T., Kimura, R., Nakata, Y., Takagi, H., Kuroda, M.,

Yamaguchi, T., Umemoto, T., and Yamakawa, H. (2018). MutMapPlus identified

novel mutant alleles of a rice starch branching enzyme IIb gene for fine-tuning of

cooked rice texture. Plant Biotechnol J 16, 111-123.

Nevo, E. (1995). Asian, African and European Biota Meet at 'Evolution Canyon'

Israel: Local Tests of Global Biodiversity and Genetic Diversity Patterns. Proc R Soc

Lond B Biol Sci 262, 149-155.

Nevo, E. (2006). Evolution canyon": A microcosm of life's evolution focusing on

adaptation and speciation. Isr J Ecol Evol 52, 485-506.

Nevo, E. (2012). "Evolution Canyon," a potential microscale monitor of global

warming across life. Proc Natl Acad Sci U S A 109, 2960-2965.

Nevo, E. (2013). Darwinian Evolution: Evolution in Action Across Life at

"Evolution Canyon", Israel. Isr J Ecol Evol 55, 215-225.

Nevo, E. (2014a). Evolution in action: adaptation and incipient sympatric speciation

with gene flow across life at “Evolution Canyon”, Israel. Isr J Ecol Evol 60, 85-98.

165

Nevo, E. (2014b). Evolution of wild barley at “Evolution Canyon”: adaptation,

speciation, pre-agricultural collection, and barley improvement. Isr J Plant Sci 62,

22-32.

Nevo, E., Shewry, P.R., and others (1992). Origin, evolution, population genetics and

resources for breeding of wild barley, Hordeum spontaneum, in the Fertile Crescent

(Wallingford, UK: CAB International).

Nordstrom, K.J., Albani, M.C., James, G.V., Gutjahr, C., Hartwig, B., Turck, F.,

Paszkowski, U., Coupland, G., and Schneeberger, K. (2013a). Mutation

identification by direct comparison of whole-genome sequencing data from mutant

and wild-type individuals using k-mers. Nat Biotechnol 31, 325-330.

Nordstrom, K.J.V., Albani, M.C., James, G.V., Gutjahr, C., Hartwig, B., Turck, F.,

Paszkowski, U., Coupland, G., and Schneeberger, K. (2013b). Mutation

identification by direct comparison of whole-genome sequencing data from mutant

and wild-type individuals using k-mers. Nat Biotechnol 31, 325-+.

Parnas, T. (2006). Evidence for Incipient Sympatric Speciation in Wild Barley

Hordeum Spontaneum, at" Evolution Canyon", Mount Carmel, Israel, Based on

Hybridization and Physiological and Genetic Diversity Estimates (University of

Haifa, Faculty of Science and Science Education, Department of Evolutionary and

Environmental Biology).

Pasam, R.K., Sharma, R., Malosetti, M., van Eeuwijk, F.A., Haseneyer, G., Kilian,

B., and Graner, A. (2012). Genome-wide association studies for agronomical traits in

a world wide spring barley collection. BMC Plant Biol 12, 16.

166

Paterson, A.H., Bowers, J.E., Bruggmann, R., Dubchak, I., Grimwood, J., Gundlach,

H., Haberer, G., Hellsten, U., Mitros, T., Poliakov, A., et al. (2009). The Sorghum

bicolor genome and the diversification of grasses. Nature 457, 551-556.

Pavlícek, T., Sharon, D., Kravchenko, V., Saaroni, H., and Nevo, E. (2003).

Microclimatic interslope differences underlying biodiversity contrasts in" Evolution

Canyon", Mt. Carmel, Israel. Isr J Earth Sci 52.

Pourkheirandish, M., Hensel, G., Kilian, B., Senthil, N., Chen, G., Sameri, M.,

Azhaguvel, P., Sakuma, S., Dhanagond, S., Sharma, R., et al. (2015). Evolution of

the Grain Dispersal System in Barley. Cell 162, 527-539.

Ramsay, L., Macaulay, M., degli Ivanissevich, S., MacLean, K., Cardle, L., Fuller,

J., Edwards, K.J., Tuvesson, S., Morgante, M., Massari, A., et al. (2000). A simple

sequence repeat-based linkage map of barley. Genetics 156, 1997-2005.

Rastogi, R.P., Richa, Kumar, A., Tyagi, M.B., and Sinha, R.P. (2010). Molecular

mechanisms of ultraviolet radiation-induced DNA damage and repair. J Nucleic

Acids 2010, 592980.

Ren, X., Wang, J., Liu, L., Sun, G., Li, C., Luo, H., and Sun, D. (2016). SNP -based

high density genetic map and mapping of btwd1 dwarfing gene in barley. Sci Rep 6,

31741.

Riboni, M., Galbiati, M., Tonelli, C., and Conti, L. (2013). GIGANTEA enables

drought escape response via abscisic acid-dependent activation of the florigens and

SUPPRESSOR OF OVEREXPRESSION OF CONSTANS. Plant Physiol 162, 1706-

1719.

167

Richards, J.K., Friesen, T.L., and Brueggeman, R.S. (2017). Association mapping

utilizing diverse barley lines reveals net form net blotch seedling

resistance/susceptibility loci. Theor Appl Genet 130, 915-927.

Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010). edgeR: a Bioconductor

package for differential expression analysis of digital gene expression data.

Bioinformatics 26, 139-140.

Robinson, M.D., and Oshlack, A. (2010). A scaling normalization method for

differential expression analysis of RNA-seq data. Genome Biol 11, R25.

Roy, J.K., Smith, K.P., Muehlbauer, G.J., Chao, S., C lose, T.J., and Steffenson, B.J.

(2010). Association mapping of spot blotch resistance in wild barley. Mol Breed 26,

243-256.

Russell, J., Mascher, M., Dawson, I.K., Kyriakidis, S., Calixto, C., Freund, F., Bayer,

M., Milne, I., Marshall-Griffiths, T., Heinen, S., et al. (2016). Exome sequencing of

geographically diverse barley landraces and wild relatives gives insights into

environmental adaptation. Nat Genet 48, 1024-1030.

Russell, J., van Zonneveld, M., Dawson, I.K., Booth, A., Waugh, R., and Steffenson ,

B. (2014). Genetic diversity and ecological niche modelling of wild barley: refugia,

large-scale post-LGM range expansion and limited mid-future climate threats? PLoS

One 9, e86021.

Rutger, J.N., Peterson, M.L., and Hu, C. (1977). Registration of Calrose 76 Rice1

(Reg. No. 45). Crop Sci 17, 978-978.

168

Sallam, A.H., Tyagi, P., Brown-Guedira, G., Muehlbauer, G.J., Hulse, A., and

Steffenson, B.J. (2017). Genome-Wide Association Mapping of Stem Rust

Resistance in Hordeum vulgare subsp. spontaneum. G3 (Bethesda) 7, 3491-3507.

Sanger, F., and Coulson, A.R. (1975). A rapid method for determining sequences in

DNA by primed synthesis with DNA polymerase. J Mol Biol 94, 441-448.

Sanger, F., Nicklen, S., and Coulson, A.R. (1977). DNA sequencing with chain-

terminating inhibitors. Proc Natl Acad Sci U S A 74, 5463-5467.

Sankaranarayanan, K. (1991). Ionizing radiation and genetic risks. III. Nature of

spontaneous and radiation-induced mutations in mammalian in vitro systems and

mechanisms of induction of mutations by radiation. Mutat Res 258, 75-97.

Sato, K., Yamane, M., Yamaji, N., Kanamori, H., Tagiri, A., Schwerdt, J.G., Fincher,

G.B., Matsumoto, T., Takeda, K., and Komatsuda, T. (2016). Alanine

aminotransferase controls seed dormancy in barley. Nat Commun 7, 11625.

Savolainen, V., Anstett, M.C., Lexer, C., Hutton, I., Clarkson, J.J., Norup, M.V.,

Powell, M.P., Springate, D., Salamin, N., and Baker, W.J. (2006). Sympatric

speciation in palms on an oceanic island. Nature 441, 210-213.

Schmutz, J. , Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L.,

Song, Q., Thelen, J.J., Cheng, J., et al. (2010). Erratum: Genome sequence of the

palaeopolyploid soybean. Nature 465, 120-120.

169

Schnable, P.S., Ware, D., Fulton, R.S., Stein, J.C., Wei, F., Pasternak, S., Liang, C.,

Zhang, J., Fulton, L., Graves, T.A., et al. (2009). The B73 maize genome:

complexity, diversity, and dynamics. Science 326, 1112-1115.

Seberg, O., and Petersen, G. (2009). A unified classification system for eukaryotic

transposable elements should reflect their phylogeny. Nat Rev Genet 10, 276.

Sega, G.A. (1984). A review of the genetic effects of ethyl methanesulfonate. Mutat

Res 134, 113-142.

Sherman, J.D., Fenwick, A.L., Namuth, D.M., and Lapitan, N.L. (1995). A barley

RFLP map: alignment of three barley maps and comparisons to Gramineae species.

Theor Appl Genet 91, 681-690.

Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and

Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids

Res 29, 308-311.

Shirley, B.W., Hanley, S., and Goodman, H.M. (1992). Effects of ionizing radiation

on a plant genome: analysis of two Arabidopsis transparent testa mutations. Plant

Cell 4, 333-347.

Sims, D., Sudbery, I., Ilott, N.E., Heger, A., and Ponting, C.P. (2014). Sequencing

depth and coverage: key considerations in genomic analyses. Nat Rev Genet 15, 121-

132.

Skinner, M.E., Uzilov, A.V., Stein, L.D., Mungall, C.J., and Holmes, I.H. (2009).

JBrowse: a next-generation genome browser. Genome Res 19, 1630-1638.

170

Song, X., Lu, Z., Yu, H., Shao, G., Xiong, J., Meng, X., Jing, Y., Liu, G., Xiong, G.,

Duan, J., et al. (2017). IPA1 functions as a downstream transcription factor repressed

by D53 in strigolactone signaling in rice. Cell Res 27, 1128-1141.

Strasburg, J.L., Sherman, N.A., Wright, K.M., Moyle, L.C., Willis, J.H., and

Rieseberg, L.H. (2012). What can patterns of differentiation across plant genomes

tell us about adaptation and speciation? Philos Trans R Soc Lond B Biol Sci 367,

364-373.

Struss, D., and Plieske, J. (1998). The use of microsatellite markers for detection of

genetic diversity in barley populations. Theor Appl Genet 97, 308-315.

Suprunova, T., Krugman, T., Fahima, T., Chen, G., Shams, I., Korol, A., and Nevo,

E. (2004). Differential expression of dehydrin (Dhn) in response to water stress in

resistant and sensitive wild barley (Hordeum spontaneum). Plant Cell Environ 27,

297-308.

Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura,

A., Utsushi, H., Tamiru, M., Takuno, S., et al. (2013a). QTL-seq: rapid mapping of

quantitative trait loci in rice by whole genome resequencing of DNA from two

bulked populations. Plant J 74, 174-183.

Takagi, H., Uemura, A., Yaegashi, H., Tamiru, M., Abe, A., Mitsuoka, C., Utsushi,

H., Natsume, S., Kanzaki, H., Matsumura, H., et al. (2013b). MutMap-Gap: whole-

genome resequencing of mutant F2 progeny bulk combined with de novo assembly

of gap regions identifies the rice blast resistance gene Pii. New Phytol 200, 276-283.

171

Takahashi, R., and Hayashi, J. (1964). Linkage study of two complementary genes

for brittle rachis in barley. Berichte des Ohara Instituts für Landwirtschaftliche

Biologie, Okayama Universität 12, 99-105.

Takeno, K. (2016). Stress-induced flowering: the third category of flowering

response. J Exp Bot 67, 4925-4934.

Tamura, K., Stecher, G., Peterson, D., Filipski, A., and Kumar, S. (2013). MEGA6:

Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30, 2725-

2729.

Tanaka, M., Takei, K., Kojima, M., Sakakibara, H., and Mori, H. (2006). Auxin

controls local cytokinin biosynthesis in the nodal stem in apical dominance. Plant J

45, 1028-1036.

Tester, M., and Langridge, P. (2010). Breeding technologies to increase crop

production in a changing world. Science 327, 818-822.

Thiel, T., Michalek, W., Varshney, R.K., and Graner, A. (2003). Exploiting EST

databases for the development and characterization of gene-derived SSR-markers in

barley (Hordeum vulgare L.). Theor Appl Genet 106, 411-422.

Tollefsbol, T.O. (2011). Epigenetics : The new science of genetics. In Handbook of

Epigenetics (Elsevier), pp. 1-6.

Turuspekov, Y., Ormanbekova, D., Rsaliev, A., and Abugalieva, S. (2016). Genome-

wide association study on stem rust resistance in Kazakh spring barley lines. BMC

Plant Biol 16 Suppl 1, 6.

172

Umezawa, T., Yoshida, R., Maruyama, K., Yamaguchi-Shinozaki, K., and

Shinozaki, K. (2004). SRK2C, a SNF1-related protein kinase 2, improves drought

tolerance by controlling stress-responsive gene expression in Arabidopsis thaliana.


Varshney, R.K., Marcel, T.C., Ramsay, L., Russell, J. , Roder, M.S., Stein, N.,

Waugh, R., Langridge, P., Niks, R.E., and Graner, A. (2007). A high density barley

microsatellite consensus map with 775 SSR loci. Theor Appl Genet 114, 1091-1103.

Via, S., and West, J. (2008). The genetic mosaic suggests a new role for hitchhiking

in ecological speciation. Mol Ecol 17, 4334-4345.

Visel, A., Blow, M.J., Li, Z., Zhang, T., Akiyama, J.A., Holt, A., Plajzer-Frick, I.,

Shoukry, M., Wright, C., Chen, F., et al. (2009). ChIP-seq accurately predicts tissue-

specific activity of enhancers. Nature 457, 854-858.

Vogel, J.P., Garvin, D.F., Mockler, T.C., Schmutz, J., Rokhsar, D., Bevan, M.W.,

Barry, K., Lucas, S., Harmon-Smith, M., and Lail, K. (2010). Genome sequencing

and analysis of the model grass Brachypodium distachyon. Nature 463, 763-768.

Wang, J., Liu, L., Ball, T., Yu, L., Li, Y., and Xing, F. (2016). Revealing a 5,000-y-

old beer recipe in China. Proc Natl Acad Sci U S A 113, 6444-6448.

Wang, L., Wang, A., Huang, X., Zhao, Q., Dong, G., Qian, Q., Sang, T., and Han, B.

(2011). Mapping 49 quantitative trait loci at high resolution through sequencing-

based genotyping of rice recombinant inbred lines. Theor Appl Genet 122, 327-340.

173

Wang, M., Jiang, N., Jia, T., Leach, L., Cockram, J., Comadran, J., Shaw, P., Waugh,

R., and Luo, Z. (2012). Genome-wide association mapping of agronomic and

morphologic traits in highly structured populations of barley cultivars. Theor Appl

Genet 124, 233-246.

Wang, N., Long, T., Yao, W., Xiong, L., Zhang, Q., and Wu, C. (2013). Mutant

resources for the functional analysis of the rice genome. Mol Plant 6, 596-604.

Wang, R., Yang, F., Zhang, X.Q., Wu, D., Tan, C., Westcott, S., Broughton, S., Li,

C., Zhang, W., and Xu, Y. (2017). Characterization of a Thermo-Inducible

Chlorophyll-Deficient Mutant in Barley. Front Plant Sci 8, 1936.

Warr, A., Robert, C., Hume, D., Archibald, A., Deeb, N., and Watson, M. (2015).

Exome Sequencing: Current and Future Perspectives. G3 (Bethesda) 5, 1543-1550.

Wehner, G.G., Balko, C.C., Enders, M.M., Humbeck, K.K., and Ordon, F.F. (2015).

Identification of genomic regions involved in tolerance to drought stress and drought

stress induced leaf senescence in juvenile barley. BMC Plant Biol 15, 125.

Wenzl, P., Carling, J., Kudrna, D., Jaccoud, D., Huttner, E., Kleinhofs, A., and

Kilian, A. (2004). Diversity Arrays Technology (DArT) for whole-genome profiling

of barley. Proc Natl Acad Sci U S A 101, 9915-9920.

Wenzl, P., Li, H., Carling, J., Zhou, M., Raman, H., Paul, E., Hearnden, P., Maier,

C., Xia, L., Caig, V., et al. (2006). A high-density consensus map of barley linking

DArT markers to SSR, RFLP and STS loci and agricultural traits. BMC Genomics 7,

206.

174

Westberg, E., and Kadereit, J.W. (2014). Genetic evidence for divergent selection

onOenanthe conioidesandOe. aquatica(Apiaceae), a candidate case for sympatric

speciation. Biol J Linn Soc 113, 50-56.

Wu, C., Li, X., Yuan, W., Chen, G., Kilian, A., Li, J., Xu, C., Li, X., Zhou, D.X.,

Wang, S., et al. (2003). Development of enhancer trap lines for functional analysis of

the rice genome. Plant J 35, 418-427.

Wu, C., You, C., Li, C., Long, T., Chen, G., Byrne, M.E., and Zhang, Q. (2008).

RID1, encoding a Cys2/His2-type zinc finger transcription factor, acts as a master

switch from vegetative to floral development in rice. Proc Natl Acad Sci U S A 105,

12915-12920.

Wu, R. (1970). Nucleotide sequence analysis of DNA. I. Partial sequence of the

cohesive ends of bacteriophage lambda and 186 DNA. J Mol Biol 51, 501-521.

Wu, R., and Kaiser, A.D. (1968). Structure and base sequence in the cohesive ends of

bacteriophage lambda DNA. J Mol Biol 35, 523-537.

Xu, T. (1982). Origin and Evolution of Cultivated Barley in China [J]. Acta Genetica

Sinica 6.

Xue, W., Xing, Y., Weng, X., Zhao, Y., Tang, W., Wang, L., Zhou, H., Yu, S., Xu,

C., Li, X., et al. (2008). Natural variation in Ghd7 is an important regulator of

heading date and yield potential in rice. Nat Genet 40, 761-767.

Yamagata, Y., Yamamoto, E., Aya, K., Win, K.T., Doi, K., Sobrizal, Ito, T.,

Kanamori, H., Wu, J., Matsumoto, T., et al. (2010). Mitochondrial gene in the

175

nuclear genome induces reproductive barrier in rice. Proc Natl Acad Sci U S A 107,

1494-1499.

Yan, H., Saika, H., Maekawa, M., Takamure, I., Tsutsumi, N., Kyozuka, J., and

Nakazono, M. (2007). Rice tillering dwarf mutant dwarf3 has increased leaf

longevity during darkness-induced senescence or hydrogen peroxide-induced cell

death. Genes Genet Syst 82, 361-366.

Yan, W., Liu, H., Zhou, X., Li, Q., Zhang, J., Lu, L., Liu, T., Liu, H., Zhang, C.,

Zhang, Z., et al. (2014). Natural variation in Ghd7. 1 plays an important role in grain

yield and adaptation in rice. Cell Res 23, 2013-2023.

Yang, J., Zhao, X., Cheng, K., Du, H., Ouyang, Y., Chen, J., Q iu, S., Huang, J.,

Jiang, Y., Jiang, L., et al. (2012). A killer-protector system regulates both hybrid

sterility and segregation distortion in rice. Science 337, 1336-1340.

Yang, Z., Zhang, T., Bolshoy, A., Beharav, A., and Nevo, E. (2009). Adaptive

microclimatic structural and expressional dehydrin 1 evolution in wild barley,

Hordeum spontaneum, at 'Evolution Canyon', Mount Carmel, Israel. Mol Ecol 18,

2063-2075.

Yano, K., Yamamoto, E., Aya, K., Takeuchi, H., Lo, P.C., Hu, L., Yamasaki, M.,

Yoshida, S., Kitano, H., Hirano, K., et al. (2016). Genome-wide association study

using whole-genome sequencing rapidly identifies new genes influencing agronomic

traits in rice. Nat Genet 48, 927-934.

176

Yao, W., Li, G., Yu, Y., and Ouyang, Y. (2018). funRiceGenes dataset for

comprehensive understanding and application of rice functional genes. Gigascience

7, 1-9.

Yonemaru, J., Ebana, K., and Yano, M. (2014). HapRice, an SNP haplotype database

and a web tool for rice. Plant Cell Physiol 55, e9.

Youssef, H.M., Eggert, K., Koppolu, R., Alqudah, A.M., Poursarebani, N., Fazeli,

A., Sakuma, S., Tagiri, A., Rutten, T., Govind, G., et al. (2017). VRS2 regulates

hormone-mediated inflorescence patterning in barley. Nat Genet 49, 157-161.

Zeng, D., Tian, Z., Rao, Y., Dong, G., Yang, Y., Huang, L., Leng, Y., Xu, J., Sun,

C., Zhang, G., et al. (2017). Rational design of high-yield and superior-quality rice.

Nat Plants 3, 17031.

Zeng, X., Long, H., Wang, Z., Zhao, S., Tang, Y., Huang, Z., Wang, Y., Xu, Q.,

Mao, L., Deng, G., et al. (2015). The draft genome of Tibetan hulless barley reveals

adaptive patterns to the high stressful Tibetan P lateau. Proc Natl Acad Sci U S A

112, 1095-1100.

Zhang, G., Liu, X., Quan, Z., Cheng, S., Xu, X., Pan, S., Xie, M., Zeng, P., Yue, Z.,

Wang, W., et al. (2012). Genome sequence of foxtail millet (Setaria italica) provides

insights into grass evolution and biofuel potential. Nat Biotechnol 30, 549-554.

Zhang, J., Guo, D., Chang, Y., You, C., Li, X., Dai, X., Weng, Q., Zhang, J., Chen,

G., Li, X., et al. (2007). Non-random distribution of T-DNA insertions at various

levels of the genome hierarchy as revealed by analyzing 13 804 T-DNA flanking

sequences from an enhancer-trap mutant library. Plant J 49, 947-959.

177

Zhang, J., Li, C., Wu, C., Xiong, L., Chen, G., Zhang, Q., and Wang, S. (2006a).

RMD: a rice mutant database for functional analysis of the rice genome. Nucleic

Acids Res 34, D745-748.

Zhang, X., Yazaki, J., Sundaresan, A., Cokus, S., Chan, S.W., Chen, H., Henderson,

I.R., Shinn, P., Pellegrini, M., Jacobsen, S.E., et al. (2006b). Genome-wide high-

resolution mapping and functional analysis of DNA methylation in arabidopsis. Cell

126, 1189-1201.

Zhao, H., Yao, W., Ouyang, Y., Yang, W., Wang, G., Lian, X., Xing, Y., Chen, L.,

and Xie, W. (2015). RiceVarMap: a comprehensive database of rice genomic

variations. Nucleic Acids Res 43, D1018-1022.

Zhao, Y. (2010). Auxin biosynthesis and its role in plant development. Annu Rev

Plant Biol 61, 49-64.

Zhou, F., Lin, Q., Zhu, L., Ren, Y., Zhou, K., Shabek, N., Wu, F., Mao, H., Dong,

W., Gan, L., et al. (2013). D14-SCF(D3)-dependent degradation of D53 regulates

strigolactone signalling. Nature 504, 406-410.

Zhou, G., Broughton, S., Zhang, X.Q., Ma, Y., Zhou, M., and Li, C. (2016).

Genome-Wide Association Mapping of Acid Soil Resistance in Barley (Hordeum

vulgare L.). Front Plant Sci 7, 406.

Zhou, G., Zhang, Q., Zhang, X.Q., Tan, C., and Li, C. (2015). Construction of High-

Density Genetic Map in Barley through Restriction-Site Associated DNA

Sequencing. PLoS One 10, e0133161.

178

Zilberman, D., Gehring, M., Tran, R.K., Ballinger, T., and Henikoff, S. (2007).

Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an

interdependence between methylation and transcription. Nat Genet 39, 61-69.

Zohary, D., Hopf, M., and Weiss, E. (2012). Domestication of Plants in the Old

World: The origin and spread of domesticated plants in Southwest Asia, Europe, and

the Mediterranean Basin (Oxford University Press on Demand).

Zou, J., Zhang, S., Zhang, W., Li, G., Chen, Z., Zhai, W., Zhao, X., Pan, X., Xie, Q.,

and Zhu, L. (2006). The rice HIGH-TILLERING DWARF1 encoding an ortholog of

Arabidopsis MAX3 is required for negative regulation of the outgrowth of axillary

buds. Plant J 48, 687-698.

179

Appendix 1 - Bioinformatic Pipeline

#################################################################

########

## Creater: Cong Tan

## Contact: [email protected]

## Date: 08-July-2018

#################################################################

########

Whole genome sequencing SNP and InDel analysis pipeline

#################################################################

##########

## part1: sequencing result control and filtration

## trimming

rule clean_sickle:

message: """------ start cleaning ------"""

input: "raw/{smp}_1.clean.fq.gz",

"raw/{smp}_2.clean.fq.gz"

output: "clean/{smp}_1.fq.gz",

"clean/{smp}_2.fq.gz",

"clean/{smp}_0.fq.gz"

180

params: "-l 80 -q 30 -t sanger -g"

log: "logs/clean/{smp}.log"

shell: """

sickle pe {params} -f {input[0]} -r {input[1]} -o {output[0]} -p

{output[1]} -s {output[2]} &>{log}

"""

## fastqc

rule QA_fastqc:

input: "clean/{smp}_1.fq.gz",

"clean/{smp}_2.fq.gz",

"clean/{smp}_0.fq.gz"

output: "clean/fastqc/{smp}_1_fastqc.zip",

"clean/fastqc/{smp}_2_fastqc.zip"

"clean/fastqc/{smp}_0_fastqc.zip"

params: " -o clean/fastqc/ -f fastq --extract "

log: "logs/fastqc/{smp}.log"

shell: """

fastqc {params} {input[0]} {input[1]} {input[2]} 2>{log}

"""

#################################################################

########################################3

## part2: mapping reads to the reference

## mapping for each lane

rule mapping_pe:

input: fwd="clean/{smp}/{id}_1.fq.gz",

181

rev="clean/{smp}/{id}_2.fq.gz"

params: " -M -R '@RG\tID:{id}\tSM:{smp}' "

output: pe="mapped/{smp}/{id}.pe.bam"

log:"logs/bwa/{smp}/{id}.bwa.log"

threads:16

shell:"""

bwa mem {params} -t {threads} {REF} {input.fwd} {input.rev} |\

samtools view -bSh -o {output.pe}

"""

rule mapping_se:

input:se="clean/{smp}/{id}_0.fq.gz"

params: " -M -R '@RG\tID:{id}\tSM:{smp}' "

output: se="mapped/{smp}/{id}.se.bam"

threads: 16

shell: """

bwa mem {params} -t {threads} {REF} {input.se} |\

samtools view -bhS -o {output.se}

"""

## merge PE and SE for each lane

rule merge_id:

input: pe="mapped/{smp}/{id}.pe.bam",

se="mapped/{smp}/{id}.se.bam"

output:bam="mapped/{smp}/{id}.bam"

log: "logs/samtools/{smp}/{id}.merge.log"

shell:"""

samtools merge -f {output.bam} {input.pe} {input.se} && \

rm {input.pe} {input.se}

182

"""

## sort for each lane

rule sort_samtools:

input: "mapped/{smp}/{id}.bam"

params: "-T mapped/{smp}/{id}"

output: "mapped/{smp}/{id}.srt.bam"

log: "logs/samtools/{smp}/{id}.sort.log"

threads:16

shell:"""

samtools sort {params} -o {output} {input} \

&& rm {input} 2>{log}

"""

rule rmdup:

input:"mapped/{smp}/{id}.srt.bam"

output: "mapped/{smp}/{id}.rmdup.bam"

log:"logs/rmdup_{smp}_{id}.log"

shell:" samtools rmdup {input} {output} && rm {input} 2>{log} "

## merge lanes for each sample

rule bam_list:

input:expand("mapped/{smp}/{id}.rmdup.bam",zip,smp=SAMPLES,id=IDS)

output:expand("mapped/{smp}/{smp}.bam.txt",smp=list(set(SAMPLES)))

run:

for one in input:

ones = one.split("/")

183

ofile = open("/".join([ones[0],ones[1],ones[1]])+".bam.txt","a")

ofile.write(one+"\n")

rule merge_sample:

input:"mapped/{smp}/{smp}.bam.txt"

output:"mapped/{smp}/{smp}.bam"

run:

ifile = open(input[0]).readlines()

if len(ifile) == 1:

one = ifile[0].strip()

ones = ifile[0].split("/")

two = "/".join([ones[0],ones[1],ones[1]]) + ".bam"

os.rename(one,two)

elif len(ifile) > 1:

cmd="samtools merge -b" + input[0] + " " + output[0]

os.system(cmd)

rule bam_index:

input:"mapped/{smp}/{smp}.bam"

output:"mapped/{smp}/{smp}.bam.bai"

log:"logs/samtools/samtools_index_{smp}.log"

shell:"samtools index {input}"

#################################################################

#########

## part3: SNP and InDel calling by bcftools (round 1)

184

## variants calling

rule call_bcftools:

input:"bam/{smp}.rmdup.bam"

output:"bcf/{smp}.raw.bcf"

log: "logs/bcf/bcf-call.log"

params: sam="-C50 -Q20 -m3 -t AD,DP -u ",bcf="-m -O b"

threads: 8

shell: """

samtools mpileup {params[sam]} -f {ref} -b {input[0]} |\

bcftools call {params[bcf]} -o {output[0]} 2>{log}

"""

## variants filtering

rule filter_bcftools:

input:"bcf/{smp}.raw.bcf"

output:"bcf/{smp}.flt.bcf"

log:"logs/bcf/filter.log"

params:"-e 'QUAL<50||MQ<40 ' -O b "

shell:" bcftools filter -o {input} {output} 2>{log}"

## variants result VCF file index

rule index_bcftools:

input:"bcf/{smp}.flt.bcf"

output:"bcf/{smp}.flt.bcf.csi"

log:"logs/bcf/index.log"

shell:"bcftools index {input} 2>{log}"

185

#################################################################

#########

## part4: SNP and InDel calling by GATK (round 2 and 3)

##gatk-target_creator

rule target_creator:

input:"mapped/{smp}/{id}.rmdup.bam"

output:"mapped/{smp}/{id}.intervals"

threads:8

shell:"""

gatk RealignerTargetCreator \

-I {input} -R {ref} -o {output} \

-fixMisencodedQuals -nt 8

"""

## gatk-realign

rule realign2:

input:"mapped/{smp}/{id}.rmdup.bam","mapped/{smp}/{id}.intervals"

output:"mapped/{smp}/{id}.realign.bam"

threads:8

shell:"""

gatk IndelRealigner \

-R {ref} \

-I {input[0]} \

-targetIntervals {input[1]} \

-o {output}

"""

186

## gatk-calling

rule call-gatk:

input:"bam/{smp}_realign.bam"

output:"gatk/{smp}_raw.g.vcf"

threads:8

shell:"""

gatk HaplotypeCaller

-R {ref} \

-I {input} \

--emitRefConfidence GVCF \

--variant_index_type LINEAR --variant_index_parameter 128000 \

--genotyping_mode DISCOVERY \

-stand_emit_conf 10 -stand_call_conf 30 \

-o {output} 2

"""

## gatk-genotype

rule genotype:

input:"gvcf.list"

output:"genotype.vcf"

threads:30

shell:"""

gatk GenotypeGVCFs \

-R {ref} --variant {input} -o {output}

"""

187

Appendix 2 - List of publications

1. Mascher M, (… Tan C) et al. A chromosome conformation capture ordered

sequence of the barley genome. Nature. 2017; doi:10.1038/nature22043.

2. Beier S, (…Tan C) et al. Construction of a map-based reference genome

sequence for barley, Hordeum vulgare L. Sci data. 2017;

doi:10.1038/sdata.2017.44.

3. Prade VM, (… Tan C) et al. The pseudogenes of barley. Plant J. 2017;

doi:10.1111/tpj.13794.

4. Wang R, (… Tan C) et al. Characterization of a Thermo-Inducible Chlorophyll-

Deficient Mutant in Barley. Front Plant Sci. 2017; doi:10.3389/fpls.2017.01936.

5. Zhang Q, Zhang X, Wang S, Tan C, Zhou G, Li C. Involvement of alternative

splicing in barley seed germination. P loS one. 2016;

10.1371/journal.pone.0152824.

6. Jia Q, Tan C, Wang J et al. Marker development using SLAF-seq and whole-

genome shotgun strategy to fine-map the semi-dwarf gene ari-e in barley. BMC

Genomics. 2016; doi:10.1186/s12864-016-3247-4.

7. Zhou G, Zhang Q, Zhang XQ, Tan C, Li C. Construction of high-density genetic

map in barley through restriction-site associated DNA sequencing. PloS one.

2016; doi:10.1371/journal.pone.0133161.

8. Zhou G, Zhang Q, Tan C, Zhang XQ, Li C. Development of genome-wide InDel

markers and their integration with SSR, DArT and SNP markers in single barley

map. BMC Genomics. 2016; doi:10.1186/s12864-015-2027-x.

188

9. Yang H, Jian J, Li X, Renshaw D, Clements J, Sweetingham MW, Tan C, Li C.

Application of whole genome re-sequencing data in the development of

diagnostic DNA markers tightly linked to a disease-resistance locus for marker-

assisted selection in lupin (Lupinus angustifolius). BMC Genomics. 2015;

doi:10.1186/s12864-015-1878-5.

189

Appendix 3- One manuscript

190

Sympatric ecological speciation of wild barley driven by microclimates at Evolution

Canyon, Israel

Cong Tan1#

, Fei Dai2#

, Camilla Beate Hill1#

, Penghao Wang3, Brett Chapman

4, Qisen Zhang

5,

Yong Jia1, Xiaolei Wang

2, Xiao-Qi Zhang

1, Roberto Barrero

4, Gaofeng Zhou

1, Yong Han

1,

Wenying Zhang6, Manuel Spannagl

6, Ilka Braumann

7, Alan H. Schulman

8,9, Peter

Langridge10

, Matthew Bellgard4, Robbie Waugh

11, Dongfa Sun

12, Eviatar Nevo

13*, Guoping

Zhang2*

, Chengdao Li1,14*

1Western Barley Genetics Alliance, Western Australian State Agricultural Biotechnology

Centre, School of Veterinary and Life Sciences, Murdoch University, WA 6150, Australia

2Department of Agronomy, Zhejiang Key Laboratory of Crop Germplasm, Zhejiang

University, Hangzhou 310058, China

3School of Veterinary and Life Sciences, Murdoch University, Murdoch, WA 6150, Australia

4Centre for Comparative Genomics, Murdoch University, South Street, Murdoch, Perth, WA

6150, Australia

5Australian Export Grains Innovation Centre, 3 Baron-Hay Court, South Perth, WA 6155,

Australia

6Plant Genome and Systems Biology, Helmholtz Center Munich – German Research Center

for Environmental Health, 85764 Neuherberg, Germany

7Carlsberg Laboratory, Gamle Carlsberg Vej 10, 1799 Copenhagen V, Denmark

8LUKE/BI Plant Genomics Lab, Institute of Biotechnology, Viikki BiocenterUniversity of

Helsinki

9LUKE Natural Resources Institute Finland

10School of Agriculture, Food and Wine, Waite Research Institute, The University of

Adelaide, PMB 1, Glen Osmond, SA 5064, Australia

11The James Hutton Institute, Invergowrie, Dundee DD2 5DA, Scotland, UK

12Huazhong Agricultural University, Wuhan, China

13Institute of Evolution, University of Haifa, Haifa 3498838, Israel

14Department of Primary Industries and Regional Development, South Perth, WA 6155,

Australia

#these authors contributed equally to this work

191

*Corresponding authors: Chengdao Li ([email protected]); Guoping Zhang

([email protected]); Eviatar Nevo ([email protected])

192

Abstract

Sympatric speciation is a highly contentious concept and it is under strong debate whether

natural selection can outperform the continuous homogenizing force of free gene flow in a

sympatric ecological condition and generate reproductive incompatibility. Here, we provide

substantial new genomic evidence for adaptation-driven incipient sympatric ecological

speciation of two wild barley populations within 250 m from the ‘Evolution Canyon’ (EC),

consisting of two abutting slopes with drastically different microclimates. Phylogenetic

relationship and population structure analysis uncovered extraordinary genetic divergence

between interslope EC populations, considerably greater than between the Tibetan wild and

cultivated barleys. A genome-wide FST analysis revealed detailed information on the

genomic differentiation with a total of 243 Mb genomic regions under environmental

selection, representing ~5% of the barley reference genome. One prominent ‘chromosome

island’, spanning a continuous region of around 200 Mb, was identified on chromosome 3H

under strong natural selection. A homologous gene involving the indica-japonica rice

speciation located in the selection sweep region with unique amino acid substitution between

the two populations. In-situ hybridization results demonstrated potential chromosome

structure changes in this region. RNA-seq analysis demonstrated that the genotype from one

slope has evolved unique mechanisms in responding to heat and drought. Our study provides

comprehensive genomic evidence and new insights to understand the process of incipient

sympatric speciation of wild barleys.

193

Wild barley (Hordeum spontaneum L.) is the ancestor of cultivated barley (Hordeum vulgare

L.) with a wide geographic distribution across highly diverse environments throughout the

Near East, from the Eastern Mediterranean coasts to Tibet1. Genomic resources obtained

from wild crop relatives have dramaticly contributed to enhancing crop productivity2,3

and

understanding the genetic basis of adaptation and speciation. The evolutionary processes that

drive the extraordinary adaptive genomic divergence in wild crops hold great potential for

crop improvement3-5

.

The high genetic and environmental heterogeneity across the distribution range of wild barley

suggests local adaptation and genomic differentiation along macro and microenvironmental

gradients. The ‘Evolution Canyon’ I (EC), located in Lower Nahal Oren, Mount Carmel in

Israel, is a microsite model for biodiversity evolution, adaptation and incipient sympatric

speciation across life6-8

. Different biotic and abiotic stress pressures at EC make it well suited

to characterize the causes and driving forces of evolution in action3,9

. EC consists of two

abutting slopes separated (on average) by a distance of only 250 m: the temperate, forested,

cool and humid, ‘European’ north-facing slope (NFS) and the tropical, savannoid, hot and

dry, ‘African’ south-facing slope (SFS). The geology and macroclimates are very similar on

the opposing slopes, but they differ sharply in their microclimates (Fig. S1). The SFS features

extreme levels of solar radiation (200‒800% more than NFS) with higher daily temperatures

and drought, whereas NFS is characterized by a cooler climate with a higher relative

humidity (1–7%) than SFS3,6,10

. The sharp microclimatic divergence between the abutting

slopes10

has resulted in contrasting populations of microbial, animal, fungi and plant species8.

We investigated the genomic and transcriptomic signatures of microevolution of wild barley

from EC, collected from mid-stations on the abutting slopes along the North–South

microclimatic gradient of EC. We used high coverage (~40X) whole genome sequencing,

with a comparison to Tibetan wild as well as cultivated barley varieties. This study highlights

194

genetic variants (SNPs and short insertions/deletions) including genes with significant

mutations (missense, lost or gained start or stop codons) identified using the map-based

reference sequence of the barley genome11

as a reference. Genome-wide RNA profiling

identified differentially expressed genes (DEG) between NFS and SFS. This study provides

tangible evidence of ecological stress as a force that generates and maintains local adaptive

patterns, leading to incipient sympatric adaptive ecological speciation, and linking micro and

macro-evolutionary processes in wild barley populations.

Results

Dramatic divergence between wild barleys from SFS and NFS

To investigate the genetic diversity of wild barleys in the ‘Evolution Canyon’ I (EC) in Israel,

transcriptome sequencing was performed with ten samples from SFS and NFS (Fig S1) in

this study. About 3.70 Gb clean sequencing data was obtained for each wild barley accession

on average. A total of 218,709 SNPs and 8,640 InDels detected for these ten barley

accessions in transcriptome level. SNP number for individuals ranges from 92,556 to 119,675

and InDel from 4,174 to 5,407. The SNP frequency distribution of ten wild barleys in each 1

Mb region along seven chromosomes were showed in circus plot (Fig 1a). Overall, the SNP

frequency within distal regions are higher than that in centromere regions. The distribution

patterns are similar for wild barley accessions from each slope but different for accessions

from opposing slopes.

To characterize the genetic divergence of wild barleys from the opposing slopes in the

“Evolution Canyon”, neighbour-joining tree, principle component and population structure

analysis are conducted based on the SNP variants. From the neighbour-joining tree (Fig 1b),

five accessions from the SFS slopes were clearly separated with the other five accessions

from the NFS group. The evolutionary distance was calculated using F84 model12

based on

SNP variants of these ten wild barley accessions (Table S3). The distance among five

195

accessions in SFS are 0.14~0.23 and in NFS 0.18~0.16. However, the distance between SFS

group and NFS group is up to 1.03, which indicates the dramatic genetic divergence between

SFS and NFS wild barleys. Wild barleys from the opposing slopes were also apparently split

into two clusters according to the first eigenvector, which explain up to 56% of genetic

difference between SFS and NFS wild barleys (Fig. 1c). The obvious genetic divergent of

wild barley between the two slopes was also confirmed by population structure analysis (Fig

1d). When K value was set to two, five wild barleys from SFS was inferred to originate from

one ancestor while five NFS wild barleys are from another ancestor.

Bigger genomic divergence between SFS and NFS wild barleys than that between

Tibetan wild barleys and domesticated barleys.

To further investigate the genomic divergence of wild barleys from the opposing slopes, we

performed high coverage whole genome sequencing with another ten barley accessions. They

represent four barley groups: six wild barley accessions collected from the opposing slopes of

the ‘Evolution Canyon’ (SFS and NFS), two wild barley accessions from the Tibetan Plateau

(TB1, TB2) and two cultivated barleys (DB1 and DB2) (Table S1). About 190 Gb of clean

data was obtained for each barley accession on average, which corresponded to about 39.4×

genome coverage of the recently published barley genome reference11

. De novo assembly

was performed for eight out of the ten accessions with the assembly length ranging from 3.63

to 4.54 Gb and N50 ranging from 7.18 to 10.89 Kb (Table S2). More than 99% of clean reads

were mapped to the barley reference genome11

. Up to 84% of the reference sequence

positions were sequenced at least five times. Based solely on reads with unique mapping

position in the reference genome, 63,565,829 single nucleotide variants (SNPs) and

2,673,996 short insertions/deletions (InDels) were identified for ten barley accessions in total.

The distribution of SNP frequency in each 1 Mb region along chromosomes was plotted in

Fig. 2a. In contrast to the RNA sequencing data, the SNP were evenly distributed in the wild

196

barley genomes from EC. However, loss of genetic diversity in some chromosome regions

were clearly evident in the cultivated barleys and even Tibetan wild barleys.

Based on the genome-wide SNPs, neighbor-joining tree, principle component and population

structure were conducted to compare the evolution distance of wild barley between the

apposing slopes and that between Tibetan wild barley and cultivated barley. In the neighbour-

joining tree (Fig. 2b), six wild barley from the “Evolution Canyon” were separated from both

cultivated barley and Tibetan wild barley. The evolutionary distance between SFS and NFS

wild barleys is about 0.60, which is twice of that between Tibetan wild barley and cultivated

barley (~ 0.30) (Table S4). In the plot of principle component analysis (Fig. 2c), six EC wild

barleys were separated from Tibetan wild barley and cultivated barley in the first eigenvector

(~ 25%). Three SFS wild barleys were separated from three NFS wild barleys in the second

eigenvector (~ 21%). Both neighbour-joining tree and the PCAs revealed a distinct separation

of the NFS and SFS wild barley populations, in line with the sharply divergent microclimates

of the two slopes at EC. To estimate the individual evolutionary ancestry and admixture

proportions, a population genetic structure analysis with a K-value from 2 to 4 was conducted

(Fig. 2d). When K was set to 3, NFS and SFS wild barleys are indicated to be derived from

two different ancestors while domesticated barley and Tibetan wild barleys are from the third

ancestor. When K was set to 4, four groups of barleys are assumed to be derived from four

different ancestors. This result also shows that the evolutionary relationship between SFS and

NFS wild barleys are much farther than that between Tibetan wild barley and cultivated

barley.

Genomic regions under environmental selection of NFS and SFS

To uncover the genetic basis of the remarkable genetic divergence and to infer genomic

regions subjected to environmental selection, Wright’s F-statistic FST was calculated between

197

SFS and NFS wild barley groups. The frequency distribution of FST value between SFS and

NFS wild barleys was featured by a multimodal distribution (Fig. S2). The distribution of the

windowed FST value between NFS and SFS wild barley along each chromosome is shown in

Fig. 3. The total length of genomic regions with FST values above the threshold of 0.89 was

about 243 Mb, which contains 919 high confidence genes. A continuous genomic region of

about 200 Mb in the central part of chromosome 3H showed strong environmental selection.

To further investigating whether the environmental selection resulted in speciation between

the SFL and NFS wild barleys, we searched the key genes responsible for reproductive

isolation in several plant species including crops13-18

. Using BLASTP, with a threshold E-

value ≤2.00E-56 and aligned amino acids length >100, we identified 22 homologous genes

for reproductive isolation in the barley genome. 88 unique SNPs or InDels were found

between the NFS and SFS genotypes in twelve genes related to reproductive isolation

(Supplementary 2). Five of these genes were in the above selective sweep regions

(HORVU4Hr1G084220, HORVU3Hr1G052250, HORVU1Hr1G060040,

HORVU3Hr1G107970, and HORVU3Hr1G107990). Notably, two genes

(HORVU3Hr1G052250, HORVU1Hr1G060040) mapped in the selection region showed

unique variants between SFS and NFS populations. A gene homologous to the rice hybrid

male sterility locus SaF (HORVU3Hr1G052250), resulted in indica-japonica speciation

through a two-gene/three-components interaction model15

, was located in the selection sweep

signature region of chromosome 3H (Fig. 3). A substitution of Phe287Ser between SaF+ and

SaF- impairs biological function and results in male sterility in the indica-japonica hybrid. It

is interesting to observe a substitution of Thr179Ala between the SFS and NFS genotypes.

Whether this substitution impacts reproduction between interslope wild barley populations

remains to be validated.

198

Different drought stress response of SFS and NFS wild barleys

Drought is a key climatic stress that distinguishes SFS and NFS. To compare the drought

stress response of wild barleys from the two slopes, transcriptome sequencing was performed

for SFS and NFS wild barleys with treatment to simulate drought stress in the SFS. About

44.40 Gb clean data was generated for twelve samples including SFS and NFS wild barleys

with two conditions and three replications. Differential gene expression was analysed for SFS

and NFS wild barley under drought stress and the overview information is given in Table 1.

274 genes were commonly detected to be differentially expressed in both SFS and NFS wild

barley under drought. 2,791 genes were differentially expressed only in SFS wild barley

under drought stress whereas 1,030 differentially expressed genes only in NFS wild barley

(Fig. 4). The frequency of uniquely deferential expression genes in each 10 Mb region along

seven chromosomes was showed in Fig. 5. More genes were up-regulated in SFS than NFS

wild barley, whereas more genes were down-regulated in NFS than SFS wild barley. GO

enrichment analysis was conducted for DEGs in SFS and NFS wild barleys under drought

stress (Fig. 4S, Fig. 5S). Over-represented GO terms DEGs in SFS wild barley includes

translational elongation, ribosome biogenesis, organic acid metabolic process, lipid metabolic

process and so on (Fig. 4S). The over-represented GO terms in NFS wild barley is different

with that in SFS. For example, lipid metabolic process is over-represented in the DEGs of

SFS wild barley, while it is under-represented in the DEGs of NFS wild barley (Fig. 5S). The

over-represented GO terms in NFS wild barley includes amine metabolic process, tryptophan

metabolic process and so son.

In theory, DEGs response to drought treatment from the SFS is more likely to play important

roles in enhancing the drought tolerance. Thus, we compared the protein similarity between

genes proved to be responsible for drought tolerance in nine species19

and drought-treated

DEGs in SFS wild barley but not NFS. Particularly, we have found 46 of them having

199

homologous genes with experimental evidences to control plant drought resistance (Table

S4). For example, the protein sequence identity between HORVU2Hr1G060350 and CIPK3

is up to 64% with the coverage of 95% (425/446 peptides). CIPK3 was reported to regulate

abscisic acid and cold signal transduction as a calcium sensor-associated protein kinase in

Arabidopsis20

. It was also demonstrated that the expression of CIPK3 is responsive to ABA

and stress conditions and the expression pattern of a number of stress gene markers were

altered by the disruption of CIPK3. Another drought responsive differential expression gene

HORVU2Hr1G075470 in SFS was detected to have 72% protein sequence identity with 92%

coverage with SRK2C gene. The SRK2C was reported to improve drought tolerance by

controlling the expression level of stress-responsive genes in Arabidopsis21

.

Discussion

Cultivated barley were domesticated around 10,000 years ago at the Fertile Crescent where

their wild relatives still thrive today. However, a large proportion of suitable habitats for wild

barley have been lost due to climate changes since the last glacial ages when the climate was

much wetter and cooler22

. Located at Mount Carmel in Israel, EC is the natural habitat of

wild barley and often referred to as the ‘Israeli Galapagos’8. The SFS in EC features extreme

levels of solar radiation (200‒800% higher than the NFS) and the associated higher daily

temperatures and drought3,6,10

. NFS is characterized by a cooler climate with higher relative

humidity (1–7%) than SFS. Thus, EC provides an ideal model to understand the impact of

global warming on genome evolution. Despite evidence of substantial gene flow between the

abutting slopes, significant interslope genetic variation and diversity were detected13-16

. In

this study, whole genome wide genomic diversity was investigated for both SFS and NFS

wild barleys. And dramatic genetic divergency was proved to exist between SFS and NFS

wild barleys both in transcriptome level and whole genome level. It was also suggested that

the genetic divergence between SFS and NFS wild barleys are much larger than that between

Tibetan wild barley and cultivated barley. All these evidence indicates that sympatric

speciation exists in wild barley in EC microclimate environment, which coincides with that

reported in other species23-25

.

200

Considering that the limited gene flow may be a major factor causing the sympatric

speciation of wild barley in the opposing slopes, allele frequency was analysed in whole

genome based on the Wright’s F-statistic FST. Up to 243 Mb genomic regions containing

919 high confidence genes were indicated to under the selection of differential microclimates

between SFS and NFS. 52 out of the 919 genes were differentially express in SFS wild barley

under drought stress and 18 were differentially expressed in NFS. There is a large genomic

region with length of about 200 Mb in central part of chromosome 3H showing dramatic

environmental selection. We also performed Fluorescence in situ hybridization (FISH) to

compare the overall karyotypical difference of each chromosome between SFS and NFS wild

barleys (Fig. 6). The dramatic karyotypical differenced was detected in the chromosome 3H

between SFS and NFS wild barley. To further exploit the exact genes in these genomic

regions under environmental selection, more controllable experiment materials such as

advanced mapping populations need to construct.

The most contrasting environmental difference between the opposing slopes is drought and

heat due to different levels of solar radiation. Drought alters the gene expression of

compounds that protect cells from dehydration, including genes belonging to the Dehydrin

gene group and heat shock protein 17. Suprunova et al26

measured the expression levels of

Dhn genes in tolerant and sensitive wild barley genotypes, and found that Dhn levels

increased earlier and with higher intensity in tolerant genotypes than in sensitive genotypes.

Confirming these findings, Yang et al27,28

detected higher levels of Dhn expression in the

resistant SFS compared to sensitive NFS wild barley after dehydration24

. To investigate the

influence of drought stress on the transcriptional level of wild barley from the opposing

slopes, we performed transcriptome sequencing and detected DEGs responding to the drought

treatment. We showed that the number of DEGs in SFS barley is about three times of that in

NFS barley after dehydration stress, further supporting interslope genotypic divergence

between tolerant genotypes on the SFS and sensitive genotypes on the NFS. The response to

drought treatment in SFS resulted in differential gene expression of more genes with greater

magnitude compared to NFS, including drought tolerance and earlier flowering-related genes.

Lipid metabolic process is over-represented in DEGs of SFS wild barley but under-

represented in DEGs of NFS wild barleys. Previous study indicated that the adjustment of

lipid content in cell membrane plays vital roles in response to drought stress in Arabidopsis29

.

Thus, adjustment of lipid metabolism may be an important mechanism for the wild barley in

SFS to adapt to the extreme drought and heat environment.

201

Diverse stress factors such as drought can either accelerate or delay flowering in many plant

species. Alteration of flowering in response to stress is known as stress-induced flowering,

which has recently been considered a third category of flowering response in addition to

photoperiodic flowering and vernalisation30

. Wild barley from SFS flowered one to two

weeks earlier than NFS wild barley, even under well-watered conditions25

. Different

flowering date was also assumed to be one factor limiting the gene flow between SFS and

NFS wild barley. Recently, Nevo et al3 reported that wild barley and wild emmer wheat

(Triticum dicoccoides) populations across Israel flowered about ten days earlier in 2009

compared to 1980 as a result of rising average temperatures due to global warming. In this

study, seven genes with roles in flowering time were detected within the selective sweep

regions, including PHYB and GI. PHYB modulates gene activity to influence plant growth

and development and increases drought tolerance by enhancing abscisic acid (ABA)

sensitivity in Arabidopsis thaliana31

. Here, PHYB transcript was present at higher levels in

SFS and lower levels in NFS wild barley after drought, suggesting that higher PHYB

expression alleviates the symptoms of stress more effectively in SFS wild barley. Drought

also alters the circadian clock, impacting the expression of circadian clock genes such as

GIGANTEA (GI) to mediate different signaling pathways that modulate drought stress

perception and responses32

. In Arabidopsis, GI enabled the drought escape response via

ABA-dependent activation of the flowering time genes FLOWERING LOCUS T (FT), TWIN

SISTER OF FT (TSF) and SUPPRESSOR OF OVEREXPRESSION OF CONSTANS (SOC1).

Drought is a stress that impacts both SFS and NFS, therefore, both had increased GI levels

after drought. The genotype from NFS was particularly vulnerable to drought, such that GI

was more up-regulated than down-regulated in SFS to potentially induce earlier flowering.

The wild barleys in EC have evolved a different mechanism to control flowering time

through light signalling pathway comparing to the cultivated barley in which photoperiodic

response and vernalisation are the major factors for flowering.

Sympatric speciation—the evolution of reproductive isolation without geographic barriers—

is a highly contentious concept in evolutionary biology33,34

. Whether natural selection can

outperform the continuous homogenizing force of free gene flow in a sympatric ecological

condition and generate reproductive incompatibility is under strong debate. Our study is the

first one to use whole-genome sequencing and transcriptome sequencing to investigate the

genomic differentiation pattern of sympatric speciation in a plant species. A genome-wide

FST analysis revealed detailed information on the genomic differentiation between SFS and

202

NFS barley. Notably, the total genomic region identified as a divergence outlier (FST > 0.89)

was about 243 Mb (Fig. 4), representing ~5% of the barley reference genome, and falls

within the 0.4–35.5% range reported for other plants under divergent selection35

. This value

is higher than the 3.6% reported for Oenanthe conioides and Oenanthe aquatic, two Apiaceae

plant species under sympatric speciation36

. In another study, on the sympatric speciation of a

palm tree in an oceanic island, only four of the 274 AFLP loci differed strongly between the

two diverging populations37

. In our study, the wild barley on the two opposing slopes are

under diverse and strong environmental selection involving heat, drought, illuminance and

pathogen stresses6,8

, which may explain the different result from the other two sympatric

speciation cases, where ecological selection in the corresponding plant populations was

driven by a single environmental factor such as soil pH32

or fresh water tides37

, respectively.

Moreover, the proportion of genomes under divergence in both studies was estimated from

AFLP analyses which may be biased and unable to uncover all the genomic regions under

natural selection. Interestingly, genomic divergence analyses of the SFS and NFS barley

revealed one prominent ‘chromosome island’ on chromosome 3H under strong natural

selection, spanning a continuous region of around 200 Mb (Fig. 3). This ‘unusual’ divergence

signature represents a typical genomic differentiation pattern expected for sympatric

speciation, where only those genetic regions directly related to the specific environmental

selection pressure (most likely heat, drought, illuminance stresses) will be differentiated,

while the remainder of the genome is subjected to continuous homogenizing forces due to

free gene flow in the sympatric setting35

. This is consistent with the genetic differentiation

pattern reported in the other two sympatric speciation cases of plants32,37

. In contrast, genetic

differences are expected to accumulate throughout the whole genome for allopatric speciation

with geographic isolation38

. Our results lend further support to the sympatric speciation of

wild barley at EC at the genetic divergence level. Detailed examination of the genomic

function of this chromosome island may shed light on the speciation process of plants and

allow us to understand better how wild barley evolves to adapt to highly diverse

environmental conditions across the world. Our results are also supported by the previous

studies based on F1 hybrids from 40 intraslope and 50 interslope crosses tested for a range of

vegetative and reproductive traits , supports incipient sympatric speciation in wild barley from

opposing slopes at EC7,25

. For most traits, interslope crossbred plants were significantly

inferior to intraslope hybrids.

203

In summary, we observed strong genomic differentiation of wild barley across a short

distance of 250 m due to the sharp microclimatic divergence between the abutting slope

populations at Evolution Canyon I, Israel. To put the local patter ns of genetic variation into a

broader geographic context, we also analyzed two Tibetan wild barley accessions and two

Canadian and Australian cultivated barley varieties. Sample accessions from both sides of the

slopes at EC confirmed the presence of strong genomic differentiation underlying adaptive

incipient sympatric ecological speciation. The process of adaptive incipient sympatric

speciation in wild barley at EC, due to sharp interslope microclimatic divergence is supported

by molecular markers, genomic and transcriptomic evidence, and is shared with organisms

across phylogeny from bacteria to mammals identified in previous studies8. The genetic base

of cultivated barley has rapidly narrowed resulting in loss of genetic diversity, increasing the

risk that cultivated barley will become increasingly susceptible to diseases and environmental

stresses. Wild and cultivated barley are both diploid and inter-fertile opening up the

possibility of using wild barley for genetic variation in the improvement of cultivated barley

against abiotic and biotic environmental stresses.

204

Methods

Sample collection. The two slopes of ‘Evolution Canyon’ (EC) have seven sampling stations

(Fig. S1). Ten barley accessions used in this study for transcriptome sequencing were

independent accessions collected from SFS station #2 and NFS station #6. The ten barley

accessions used in this study for whole-genome sequencing were from four different groups:

Tibetan wild barley (W1 and X1), Australian and Canadian malting barley cultivars (Baudin

and AC Metcalfe), and wild barley from EC, Mount Carmel, Israel (South Face Slope station

SFS2.1, SFS2.2, SFS2.3 and North Face Slope station NFS6.1, NFS7.1 and NFS7.2). The

Tibetan wild barleys were collected on the Tibetan Plateau by T.W. Xu39

. The two

commercial barley cultivars Baudin (Australia) and AC Metcalfe (Canada) are premium

malting quality varieties.

Transcriptome sequencing. Wild barley seeds were sown in pots filled with peat under

well-watered conditions. When the plants reached the third leaf stage, watering was ceased

for the drought treatment but continued for the controls for a further 80 days. Fully expanded

leaves on both the control and droughted plants were sampled with three replicates and flash

frozen in liquid nitrogen for RNA extraction. Total RNA was extracted using TRIzol Reagent

(Invitrogen, Carlsbad, CA, USA) and purified using the RNeasy Mini Kit (Qiagen,

Germantown, MD, USA). 2×150 bp paired-end sequencing libraries were constructed and

sequenced on an Illumina HiSeq2000 platform (Illumina, San Diego, CA, USA) as described

previously40

.

Whole-genome sequencing. Genomic DNA was extracted from the leaves of ten barley

plants, one plant per accession. The DNA was analyzed for quality and concentration using

agarose gels and an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, Santa

Clara, CA, USA). Genomic DNA was fragmented and then used to prepare Illumina paired-

205

end libraries, following the manufacturer’s instructions (Paired-End Sample Preparation

Guide, Illumina, 1005063). Insert fragment lengths of 500 bp, 700 bp and 1 kb were

generated. A final set of 48 libraries was then sequenced on the HiSeq 2000 platform in BGI-

Shenzhen (Beijing Genomics Institute-Shenzhen, Shenzhen, China).

Genome assembly and assessment. All reads were filtered for quality using Trimmomatic41

,

with a window of 4 bp requiring an average >Q20 and minimum read length of 50 bp.

Genome assemblies for each of the eight accessions were performed at the Pawsey

Supercomputing Centre (Murdoch University, Australia) using 32 GB memory and six cores,

using SOAPdenovo242

with k=51. The overall scaffold numbers and lengths of the

assemblies were determined, and the level of completeness was assessed using BLASTN43

.

Variant calling and validation. The barley pseudo-molecular genome11

was used as the

reference to mapping the clean reads using BWA-MEM44

with default parameters. PCR

Duplications were marked and removed using Picard (version 1.129). Only those reads

having unique mapping positions in reference genome were remained and proceeded to detect

genomic variation (SNPs and InDels). Sequence variations (insertions, deletions and SNPs)

were detected by running three rounds of SAMTools plus BCFToools and the Genome

Analysis ToolKit (GATK) variants calling pipeline. Briefly, the first round was performed

with SAMTool plus BCFTools and filtered based on the mapping quality and variants calling

quality. The result of the first round was used as the guide of realignment around potential

InDel and variants calling for the second round in GATK. The common variants detected by

both SAMTools plus BCFTools pipeline and GATK pipeline were used to guide variants

calling in the third round using GATK.

Population structure analysis. The phylogenetic tree was constructed from the neighbour-

joining method using PHYLIP (version 3) based on evolutionary distance, which was

206

computed with the F84 model using the DNADIST utility in PHYLIP package. The tree file

was constructed and annotated using MEGA45

(version 6) and TreeView (version 1.6.6).

Principle component analysis was performed with the EIGENSOFT program (version 6).

Subpopulation components were estimated using the ADMIXTURE program (version 1.23)

with the K cluster ranging from 1 to 6.

Selective sweep analysis. Selective sweep analysis is one of the main methods for detecting

environmental or natural selection signatures. Genomic regions under selective sweeps due to

environmental stresses on opposing slopes were measured using the fixation index FST.

Selective sweep analysis was conducted by measuring the patterns of allele frequencies in

each 1 Mb fragment along chromosomes using the whole-genome-wide SNP variation data.

Genomic regions with FST values above 0.89 were considered under strong selective sweeps.

Differential expression analysis . From an initial 368,900,936 paired-end raw reads,

295,697,023 clean reads were retained after filtering. Retained pair-ended reads were aligned

against the up-to-date version of the Morex barley pseudomolecule sequences using the open

source Kmer adaptive aligner Biokanga (version 3.8.7). In total, 81,683 transcripts resulted in

130,419,949 unique alignments. A count matrix was generated and loaded into R (version

3.3.1) for downstream statistical analysis. Read counts were normalized using the model-

based Trimmed Mean of M-values (TMM) method of Robinson and Oshlack46

, followed by

fitting of a negative binomial generalized linear model (GLR) accommodating the genotype

differences and effects induced by the drought treatment. The edgeR package47

(version

3.14.0) was used for transcript differential expression analysis. A total of 21,122 transcripts

were included in the differentially expressed (DE) analysis with more than ten reads for at

least three samples. A likelihood ratio test for comparing the genotypes was conducted to

obtain DE transcripts, and the resulting P-values were adjusted for multiple testing using

207

Hochberg’s FDR adjustment approach48

. Transcripts were considered highly differentially

expressed when the adjusted P-value was <0.05. Meanwhile, GO enrichment analysis was

performed on the differentially expressed genes. The gene subsets were compared to the high

confidence gene set of barley genome reference11

to determine enriched or depleted gene

ontology terms. For the enrichment analysis, the free and open-source R package GOstats49

(hypergeometric test) was used with p-value cutoff of 0.05.

Function and pathway annotation. Function and pathway annotation based on protein

homology was performed using an automatic functional annotation and classification tool

(AutoFACT). Briefly, coding sequences were aligned to NT database using program blastn

and to protein databases including NR, Swiss-Prot, KEGG using program blastx with the

same threshold. And then the function and pathway annotations of the best hits with a

biological informative description were assigned to query sequences.

208

References

1. Zohary, D., Hopf, M. & Weiss, E. Domestication of Plants in the Old World: The origin and spread of domesticated plants in Southwest Asia, Europe, and the Mediterranean Basin, (Oxford University Press on Demand, 2012).

2. Hajjar, R. & Hodgkin, T. The use of wild relatives in crop improvement: A survey of developments over the last 20 years. Euphytica 156, 1-13 (2007).

3. Nevo, E. "Evolution Canyon," a potential microscale monitor of global warming across life. Proc Natl Acad Sci U S A 109, 2960-5 (2012).

4. Nevo, E., Shewry, P.R. & others. Origin, evolution, population genetics and resources for breeding of wild barley, Hordeum spontaneum, in the Fertile Crescent, 19-43 (CAB International, Wallingford, UK, 1992).

5. Tester, M. & Langridge, P. Breeding technologies to increase crop production in a changing world. Science 327, 818-22 (2010).

6. Nevo, E. Asian, African and European Biota Meet at Evolution-Canyon Israel - Local Tests of Global Biodiversity and Genetic Diversity Patterns. P Roy Soc B-Biol Sci 262, 149-155 (1995).

7. Nevo, E. Evolution canyon": A microcosm of life's evolution focusing on adaptation and speciation. Isr J Ecol Evol 52, 485-506 (2006).

8. Nevo, E. Evolution in action: adaptation and incipient sympatric speciation with gene flow across life at "Evolution Canyon", Israel. Isr J Ecol Evol 60, 85-98 (2014).

9. Nevo, E. Darwinian Evolution: Evolution in Action across Life at "Evolution Canyon", Israel. Isr J Ecol Evol 55, 215-225 (2009).

10. Pavlícek, T., Sharon, D., Kravchenko, V., Saaroni, H. & Nevo, E. Microclimatic interslope differences underlying biodiversity contrasts in" Evolution Canyon", Mt. Carmel, Israel. Isr J Earth Sci 52(2003).

11. Mascher, M. et al. A chromosome conformation capture ordered sequence of the barley genome. Nature 544, 427-433 (2017).

12. Felsenstein, J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368-76 (1981).

13. Bikard, D. et al. Divergent evolution of duplicate genes leads to genetic incompatibilities within A. thaliana. Science 323, 623-6 (2009).

14. Chen, J. et al. A triallelic system of S5 is a major regulator of the reproductive barrier and compatibility of indica-japonica hybrids in rice. Proc Natl Acad Sci U S A 105, 11436-41 (2008).

15. Long, Y. et al. Hybrid male sterility in rice controlled by interaction between divergent alleles of two adjacent genes. Proc Natl Acad Sci U S A 105, 18871-6 (2008).

16. Yamagata, Y. et al. Mitochondrial gene in the nuclear genome induces reproductive barrier in rice. Proc Natl Acad Sci U S A 107, 1494-9 (2010).

17. Yang, J. et al. A killer-protector system regulates both hybrid sterility and segregation distortion in rice. Science 337, 1336-40 (2012).

18. Mizuta, Y., Harushima, Y. & Kurata, N. Rice pollen hybrid incompatibility caused by reciprocal gene loss of duplicated genes. Proc Natl Acad Sci U S A 107, 20417-22 (2010).

19. Alter, S. et al. DroughtDB: an expert-curated compilation of plant drought stress genes and their homologs in nine species. Database (Oxford) (2015).

20. Kim, K.N., Cheong, Y.H., Grant, J.J., Pandey, G.K. & Luan, S. CIPK3, a calcium sensor-associated protein kinase that regulates abscisic acid and cold signal transduction in Arabidopsis. Plant Cell 15, 411-423 (2003).

21. Umezawa, T., Yoshida, R., Maruyama, K., Yamaguchi-Shinozaki, K. & Shinozaki, K. SRK2C, a SNF1-related protein kinase 2, improves drought tolerance by controlling stress-responsive gene expression in Arabidopsis thaliana. Proc Natl Acad Sci U S A 101, 17306-17311 (2004).

209

22. Russell, J. et al. Genetic diversity and ecological niche modelling of wild barley: refugia, large-scale post-LGM range expansion and limited mid-future climate threats? PLoS One 9, e86021 (2014).

23. Li, K. et al. Sympatric speciation of spiny mice, Acomys, unfolded transcriptomically at Evolution Canyon, Israel. Proc Natl Acad Sci U S A 113, 8254-9 (2016).

24. Hadid, Y. et al. Sympatric incipient speciation of spiny mice Acomys at “Evolution Canyon,” Israel. Proc Natl Acad Sci U S A 111, 1043-1048 (2014).

25. Parnas, T. Evidence for Incipient Sympatric Speciation in Wild Barley Hordeum Spontaneum, at" Evolution Canyon", Mount Carmel, Israel, Based on Hybridization and Physiological and Genetic Diversity Estimates, (University of Haifa, Faculty of Science and Science Education, Department of Evolutionary and Environmental Biology, 2006).

26. Suprunova, T. et al. Differential expression of dehydrin (Dhn) in response to water stress in resistant and sensitive wild barley (Hordeum spontaneum). Plant Cell Environ 27, 297-308 (2004).

27. Yang, Z.J., Zhang, T., Bolshoy, A., Beharav, A. & Nevo, E. Adaptive microclimatic structural and expressional dehydrin 1 evolution in wild barley, Hordeum spontaneum, at 'Evolution Canyon', Mount Carmel, Israel. Mol Ecol 18, 2063-2075 (2009).

28. Yang, Z.J., Zhang, T., Li, G.R. & Nevo, E. Adaptive microclimatic evolution of the dehydrin 6 gene in wild barley at "Evolution Canyon", Israel. Genetica 139, 1429-1438 (2011).

29. Gigon, A., Matos, A.R., Laffray, D., Zuily-Fodil, Y. & Pham-Thi, A.T. Effect of drought stress on lipid metabolism in the leaves of Arabidopsis thaliana (ecotype Columbia). Ann Bot 94, 345-51 (2004).

30. Takeno, K. Stress-induced flowering: the third category of flowering response. J Exp Bot 67, 4925-34 (2016).

31. Gonzalez-Guzman, M. et al. Arabidopsis PYR/PYL/RCAR receptors play a major role in quantitative regulation of stomatal aperture and transcriptional response to abscisic acid. Plant Cell 24, 2483-96 (2012).

32. Riboni, M., Galbiati, M., Tonelli, C. & Conti, L. GIGANTEA Enables Drought Escape Response via Abscisic Acid-Dependent Activation of the Florigens and SUPPRESSOR OF OVEREXPRESSION OF CONSTANS1. Plant Physiol 162, 1706-1719 (2013).

33. Bird, C.E., Fernandez-Silva, I., Skillings, D.J. & Toonen, R.J. Sympatric Speciation in the Post "Modern Synthesis" Era of Evolutionary Biology. Evol Biol 39, 158-180 (2012).

34. Bolnick, D.I. & Fitzpatrick, B.M. Sympatric speciation: Models and empirical evidence. Annu Rev Ecol Evol S 38, 459-487 (2007).

35. Strasburg, J.L. et al. What can patterns of differentiation across plant genomes tell us about adaptation and speciation? Philos Trans R Soc Lond B Biol Sci 367, 364-373 (2012).

36. Westberg, E. & Kadereit, J.W. Genetic evidence for divergent selection on Oenanthe conioides and Oe. aquatica (Apiaceae), a candidate case for sympatric speciation. Biol J Linn Soc 113, 50-56 (2014).

37. Savolainen, V. et al. Sympatric speciation in palms on an oceanic island. Nature 441, 210-3 (2006).

38. Via, S. & West, J. The genetic mosaic suggests a new role for hitchhiking in ecological speciation. Mol Ecol 17, 4334-45 (2008).

39. Tingwen, X. Origin and Evolution of Cultivated Barley in China [J]. Acta Genetica Sinica 6(1982).

40. Dai, F. et al. Transcriptome profiling reveals mosaic genomic origins of modern cultivated barley. Proc Natl Acad Sci U S A 111, 13403-8 (2014).

41. Bolger, A.M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-20 (2014).

42. Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).

210

43. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J Mol Biol 215, 403-10 (1990).

44. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589-95 (2010).

45. Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30, 2725-9 (2013).

46. Robinson, M.D. & Oshlack, A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 11, R25 (2010).

47. Robinson, M.D., McCarthy, D.J. & Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-40 (2010).

48. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 57, 289-300 (1995).

49. Gentleman, R.C. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80 (2004).

211

Acknowledgements

This work was supported by the Australian Grain Research and Development Corporation

(DAW00233) and the National Natural Science Foundation of China (31620103912 and

31471480) and the Natural Science Foundation of Zhejiang Province Grant LR15C130001.

Bioinformatic resources were provided by the Pawsey Supercomputing Centre with funding

from the Australian Government and the Government of Western Australia, and the NeCTAR

Research Cloud, a collaborative Australian research platform supported by the National

Collaborative Research Infrastructure Strategy (NCRIS). The International Barley Genome

Sequence Consortium provided pre-publication access to the barley reference genome

sequence with particular thanks to Nils Stein, Martin Mascher, Tim Close and Mats Hansson.

Authorship contributions

EN, DS, WZ, CL collected the samples. XZ, FD, WZ, CL performed DNA sequencing. FD,

XW, GZ performed RNA sequencing and the drought experiment. CT, PW, BC, RB, MB

processed the data and performed computational analyses. CT, CBH, PW, BC, RB, QZ, CL

analyzed the data and interpreted the results. CT, CBH, EN, RB, BC, YJ, QZ, PW, YH and

CL wrote the manuscript. MS, IB, AS, PL, RW, GZ and CL provided the reference genome

sequence. CL, EN and GZ conceived and designed the study. All authors have read and

approved the manuscript for publication.

Competing financial interests: The authors declare no competing financial interest

Materials & Correspondence: Correspondence and requests for materials should be

addressed to C.L.

Data availability

212

All of the raw reads for the whole-genome shotgun sequencing were deposited in NCBI-SRA

with accession number SRP076351, and sample IDs SAMN05207291 to SAMN05207298,

SAMN08038525 and SAMN08038526. Accession identifiers for ten RNA-seq samples are

SAMN05200841 and SAMN05207047-SAMN05207055.

FIGURE LEGENDS

Fig. 1 | Genomic diversity and evolution related analysis based on RNA sequencing of

ten samples from two slopes of ‘’Evolution Canyon”. a, SNP frequency distribution is

displayed as ten circular histograms representing ten barley accessions from two slopes. From

inner to outer, circle 1-10 indicates barley accessions of NFS148, NFS146, NFS145,

NFS144, NFS143, SFS125, SFS124, SFS123, SFS122, SFS120. The height of bars indicates

the SNP number (logarithm with base of 10) in each 1 Mb genomic region. b, Neighbor-Join

tree construction. c, Principle component analysis. This analysis was conducted based on

218,709 SNPs without missing data using EIGENSOFT. d, Population structure analysis was

conducted using admixture.

Fig. 2 | Genomic diversity and evolution related analysis based on whole genome

sequencing of ten samples from four different groups. a, SNP frequency distribution is

displayed as ten circular histograms representing ten barley accessions. From inner to outer,

circle 1-10 indicates barley accessions of DB2, DB1, TB2, TB1, SFS2.2, SFS2.1, SFS2.3,

NFS6.1, NFS7.2, NFS7.1. The height of bars indicates the SNP number (logarithm with base

of 10) in each 1 Mb genomic region.; b, Neighbour-Join tree construction. It was conducted

using Phylip based on the sequence distance, which was calculated based on 29429 SNPs

evenly distributed in each chromosome. c, Principle component analysis. This analysis was

213

conducted based on 54,255,907 SNPs without missing data using EIGENSOFT. d,

Population structure analysis was conducted using admixture

Fig. 3 | Selection sweep analysis in whole genome. It was conducted based on on

54,255,907 SNPs from whole genome sequencing without missing data using VCFtools. Red

horizontal line indicates the threshold of 95%, which is the Fst value of 0.89.

Fig. 4 | Statistics of DEGs under drought stress treatment in SFS and NFS. Yellow circle

indicates numbers of DEGs in SFS, blue for DEGs in NFS and overlap part is the DEGs

detected in both SFS and NFS.

Fig. 5 | Distribution of DEGs under drought stress treatment in SFS and NFS. Y-axis

indicates the differential gene number in each 10 Mb region and X-axis indicates physical

position. Blue dots represent DEGs detected only in SFS under drought stress treatment and

red represent DEGs in NFS and green for DEGs detected in both SFS and NFS

Fig. 6 | Karyotypical comparison of wild barley NFS6.1 and SFS2.3 and cultivated barley

Golden Promise by FISH.

Tables

Table 1 | Summary of the number of differentially expressed genes (DEG) of samples

from the SFS and NFS after drought stress.

Treatments Total DEG Up-regulation Down-regulation

SFS 2791 1607 1184

NFS 1030 212 818

overlap 274 129 145

exploring genomic diversity in wild and cultivated barley using ......7.3.1 phenotype of bw370 and...

Documents