predictive association between trait data and eco-geographic data for nordic barley landraces...

A Lifeboat to the Gene PoolPredictive association between trait data and eco-geographic data

for identification of trait properties useful for improvement of food crops

Vavilov Seminar at IPK GaterslebenMay 12, 2010 - Dag Endresen, NordGen

Topics:• Utilization of genetic diversity• Core collection subset• Trait mining selection (FIGS)• Computer modeling• Some examples (FIGS)

2

Domestication bottleneckunlocking genetic potential from the wild

corn, maize

wild tomato

tomato

teosinte 3

Crop Genetic Diversity

Traditional landracesCrop Wild Relatives Modern cultivars

Genetic bottlenecks during crop domestication and during modern plant breeding. The circles represent allelic variation. The funnels represents allelic variation of genes found in the crop wild relatives, but gradually lost during domestication, traditional cultivation and modern plant breeding.

4Illustration based on: Tanksley, Steven D. and Susan R. McCouch 1997. Seed Banks and Molecular Maps: Unlocking Genetic Potential from the Wild Science 277 (5329), 1063. (22 August 1997). doi:10.1126/science.277.5329.1063

• Primitive crops and traditional landraces are an important source for novel traits for improvement of modern crops.

• Landraces are often not well described for the economically valuable traits.

• Identification of novel crop traits will often be the result of a larger field trial screening project (thousands of individual plants).

• Large scale field trials are very costly, area and human working hours.

5

Plant Genetic Resources for Crop Improvement

Challenges for improved utilization of genetic resources for crop improvement :

* Large gene bank collections* Limited screening capacity

6

A needle in a hay stack• Scientists and plant breeders want a

few hundred germplasm accessions to evaluate for a particular trait.

• How does the scientist select a small subset likely to have the useful trait?

• Example: More than 560 000 wheat accessions in genebanks worldwide.

7Slide adopted from a slide by Ken Street, ICARDA (FIGS team)

Core collection subset• The scientist or the breeder

need a smaller subset to cope with the field screening experiments.

• A common approach is to create a so-called core collection.

8

Sir Otto H. Frankel (1900-1998) proposed a limited set established from an existing collection with minimum similarity between its entries.

The core collection is of limited size and chosen to represent the genetic diversity of a large collection (1984) .

Core subset selection• Given that the trait

property you are looking for is relatively rare:

• Perhaps as rare as a unique allele for one single landrace cultivar...

• Getting what you want is largely a question of LUCK!

9Slide adopted from a slide by Ken Street, ICARDA (FIGS team)

FIGS analysis methodFocused Identification of Germplasm Strategy

10

Focused Identification of Germplasm Strategy

Objective of this method:

– Explore climate data as a prediction model for “computer pre-screening” of crop traits BEFORE full scale field trials.

– Identification of landraces with a higher probability of holding an interesting trait property.

11

Climate effect during the cultivation process

Wild relatives are shaped by the environment

Primitive cultivated crops are shaped by local climate and humans

Traditional cultivated crops (landraces) are shaped by climate and humans

Modern cultivated crops are mostly shaped by humans (plant breeders)

Perhaps future crops are shaped in the molecular laboratory…? 12

Predictive pattern between eco-geography and trait

The predictive pattern between the eco-geography and the traits can of course also have other sources than adaption.

During traditional cultivation the farmer will also select for and introduce germplasm for improved suitability of the landrace to the local conditions.

13

14

FIGS selection methodAssumption: the climate at the original source location, where the landrace was developed during long-term traditional cultivation, is correlated to the trait score.

Aim: to build a computer model explaining the crop trait score (dependent variables) from the climate data (independent

variables).

We combine three datasets1) Landrace samples (genebank seed accessions)2) Trait observations (experimental design) - High cost data3) Climate data (for the landrace location of origin) - Low cost data

• The accession identifier (accession number) provides the bridge to the crop trait observations.• The longitude, latitude coordinates for the original collecting site of the accessions (landraces) provide the

bridge to the environmental data.

15

1. Genetic resources, genebank collections

16MORE THAN 7.4 MILLION GENEBANK ACCESSIONS, MORE THAN 1 400 GENEBANKS, WORLDWIDE.

Lima, Peru

Benin

Alnarp, Sweden

Svalbard

2. Trait data, descriptive crop data

http://barley.ipk-gatersleben.de

17Powdery Mildew, Blumeria graminis

Leaf spotsAscochyta sp.

Yellow rustPuccinia strilformis

Black stem rustPuccinia graminis

Faba bean, Finland Field trials, Gatersleben, Germany

Forage crops, Dotnuva, Lithuania Radish (S. Jeppson)

Potato Priekuli Latvia

Linnés äpple

http://barley.ipk-gatersleben.de/

3. Climate data – WorldClimThe climate data can be extracted from the WorldClim dataset.http://www.worldclim.org/

Data from weather stations worldwide are combined to a continuous surface layer.

Climate data for each landrace is extracted from this surface layer.

Precipitation: 20 590 stations

Temperature: 7 280 stations18

http://www.worldclim.org/

FIGS – Focused Identification of Germplasm Strategy

FIGS selection is a new method to predict crop traits of primitive cultivated material from climate variables by using multivariate statistical methods.

19

Origin of Concept (1980s):Wheat and barley landraces from marine soils in the Mediterranean region provided genetic variation for boron toxicity.

What is Focused Identification of Germplasm Strategy

Slide made byMichael Mackay 1995

http://www.figstraitmine.org/

20

South Australia

Mediterranean region


21

FIGS

The FIGS technology takes much of the guess work out of choosing which accessions are most likely to contain the specific characteristics being sought by plant breeders to improve plant productivity across numerous challenging environments. http://www.figstraitmine.org/

FIGS salinity set 21




er

y

a

n

is

F I G SOCUSED DENTIFICATION OF ERMPLASM TRATEGY

Data la

yers sie

ve acce

ssions

ba

sed on

latitud

e &

lon

gitud

e

Slide made byMichael Mackay 1995

22

Ecological Niche ModelingSpecies Distribution Models

• The fundamental ecological niche of an organism was formalized by G. E. Hutchinson[1] in 1957 as a multidimensional hypercube defining the ecological conditions that allow a species to exist.

• A computer model of the occurrence localities together with associated environmental conditions such as rainfall, temperature, day length etc., provides an approximation of the fundamental niche.

• Popular software implementations for modeling the ecological niche include openModeller, MaxEnt, BioCLIM, DesktopGARP, etc.

George Evelyn Hutchinson (1903 – 1991)

23

Computer modeling

24

Data for the simulation model• Training set

– For the initial calibration or training step.

• Calibration set– Further calibration, tuning step– Often cross-validation on the training set

is used to reduce the consumption of raw data.

• Test set– For the model validation or goodness of

fit testing.– New external data, not used in the

model calibration.

25

A model of the real world

• Validation step– No model can ever be absolutely correct– A simulation model can only be an

approximation– A model is always created for a specific

purpose• Apply the model

– The simulation model is applied to make predictions based on new fresh data

– Be aware to avoid extrapolation problems26

Model validation

27

• Residual analysis (RMSE)

• Pearson Product-Moment Correlation Coefficient (r)

Residuals (validate model fit)

28Be aware of over-fitting! NB! Model validation!

The distance between the model (predictions) and the reference values (validation) is the residuals.

Example of a bad model calibration

Cross-validation indicates the appropriate model complexity.

Calibration step

Some examples

29

Morphological traits in Nordic Barley landraces

30

Field observations by Agnese Kolodinska Brantestam (2005)

Multi-way N-PLS data analysis, Dag Endresen (2009)

Priekuli (L) Bjorke (N) Landskrona (S)

Landrace origin locations (georeferencing)

From a total of 19 landrace accessions included in the dataset, only 4 of the landrace accessions included geo-referenced coordinates in the NordGen SESTO database.

10 accessions were geo-referenced from the reported place name and descriptions of the original gathering site included in SESTO and other sources.

For 5 accessions there were not enough information available to locate the original gathering location.

Right side illustration Example of georeferencing for NGB9529, landrace reported

as originating from Lyderupgaard using KRAK.dk and maps.google.com

31

http://www.krak.dk/query?mop=aq&mapstate=7;9.305588071850734;56.61105751259899;h;9.282591620463698;56.61775781407488;9.328584523237769;56.60435721112311;853;469&what=map_adr

http://maps.google.com/maps/ms?ie=UTF8&hl=en&msa=0&msid=107144586665622662057.00045ff98921bd0418037&ll=56.606941,9.297695&spn=0.055554,0.150204&t=h&z=13

Landrace origin locations (gathering sites)accide AccNum Country Locality Elevation Latitude Longitude Coordinate

7436 NGB27 Finland Sarkalahti, Luumäki 95 m 61.0333 27.3333 SESTO

9717 NGB456 Norway Dønna, Nordland 71 m 66.1167 12.5 Georeferenced

9601 NGB468 Norway Trysil 400 m 61.2833 12.2833 Georeferenced

9600 NGB469 Norway BJØRNEBY 400 m 61.2833 12.2833 Georeferenced

7966 NGB775 Sweden Överkalix, Allsån 45 m 66.4 22.9333 SESTO

8510 NGB776 Sweden Överkalix 100 m 66.4 22.7667 SESTO

7810 NGB792 Finland Luusua, Kemijärvi 145 m 66.4833 27.35 SESTO

9538 NGB2072 Norway Finset 1220 m 60.6 7.5 Georeferenced

8482 NGB2565 Sweden Öland 11 m 56.7333 16.6667 Georeferenced

9102 NGB4641 Denmark Støvring, Jylland 55 m 56.8833 9.8333 Georeferenced

9015 NGB4701 Faroe Islands Faroe Islands 81 m 62.0167 -6.7667 Georeferenced

9039 NGB6300 Faroe Islands Faroe Islands 81 m 62.0167 -6.7667 Georeferenced

8531 NGB9529 Denmark Lyderupgaard 9 m 56.5667 9.35 Georeferenced

7344 NGB13458 Finland Koskenkylä, Rovaniemi 91 m 66.5167 25.8667 Georeferenced

32

Multi-way analysis with PLS Toolbox and MATLAB

33

3-way cube model (climate data, X)

Min. temperature

14 s

ampl

es

Climate data (mode 3):• Minimum temperature• Maximum temperature• Precipitation• … (many more layers can be added)

Jan, Feb, Mar, …

Max. temperature

Jan, Feb, Mar, …

Precipitation

Jan, Feb, Mar, …

34

12 monthly means

14 la

ndra

ces

(loca

tion

of o

rigin

)3 cl

imate

varia

bles

X

3-way cube:

2-way array (bi-linear):36 variables

3-way cube model (trait data, Y)

6 traits

Bjørke (N)2002

6 traits 6 traits 6 traits 6 traits 6 traits

14 s

ampl

es (x

2)

Mode 2 (Traits) * Heading days* Ripening days* Length of plant* Harvest index* Volumetric weight* Grain weight (tgw)

Bjørke (N)2003

Landskrona (S)2003

Landskrona (S)2002

Priekuli (Lv)2002

Priekuli (Lv)2003

Mode 3 (experiment site)* Latvia, 2002* Latvia, 2003* Norway, 2002* Norway, 2003* Sweden, 2002* Sweden, 2003

35

6 crop traits

14 s

ampl

es, l

andr

aces

(x2)

6 year

+ loca

tion3-way cube:

Y2-way array (bi-linear):

36 variables

Trait scores (pre-processing)

36

Here: Across mode 2 (traits)

Auto-scaling is a combination of mean centering and variance scaling. Mean centering removes the absolute intensity to avoid the model to focus on the variables

with the highest numerical values (intensity). Scaling makes the relative distribution of values (range spread) more equal between variables. After auto-scaling all variables have a mean of zero and a standard deviation of one. The objective is to help the model to separate the relevant information from the noise.

Trait dataset - outlier

The influence plot (residuals against leverage) shows sample NGB6300 (FRO) observed at Priekuli in 2003 (replicate 2) with a very high leverage - well separated from the “data cloud”.

After looking into the raw data (see the table above), this observation point was removed as outlier (set to NaN).

Outlier: NGB6300, replicate 2 from Priekuli 2003 (LYR122)

37

PARAFAC split-half, trait data (3-way)

PARAFAC split-half (mode 1) analysis:

The two PARAFAC models each calibrated from two independent split-half subsets, both converge to the same solutions.

The PARAFAC 3-way method produces thus a stable model for this dataset.

38

PARAFAC split-half, climate data (3-w)

39

10 different PARAFAC split-half alternatives resulted in 2 good splits

N-PLS regression

results

40

Significance levels• Often the critical levels (a) for the p-value significance is set as

0.05, 0.01 and 0.001 (5 %, 1 %, 0.1 %).• For the modeling of 14 samples (landraces) gives:

– 12 degrees of freedom for the correlation tests (mean x, y)– One-tailed test (looking only at positive correlation of predictions versus

the reference values).– A coefficient of determination (r2) larger than 0.56 is significant at the

0.001 (0.1%) level for 14 values/samples.

Many introductory text books on statistics include a table of Critical Values for Pearson’s r. 41

N-PLS regression results

The 5% and 1% significance levels indicated by the horizontal green lines42

Heading Ripening Length H-Index Vol wgt TGW Priekuli (L) Bjorke (N) Landskrona (S)

Experiment observation site

43

• Latvia 2002 (LY11)– May 2002 was extreme dry in Priekuli.– June 2002 was extreme wet in Priekuli.– The wet June caused germination on the spikes

for many of the early varieties.

• Landskrona 2003 (LY32)– June 2003 was extreme dry in Landskrona.– June was the time for grain filling here.

• Too extreme for the genotype to be “normally” expressed ?

• Too large effect from “G by E” interaction ?

Station Year Sowing week

Rainfall (mm)

May June July August

Bjørke forsøksgård, Norway

2002 17 82.9 67.4 128.5 136.5

2003 21 75.1 85.7 67.1 53.2

Landskrona, Sweden 2002 13 53.5 75.3 76.4 68.9

2003 15 70.7 40.4 76.0 45.7

Priekuli, Latvia 2002 17 38.2 111.1 67.0 11.3

2003 19 88.0 59.2 87.8 175.8

Experiment observation site (field stations)

44Priekuli (L) Bjorke (N) Landskrona (S)

N-PLS results diagnostics

Explained variance:Evaluation of model performance:

Calibration:Prediction:

45

Net Blotch inBarley Landraces

(FIGS dataset)

46

47

Net Blotch susceptibility in Barley landraces

Eddy De PauwClimate data

Harold BockelmanNet blotch data

Ken StreetFIGS project leader

Michael MackayFIGS coordinator

Dag EndresenData analysis

• Barley (Hordeum vulgare ssp. vulgare) collected from different countries worldwide screened for susceptibility of net blotch infection (1676 greenhouse + 2975 field observations).

• Net blotch is a common disease of barley caused by the fungus Pyrenophora teres.

• Screened at four USDA research stations: North Dakota (Langdon, Fargo), Minnesota (Stephen), Georgia (Athens).

• 1-3 are basically resistant group 1• 4-6 are intermediate group 2• 7-9 are susceptible group 3

• Discriminant analysis (DA):• Correctly classified groups: 45.9 % in the training set

and 44.4 % in the test set (random for 3 groups: 33 %).

• Work in progress! (SIMCA, D-PLS, Multi-way)

Trait observation locations, USDA Research Stations

48

Dr Harold Bockelman extracted the trait data (C&E) from the GRIN database, USDA-ARS, National Plant Germplasm System, Germplasm Resources Information Network, online http://www.ars-grin.gov/npgs

http://www.ars-grin.gov/npgs

49

Climate data• Agro-climatic Zone (UNESCO classification)• Soil classification (FAO Soil map)• Aridity (dryness)• Precipitation• Potential evapotranspiration (water loss)• Temperature • Maximum temperatures • Minimum temperatures

(mean values for month and year)Eddy De Pauw (ICARDA, 2008)

Sunn Pest in wheat (ICARDA)

50

• No sources of Sunn pest resistance previously found in hexaploid wheat.

• 2 000 accessions screened at ICARDA without result (during last 7 years).

• A FIGS set of 534 accessions was developed and screened (2007, 2008).

• 10 resistant accessions were found!

• The FIGS selection started from 16 000 landraces from VIR, ICARDA and AWCC

• Exclude origin CHN, PAK, IND were Sunn pest only recently reported (6 328 acc).

• Only accession per collecting site (2 830 acc).• Excluding dry environments below 280 mm/year• Excluding sites of low winter temperature below 10 degrees

Celsius (1 502 acc)

http://dx.doi.org/10.1007/s10722-009-9427-1

Slide adopted from Ken Street, ICARDA (FIGS team)

http://dx.doi.org/10.1007/s10722-009-9427-1

FIGS selection - a lifeboat to the gene pool

51