IAALD AFITA WCCA2008 WORLD CONFERENCE ON AGRICULTURAL INFORMATION AND IT

Comparing Multivariate Statistical Techniques and Supervised and Unsupervised Neural Networks in Identifying of Subspecies and Origins of Wheat accessions

Javad Khazaei 1, Mohammad R Naghavi 2, Mansoureh Danesh 1 and Rasoule Amirian 2
1 University College of Abouraihan, University of Tehran, Tehran, Iran. [email protected]
2 Plant Breeding Dept., Agricultural College, The University of Tehran, Karaj, Iran

Abstract

This paper focuses on the identification of wheat (Aegilops tauschii) accessions from their morphological features using artificial neural networks (ANN) and multivariate discriminant analysis (MDA) techniques. ANNs can support the identification of unknown crop accessions. We developed a supervised and an unsupervised ANN to identify the 2 subspecies and 3 origins of 55 wheat accessions. Nineteen accessions belonged to subspecies strangulata and 36 to subspecies tauschii. The 55 accessions came from Iran and from some other Middle East countries (Turkey, Azerbaijan, Tadjikistan, Turkmenistan, Afghanistan and Armenia); the origin (country) of 18 accessions was unknown. Phenotypic diversity among accessions was determined by measuring spikelet number, peduncle length, stem length, spikelet weight, seed weight, spike length, number of fertile stems, days to 50% flowering, days to 50% maturity, spike firmness, stem color, spike color, awn color, stem state, spike thickness, seed shape, awn state, number of seeds in spikelet, and wooly leaf. An unsupervised self-organizing map (SOM) and a supervised back-propagation (BP) network were applied to classify the 2 subspecies and 3 origins of the wheat samples using these 19 morphological features.

The results obtained with the BP and SOM methods were checked against principal component analysis (PCA) and hierarchical cluster analysis (HCA) to see whether these well-known techniques would give similar results. Although PCA and HCA separated the two subspecies of Ae. tauschii, they could not group the accessions well by site of origin. Both the BP and SOM ANNs performed better than PCA and HCA. ANNs could therefore be considered a support tool in the identification of unknown accessions.

Introduction

Wheat (Aegilops tauschii) is the most important cereal crop in terms of both world production and consumption. Cultivated bread and durum wheats descend from hybridized wild grasses. Eig (1929) divided Ae. tauschii into two subspecies, tauschii and strangulata. The definition and identification of crop accessions are of considerable scientific and practical importance in modern agriculture. Classical methods for identifying and classifying plant accessions are based on their morphological characteristics. Morphological characters can be used to estimate the variation within and between accessions.



In recent years, morphological data have been used to address the complex problem of defining and classifying crop accessions. Multivariate statistical techniques, mainly principal component analysis (PCA), linear discriminant analysis (LDA), and, to a lesser extent, hierarchical cluster analysis (HCA), have been widely used for the assessment and classification of plant accessions according to their morphological characteristics (Aghaei et al., 2008; Manjunatha et al., 2007). Hammer (1980) and Knaggs et al. (2000) studied the germplasm of Ae. tauschii using morphological characters.

DNA marker technologies (randomly amplified polymorphic DNA, RAPD; AFLP) have also proven useful for characterizing varieties (Malaki et al., 2006; Peng et al., 2003). Molecular methods are precise, but they are very expensive. LDA and PCA are linear transformations that are well suited to separating multidimensional data into different objects or classes (Aghaei et al., 2008; Manjunatha et al., 2007). Linear transforms typically extract information only from the second-order correlations in the data (the covariance matrix) and ignore higher-order correlations. Many researchers have reported that much real-world multidimensional data is inherently non-symmetric (Scholkopf et al., 1998; Siripatrawan, 2008). As an alternative to multidimensional analysis of variance, a newer technique may be applied to the identification and classification of crop accessions, with the hope of making identification easier, faster, and even automatic: artificial neural networks (ANNs), which are among the most commonly used nonlinear techniques.

Two main classes of ANNs have been used in computer-based systems for identifying biological objects: supervised and unsupervised ANNs. Supervised ANNs are calibrated to classify samples using a training set in which the desired target value of each sample is known and specified. The aim of this learning procedure is to find a mapping from input patterns to targets, in this case from patterns of morphological features to accession classes. The multilayer perceptron (MLP), or feed-forward ANN, has been the most popular. Unsupervised learning involves no target values: the term ‘unsupervised’ means that knowledge of the crop is not learned from specific input–output examples. Unsupervised competitive learning is used in a wide variety of fields under a wide variety of names, the most common of which are data partitioning, classification, and cluster analysis. A popular unsupervised ANN for clustering is Kohonen’s self-organizing map (SOM), devised mainly for visualizing nonlinear relations in multidimensional data (Kohonen, 1995). A SOM can group complex sample data without strict assumptions and without a priori knowledge of the number of groups present (Giraudel and Lek, 2001). With unsupervised networks, patterns are presented to the network and it forms its own groupings of the data. With supervised training, which is appropriate for identification, data patterns of known identity are presented to the ANN as exemplars. Once the network is trained, any data pattern can be presented to it and the output analysed to find the most likely identity of that pattern. ANNs can be applied as an alternative to various statistical procedures, and are particularly useful where the relationships between predictor and dependent variables are non-linear (Basheer and Hajmeer, 2000).

Several studies have shown that, for classification, ANNs often have superior predictive performance compared with conventional statistical procedures such as discriminant analysis and logistic regression (e.g. Manel et al., 1999). Moshou et al. (2005) found that SOM-ANNs classified plant disease better than quadratic discriminant analysis methods. Kurdthongmee (2008) reported that a SOM could be used successfully to classify wood boards of naturally different shades and colours on the basis of colour features. The self-organizing map has also been used successfully to classify bryophytes according to their concentrations of chemical elements (Samecka-Cymerman et al., 2007).

Supervised ANNs have also been applied as an alternative to classical multivariate statistical techniques in the classification of crop varieties. For example, supervised ANNs have been used to classify olive varieties on the basis of chemical indices (Marini et al., 2004). Ceca and Moro (1997) reported that supervised ANNs could be considered a promising technique for developing support tools for the selection of new varieties. They found that multivariate discriminant analysis does not offer any substantial advantage over traditional multidimensional analysis of variance for this purpose.

The objective of the present paper was to describe and evaluate the efficiency of supervised (back-propagation) and unsupervised (self-organizing map) artificial neural networks in identifying the origin and subspecies of 55 wheat accessions from their morphological characteristics. The aim is to explore new ways of identifying wheat accessions of unknown origin, or at least to find procedures that would help the process. The results were compared with those obtained with PCA, a common multivariate statistical technique for the analysis of crop accessions.

Materials and Methods

Fifty-five accessions of Aegilops tauschii were provided by the gene bank of the Agricultural College at the University of Tehran, Iran. Eighteen accessions were of subspecies strangulata and 37 were of subsp. tauschii. The accessions came from Iran and from some other Middle East countries (Turkey, Azerbaijan, Tadjikistan, Turkmenistan, Afghanistan and Armenia); the origin (country) of 18 accessions was unknown (Table 1). Each accession was planted in 1 m long rows with 0.5 m row spacing at the experimental station of the Agriculture College at the University of Tehran, Iran, during 2004. For characterization and evaluation, morphological data were recorded following the descriptors established for Aegilops (IBPGR, 1981), with some modifications. The morphological data comprised 9 quantitative and 10 qualitative characters, as follows.

Days to 50% flowering (days from sowing to appearance of 50% flowers) and days to 50% maturity (days from sowing to appearance of first maturity) were recorded as a single value for each row. Spikelet number, peduncle length (cm), stem length (cm), spikelet weight (g), seed weight (g), spike length (cm) and number of fertile stems were measured on 5 plants chosen at random in each row, and the means of these quantitative data were used for analysis. The ten qualitative morphological characters and their scoring patterns are explained in Table 2.

For each country of origin, the recorded data were summarized with simple statistics (mean and standard deviation). All accessions of Ae. tauschii were subjected to principal component analysis using the SPSS software. PCA was performed to investigate the importance of different characters in explaining multivariate polymorphism (Mallikarjuna et al., 2003), and the accessions were then plotted according to their first two principal component scores.

Using the 19-feature set, four pattern recognition methods were tested for identifying the 2 subspecies (tauschii and strangulata) and 3 origins (Iran, some other Middle East countries, and unknown origin) of the 55 wheat accessions: supervised ANN, unsupervised ANN, hierarchical cluster analysis (HCA), and PCA.

Table 1. Accessions and their geographic origin

Origin           Subspecies    Accessions
Iran             strangulata   AE67, AE27, AE39, AE4, AE5
Iran             tauschii      AE1, AE2, AE4, AE3, AE6, AE10, AE18, AE20, AE21, AE26, AE28, AE35, AE51, AE62, AE71

Some other Middle East countries
Afghanistan      strangulata   AE33, G437
Afghanistan      tauschii      G435, G3407
Armenia          strangulata   G3448
Armenia          tauschii      AE74
Azerbaijan       strangulata   G3418, AE68
Azerbaijan       tauschii      G3408, AE73
Tadjikistan      strangulata   AE34, AE44
Tadjikistan      tauschii      G3428
Turkmenistan     strangulata   AE60, G3404
Turkey           tauschii      AE15, AE36

Unknown origin   tauschii      AE30, AE31, AE32, AE37, AE38, AE40, AE46, AE55, AE57, AE63, AE65, AE70, AE72, G3428
Unknown origin   strangulata   AE9, AE16, AE22, AE69

The multivariate statistical techniques HCA and PCA were performed on data normalized to the range 0 to 1, as this was the normalization chosen for the analyses with the SOM and BP-ANNs. In HCA, the dendrograms were obtained using Ward's method as the linkage method and the squared Euclidean distance as the distance measure. PCA was performed with varimax rotation, and only principal components with eigenvalues higher than 1 were retained.

Artificial Neural Networks

In this study, an unsupervised SOM and a supervised BP-ANN were developed to classify the 55 wheat accessions by their three origins and two subspecies using nineteen morphological features: spikelet number, peduncle length, stem length, spikelet weight, seed weight, spike length, number of fertile stems, days to 50% flowering, days to 50% maturity, spike firmness, stem color, spike color, awn color, stem state, spike thickness, seed shape, awn state, number of seeds in spikelet, and wooly leaf. The SOM performance was compared with that of the more sophisticated supervised ANN technique. All 55 samples were arranged in a 19×55 matrix.
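The preprocessing and clustering settings described above (normalization to the range 0 to 1, Ward's linkage, squared Euclidean distance) can be illustrated with a minimal sketch. It is not the SPSS procedure used in the study: the function names, the naive merge loop and the toy data are assumptions made for illustration only.

```python
from itertools import combinations


def min_max_normalize(rows):
    """Rescale every column of `rows` (samples x features) to the range 0-1."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    At each step, merge the two clusters whose union increases the
    within-cluster sum of squares the least:
        cost = |A||B| / (|A| + |B|) * squared distance between centroids
    Returns a sorted list of clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            ca = centroid([points[i] for i in clusters[a]])
            cb = centroid([points[i] for i in clusters[b]])
            na, nb = len(clusters[a]), len(clusters[b])
            cost = na * nb / (na + nb) * squared_euclidean(ca, cb)
            if best is None or cost < best[0]:
                best = (cost, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy data: 6 samples x 2 features (not the wheat measurements).
toy = [[7.0, 20.0], [7.2, 21.0], [6.9, 19.5],
       [9.5, 44.0], [9.8, 43.0], [9.6, 45.0]]
scaled = min_max_normalize(toy)
groups = ward_clusters(scaled, n_clusters=2)
print(groups)
```

Normalizing first matters because the squared Euclidean distance would otherwise be dominated by the characters with the largest numeric range, such as days to maturity.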

Table 2. Qualitative morphological characters and scoring pattern.

Character                      Score   Trait position
Spike firmness                 0       Frail
                               1       Firm
Stem color                     0       Brown
                               1       Yellow-brown
Spike color                    0       Brown-purple
                               1       Yellow-brown
Awn color                      0       Brown-purple
                               1       Yellow-brown
Stem state                     0       Horizontal
                               1       Straight
Spike thickness                0       Thin
                               1       Thick
Seed shape                     0       Short
                               1       Tall
Awn state                      0       Short
                               1       Tall
Number of seeds in spikelet    0       1
                               1       2
Wooly leaf                     0       With wooly
                               1       Without wooly

Back-propagation ANN

In this study, a supervised BP-ANN model was developed. For both origin and subspecies identification, 25 accession samples were included in the training set and the remaining 30 patterns (12 accessions of known origin and 18 samples of unknown origin) were left as the test set. To minimize differences in magnitude between features, all input and output values were normalized to between 0 and 1.0. The input and target output pairs were used to train the weights of networks with sigmoid, linear, and hyperbolic tangent transfer functions.

To determine the optimal number of neurons in the hidden layers, 19-x-y-z-1 ANN architectures were trained. Initially, the numbers of neurons in the first, second, and third hidden layers were set to 3, zero, and zero, respectively (x = 3, y = 0, z = 0). The number of neurons was then increased in steps of 3 to improve model performance, covering 3 to 30 neurons in each hidden layer. The learning rate was varied in the range 0.1–0.7 and the momentum term between 0.4 and 0.8. The number of hidden layers, the number of hidden neurons, and the total number of learning epochs were also varied from 1 to 3, from 8 to 40, and from 5×10³ to 10×10⁴, respectively. Once a neural network had been trained on the training dataset, its performance was evaluated using the known samples in the training and testing datasets. Classification success was defined as the proportion of correctly classified individuals, assessed per accession origin or per accession subspecies, as a general indicator of identification success.
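To make the training procedure concrete, the following is a minimal sketch of back-propagation with a momentum term and of the classification-success measure defined above. It is not the MATLAB model used in the study: it has a single hidden layer and two inputs rather than a 19-x-y-z-1 architecture, and the class name `TinyMLP`, the toy data and the 0.5 decision threshold are assumptions for illustration. The learning rate and momentum values are taken from the ranges quoted in the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One-hidden-layer perceptron trained by online back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row ends with a bias weight.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, needed for the momentum term.
        self.v_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.v_out = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hb)))
        return xb, hb, out

    def predict(self, x):
        return self.forward(x)[2]

    def train_step(self, x, target, lr, momentum):
        xb, hb, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w_hidden))]
        for j in range(len(self.w_out)):
            self.v_out[j] = momentum * self.v_out[j] - lr * delta_out * hb[j]
            self.w_out[j] += self.v_out[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                self.v_hidden[j][i] = (momentum * self.v_hidden[j][i]
                                       - lr * delta_hidden[j] * xb[i])
                row[i] += self.v_hidden[j][i]


def mean_squared_error(net, samples):
    return sum((net.predict(x) - t) ** 2 for x, t in samples) / len(samples)


def classification_success(net, samples):
    """Proportion of samples whose thresholded output matches the target."""
    hits = sum((net.predict(x) >= 0.5) == (t >= 0.5) for x, t in samples)
    return hits / len(samples)


# Hypothetical toy data already scaled to 0-1 (not the wheat measurements):
# the target is 1 when the first feature is large.
toy = [([0.1, 0.3], 0.0), ([0.2, 0.8], 0.0), ([0.3, 0.5], 0.0),
       ([0.15, 0.6], 0.0), ([0.8, 0.2], 1.0), ([0.9, 0.7], 1.0),
       ([0.7, 0.4], 1.0), ([0.85, 0.9], 1.0)]

net = TinyMLP(n_in=2, n_hidden=3)
error_before = mean_squared_error(net, toy)
order = random.Random(1)
for epoch in range(500):
    order.shuffle(toy)  # present the patterns in random order
    for x, t in toy:
        net.train_step(x, t, lr=0.43, momentum=0.75)
error_after = mean_squared_error(net, toy)
print(error_before, error_after, classification_success(net, toy))
```

A search over architectures of the kind described in the text would wrap this training loop in loops over the number of hidden layers and neurons, keeping the configuration with the lowest error on the training and test sets.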


The best ANN structure and the optimum values of the network parameters were obtained by trial and error, on the basis of the lowest error on the training and test sets. All calculations were carried out using MATLAB 7.2 (Mathworks, Inc., Natick, MA).

Self-Organizing Maps ANNs

The input layer of the SOM ANN consisted of 19 neurons (one per morphological characteristic) connected to the 55 input patterns. According to Lee et al. (2005), selecting the appropriate number of output nodes is quite difficult and usually experiment-dependent, and there is no consensus among researchers on the subject. To obtain good mapping results, the number of output nodes in the SOM should be at least 10–20% of the number of training vectors. Using too few output nodes, however, may crowd input vectors onto a single output node, which makes it difficult to distinguish the characteristics of the output space. A SOM network of x×y nodes (x and y varied from 2 to 8) was employed to classify the 55 wheat accessions into either three origin classes or two subspecies classes from the input data matrix (19×55).
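A minimal sketch of a SOM of this kind is given below, using the training settings reported for the SOM in this section (Euclidean distance, initial learning rate 0.6, decay 0.1, a neighborhood of 1 node, weights initialised from the data, random presentation order). It is not the MATLAB implementation used in the study. The exponential form of the learning-rate decay, the fixed neighborhood radius, the function names and the toy data are assumptions for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest (Euclidean) to x."""
    distances = [math.sqrt(sum((w - v) ** 2 for w, v in zip(node, x)))
                 for node in weights]
    return distances.index(min(distances))


def train_som(data, rows, cols, epochs=100, lr0=0.6, lr_decay=0.1,
              radius=1, seed=0):
    """Train a rows x cols self-organizing map on `data` (values in 0-1)."""
    rng = random.Random(seed)
    # Initial weights are copies of randomly drawn input patterns.
    weights = [list(rng.choice(data)) for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    for epoch in range(epochs):
        lr = lr0 * math.exp(-lr_decay * epoch)  # assumed decay schedule
        for x in rng.sample(data, len(data)):   # random presentation order
            wr, wc = grid[best_matching_unit(weights, x)]
            for k, (r, c) in enumerate(grid):
                # Update the winner and its grid neighbours within `radius`.
                if abs(r - wr) + abs(c - wc) <= radius:
                    weights[k] = [w + lr * (v - w)
                                  for w, v in zip(weights[k], x)]
    return weights


def label_nodes(weights, data, labels):
    """Give each winning node the majority label of the samples it wins."""
    votes = {}
    for x, label in zip(data, labels):
        votes.setdefault(best_matching_unit(weights, x), []).append(label)
    return {node: max(set(v), key=v.count) for node, v in votes.items()}


# Hypothetical toy data scaled to 0-1 (not the wheat measurements).
toy = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.15], [0.15, 0.1, 0.2],
       [0.9, 0.8, 0.85], [0.8, 0.9, 0.9], [0.85, 0.95, 0.8]]
toy_labels = ["a", "a", "a", "b", "b", "b"]

som = train_som(toy, rows=2, cols=2)
node_labels = label_nodes(som, toy, toy_labels)
print(node_labels)
```

The class labels play no part in training; they are only used afterwards to name the nodes, which is what makes the method unsupervised.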

The SOM was trained on a normalized data set (all variables normalized to the range 0–1). Euclidean distance was used as the error function. The learning rate and its decay were initially set at 0.6 and 0.1, respectively; the neighborhood size was 1 node and the neighborhood decay rate was set to 1. Input data were presented in random order, and the initial weight distribution was taken from the data.

Results and discussion

Means and standard deviations of the quantitative morphological characters of the accessions, for each country of origin and for each subspecies, are shown in Tables 3 and 4, respectively. Large variation was observed for many of the characters. Although there were some differences among the means for both countries and subspecies, the mean separation was seldom clear-cut. For spikelet weight, high within-country variation was found in most cases. The accessions from Iran showed remarkably high mean values and high standard deviations for spikelet number and spike length. In some cases, such as Turkey, the limited number of accessions probably accounts for the very low standard deviations for most traits. Sample size was not the only factor, however, as the Armenian gene pool showed the highest standard deviations for number of fertile stems and days to 50% maturity.

The results in Table 4 show that the mean values of the morphological features were higher for subsp. strangulata than for subsp. tauschii, except for spikelet number, peduncle length, and spike length. Subsp. tauschii showed remarkably high standard deviations for all measured morphological features.


Table 3. Mean and standard deviation of quantitative morphological characters for each country of origin of the wheat accessions. Turkey, Azerbaijan, Tadjikistan, Turkmenistan, Armenia and Afghanistan form the group "Some other Middle East countries".

Character                Iran        Turkey        Azerbaijan    Tadjikistan   Turkmenistan  Armenia       Afghanistan   Unknown origin
Spikelet number          7.75±1.3    7.18±0.26     6.78±0.47     7.08±0.28     6.31±0.44     6.87±0.18     7.0±0.71      7.08±0.87
Peduncle length (cm)     19.9±2.6    18.81±0.26    21.18±1.43    22.22±6.63    24.21±1.37    23.34±2.25    19.4±6.7      22.2±6.2
Stem length (cm)         39.8±4.3    36.00±2.12    43.53±3.81    43.54±4.56    43.68±4.33    46.81±2.56    41.2±8.49     40.5±85.4
Spikelet weight (g)      6.9±1.54    6.53±1.10     7.75±2.05     8.31±3.27     8.30±1.41     7.94±3.62     6.9±0.86      7.57±1.31
Seed weight (g)          1.4±0.24    1.28±0.02     1.49±0.28     1.38±0.34     1.48±0.05     1.58±0.14     1.5±0.10      1.50±0.18
Spike length (cm)        7.17±1.2    6.81±0.61     6.21±0.31     6.52±0.64     5.59±0.92     6.68±0.26     6.3±0.94      7.35±2.57
Number of fertile stem   24.2±5.5    21.83±0.70    20.75±8.86    27.27±3.67    23.41±0.59    25.58±12.8    25.5±8.36     26.0±4.46
Days to 50% flowering    186.9±5     180.5±0.70    190.25±2.5    189.66±3.8    190.5±0.70    189.5±0.70    187.7±5.1     187.8±4.9
Days to 50% maturity     223±6.1     213.5±0.70    227.5±5.74    227.33±5.5    231±1.41      224.5±9.19    226±7.52      229.8±5.45

Table 4. Mean and standard deviation of quantitative morphological characters for each subspecies of the wheat accessions.

Character                tauschii       strangulata
Spikelet number          7.57±1.08      6.71±0.51
Peduncle length          21.13±5.28     20.94±2.83
Stem length              39.37±4.73     43.77±4.55
Spikelet weight          6.58±1.30      8.76±1.10
Seed weight              1.36±0.23      1.55±0.14
Spike length             7.44±1.90      6.08±0.60
Number of fertile stem   24.56±5.78     25.07±5.39
Days to 50% flowering    186.75±5.57    189.32±2.96
Days to 50% maturity     224.3±7.37     229.6±3.15

Supervised and Unsupervised ANNs

This paper investigated the suitability of BP- and SOM-ANNs for discriminating the 2 subspecies (tauschii and strangulata) and 3 origins (Iran, some other Middle East countries, and unknown origin) of 55 wheat accessions from their morphological characters. The results were compared with those obtained with PCA and HCA, common multivariate statistical techniques for the analysis of crop varieties. The morphological data used to test the BP- and SOM-ANNs comprised 19 qualitative and quantitative variables and came from two subspecies and three origins (Iran, some other Middle East countries, and unknown origin).

The number of experimental variables included in the ANN models was chosen as a compromise between the complexity of the model and its final predictive ability; in particular, all 19 variables were necessary to obtain the best performance of the ANN classifiers. The results showed that the generalization ability and prediction accuracy of ANNs depend on the structure of the network (Haykin, 1999; Al-Haddad et al., 2000). Various feed-forward ANNs were trained and tested using the experimental data.

The results showed that ANNs with three hidden layers, trained with the BP algorithm and a sigmoid transfer function, gave the best performance in creating a nonlinear mapping between input and output parameters. The best BP-ANN model for classifying wheat accessions by origin was a 19-20-5-2-1 network (weighted connections including bias nodes), with learning rate = 0.43 and momentum = 0.75, trained for 10×10⁴ epochs. Among the ANN structures trained for 10×10⁴ epochs, good training performance for subspecies classification was obtained with a 19-15-7-1 network trained with the BP algorithm and a sigmoid transfer function.

Table 5 summarizes the training and prediction accuracies of the BP-ANN. The BP-ANN model correctly recognized the origin of accessions for 91.65% of the training samples and correctly predicted, on average, 90% of the test samples of known origin. Different realizations of the training and test sets did not substantially change performance, as indicated by the low standard deviations observed over the 20 independent runs.

Table 5. Classification success (proportion of correct classifications) of origin and subspecies of wheat accessions based on the morphological features, using BP-ANN analysis.

Accession origin

                              Training set                                   Testing set
Actual class                  n    Some Middle East  Iran  Correct (%)       n    Some Middle East  Iran  Correct (%)
                                   countries                                      countries
Some Middle East countries    10   9                 1     90.0              7    7                 0     100
Iran                          15   1                 14    93.3              5    1                 4     80
Unknown                       -    -                 -     -                 18   2                 16    88.8*

Accession subspecies

                              Training set                                   Testing set
Actual class                  n    strangulata       tauschii  Correct (%)   n    strangulata       tauschii  Correct (%)
strangulata                   11   10                1         90.9          8    8                 0         100
tauschii                      23   1                 22        95.7          13   2                 11        84.6

* DNA analysis later showed that all the accessions of unknown origin were from Iran.
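The percentages in Table 5 follow directly from the counts, as the short sketch below shows for the origin part of the table. The dictionary layout and function names are illustrative choices. Note two small rounding effects: 16 of 18 is 88.9% when rounded (the table gives 88.8, which looks truncated), and the 91.65% quoted in the text is the mean of the already rounded values 90.0 and 93.3.

```python
# Counts transcribed from Table 5, origin part: (n, correctly classified).
training = {"Some Middle East countries": (10, 9), "Iran": (15, 14)}
testing = {"Some Middle East countries": (7, 7), "Iran": (5, 4),
           "Unknown": (18, 16)}


def percent_correct(n, correct):
    """Classification success for one class, as a percentage."""
    return 100.0 * correct / n


def mean_percent(table, classes):
    """Unweighted mean of the per-class percentages."""
    return sum(percent_correct(*table[c]) for c in classes) / len(classes)


train_pct = {c: round(percent_correct(*v), 1) for c, v in training.items()}
test_pct = {c: round(percent_correct(*v), 1) for c, v in testing.items()}
known = ["Some Middle East countries", "Iran"]

print(train_pct, round(mean_percent(training, known), 2))
print(test_pct, round(mean_percent(testing, known), 1))
```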

The ability of the BP-ANN to identify accessions of unknown origin was also satisfactory (Table 5). DNA analysis showed that all the samples of unknown origin were from Iran, and the best BP-ANN model correctly recognized 16 of these 18 samples (Table 5), an overall proportion of correct classifications of 88.8%. This is a very promising result because, even with a small data set, the ANN allowed a quite complete discrimination of the samples using the 19 morphological features. The error scores over the training and test sets were of the same order of magnitude, confirming the goodness of the selection algorithm, which performs a uniform mapping.

Several authors have also successfully used supervised artificial neural networks for the identification of microorganisms on the basis of complex patterns such as phenotypic data (Kennedy and Thakur, 1993), flow cytometry data (Wilkins et al., 1999), restriction patterns (Carson et al., 1995), signature lipid biomarkers (Almeida et al., 1995), pyrolysis mass spectra (Goodacre et al., 1994a,b, 1998a,b; Voisin et al., 2004), Fourier-transform infrared spectra and dispersive Raman microscopy (Goodacre et al., 1998a,b), fatty acid composition (Giacomini et al., 1997, 2000), simplified RAPD patterns (Moschetti et al., 2001) and rep-PCR genomic fingerprints (Tuang et al., 1999).

In the unsupervised analysis, a SOM network of 4×4 nodes was used to identify wheat accessions of similar origin from their morphological features (input data matrix 19×55). The network was trained for 10,000 epochs with patterns selected randomly rather than sequentially. By labeling each neuron on the map with the appropriate subgroup terms, the clustering of the wheat accessions was recovered from the morphological data. The output patterns generated by the SOM network showed that the self-organizing map was able to distinguish the origins of different wheat accessions. The overall proportion of correct classifications of accessions was 76.12% (Table 6).

Table 6. Performances of SOM-ANNs as percentage of correct origin and subspecies identifications using the 19 morphological features.

Accession origin              n    Some Middle East  Iran  Unknown  Correct (%)
                                   countries
Some Middle East countries    17   12                5     -        70.6
Iran                          20   4                 16    -        80
Unknown                       18   4                 14    -        77.77

Accession subspecies          n    strangulata       tauschii  Correct (%)
strangulata                   19   18                1         94.7
tauschii                      36   1                 35        97.2

For the 55 accessions considered, an origin classifier of 6×6 nodes reached a performance of 60%, while a 2×2 structure correctly classified only 54.5% of the accessions in the database.

The ability of the SOM to identify accessions of unknown origin was only partly satisfactory (Table 6): four accessions classified by the DNA test as Iranian were misclassified by the SOM analysis. This may be due to the relatively low number of patterns used to train the network, or even to a misclassification of accessions by cluster analysis. In general, the ANN performance as indicated by prediction errors was marginally better for the BP-ANN (Table 5) than for the SOM network (Table 6). The overall proportion of correct classifications of accession origin based on the morphological features was 89.6% with the BP-ANN and 76.12% with the SOM. The overall success rate was only marginally, but nevertheless significantly (P < 0.01), higher for the BP-ANN than for the SOM.

For SOM classification of the subspecies of wheat accessions, the 55 patterns available for the morphological features were used to train a SOM network with 4×4 nodes. This self-organizing map clearly separated the strangulata samples from subsp. tauschii, with an overall proportion of correct classifications of 95.95% (Table 6). The overall success rate was only marginally, but nevertheless significantly (P < 0.05), higher for SOM (95.95%) than for BP-ANN (92.3%) (Tables 5 and 6). SOM networks consisting of 6×6 and 2×2 nodes gave proportions of correct classifications of 90.7% and 83.6%, respectively. The results obtained with the BP and SOM methods were verified by means of principal component analysis and hierarchical cluster analysis to check whether these well-known techniques would give similar results.

HCA and PCA analysis

The multivariate statistical techniques HCA and PCA were performed on data normalized to the range between 0 and 1, as this was the normalization method chosen for the analysis with the SOM and BP ANNs. Principal component analysis was done to determine which of the characters contributed most strongly to the principal components. PCA reduced the original 19 characters to 3 principal components, which together accounted for 67.77% of the total variance (Table 7). The other PCs had eigenvalues <1 and have not been interpreted.

Table 7. Summary of PCA results for 19 characters in 55 accessions of Ae. tauschii.

Character                 PC1      PC2      PC3
Spikelet number           -0.743   0.257    0.020
Peduncle length           -0.090   0.772    -0.129
Stem length               0.481    0.675    -0.020
Spikelet weight           0.701    0.208    0.451
Seed weight               0.721    0.060    0.442
Spike length              -0.657   0.521    0.178
Number of fertile stem    0.168    -0.673   -0.010
Days to 50% flowering     -0.08    -0.020   0.817
Days to 50% maturity      0.351    -0.146   0.820
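The PCA steps described above (scaling to the range 0 to 1, extracting components, keeping those with eigenvalues above 1) can be sketched as follows. The eigenvalue rule is assumed here to refer to a PCA of the correlation matrix, which the paper does not state. The data are synthetic, the function names are hypothetical, and power iteration with deflation stands in for whatever statistical package the authors used.

```python
import math
import random


def min_max_normalize(rows):
    """Rescale every column of a samples x characters table to [0, 1].

    Assumes no column is constant.
    """
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)] for row in rows]


def correlation_matrix(rows):
    """Pearson correlation matrix of the columns."""
    n = len(rows)
    z = []
    for c in zip(*rows):
        mean = sum(c) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in c) / (n - 1))
        z.append([(v - mean) / sd for v in c])
    p = len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
            for i in range(p)]


def components_above_one(corr, iterations=2000, seed=0):
    """Eigenvalues > 1 with their unit eigenvectors, largest first."""
    p = len(corr)
    work = [row[:] for row in corr]
    rng = random.Random(seed)
    kept = []
    for _ in range(p):
        v = [rng.random() + 0.1 for _ in range(p)]
        for _ in range(iterations):                # power iteration
            w = [sum(work[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        wv = [sum(work[i][j] * v[j] for j in range(p)) for i in range(p)]
        eigenvalue = sum(a * b for a, b in zip(v, wv))  # Rayleigh quotient
        if eigenvalue <= 1.0:                      # stop at the first value not above 1
            break
        kept.append((eigenvalue, v))
        # Deflation: remove the component just found before searching for the next
        work = [[work[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
                for i in range(p)]
    return kept


# Synthetic stand-in: 55 samples, 6 characters driven by two latent factors
rng = random.Random(42)
samples = []
for _ in range(55):
    f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
    samples.append([f1 + 0.3 * rng.gauss(0, 1) for _ in range(3)] +
                   [f2 + 0.3 * rng.gauss(0, 1) for _ in range(3)])

corr = correlation_matrix(min_max_normalize(samples))
kept = components_above_one(corr)
# For a correlation matrix the total variance equals the number of characters
explained = [100.0 * value / len(corr) for value, _ in kept]
print(len(kept), [round(e, 1) for e in explained])
```

Min-max scaling leaves a correlation-based PCA unchanged, so the normalization step affects the result only if the covariance matrix is analysed instead.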

The PCA results can be used to categorize the wheat accessions by means of score plots. Figure 1 shows the score plot formed by the first two PCs, which is considered the most informative since these PCs account for the greatest proportion of the original variance (in this case, PC1 explained 32.49% of the total variance and PC2 21%). The majority of accessions formed the clusters I and II found with the SOMs. The major contributions to PC1 came from spikelet weight and seed weight (positive) and from spikelet number and spike length (negative) (Table 7), which reflected the high levels of these features in cluster I. PC2 accounted for 21% of the total variation, and the characters with the greatest weight on this component were peduncle length and stem length (positive) and number of fertile stem (negative). PC3 was mainly related to days to 50% flowering and days to 50% maturity (Table 7). The score plot of PC1 vs. PC2, however, did not allow any further origin groups of accessions to be separated. The other possible score plots, i.e. all possible pairs of the 3 PCs (not shown in the article), did not allow the separation of these accessions either. Accessions from Iran are spread throughout the plot, which confirms the high level of gene diversity of Ae. tauschii in Iran reported by Lubbers et al. (1991), Dvorak et al. (1998) and Pestsova et al. (2000). As the unknown accessions were dispersed across all sites, it was difficult to find their origins (Fig. 1).

Principal component analysis made it possible to divide the Ae. tauschii germplasm collection into genetic groups. Based on the first two principal components, the subsp. strangulata and subsp. tauschii accessions were distinguished from each other (Fig. 1). The accessions of subsp. strangulata showed a larger dispersion than the accessions of subsp. tauschii. Knaggs et al. (2000) also studied the plant morphology of 54 accessions of Aegilops tauschii and distinguished subsp. strangulata from subsp. tauschii. In contrast to the elongated glumes of subsp. tauschii, the glumes of subsp. strangulata are less elongated. Kihara et al. (1994) reported hybridization and morphologically intermediate forms between subsp. strangulata and subsp. tauschii in Iran, and gene migration between the two subspecies in this country has been reported by Dvorak et al. (1998).

[Figure 1: score plot with PCA1 on the horizontal axis and PCA2 on the vertical axis, both from -3 to 2; points are labeled by origin code (AFG, ARM, AZR, IRN, TAJ, TKM, TUR, UN).]

Fig. 1. PCA score plot of wheat accessions from some Middle East countries (ME), Iran (IR), and some with unknown origin (UN). * and ▫ represent the strangulata and tauschii subspecies, respectively.

The comparison with PCA showed, therefore, that the results obtained with the SOM were in general better than those obtained with these statistical methods. In addition, as has been reported in other research fields (e.g. Leflaive et al., 2005; Tison et al., 2005; Piraino et al., 2006), the SOM provided a more detailed classification, which was more useful for subsequent investigations or scientific decisions. The SOM technique creates a network that stores information in such a way that any topological relationship within the training set is maintained, which is not possible with PCA. A PCA plot of scores (coordinates of objects for the new variables) gives information about similarities between samples, and the plot of loadings shows correlations between the original variables and the first two factors (Legendre and Legendre, 1998). Whereas PCA only classifies a set of data for a number of objects, the neural network is not only a classifying tool but also a system which, once trained, will allocate all new data to the proper classes.

Indeed, the great advantage of the SOM has turned out to be its powerful visualization tools, which clearly outperformed the representations obtainable with HCA (dendrogram) and PCA (score plot) for classifying accession samples while integrating multiple variables. The visualization power of PCA deteriorates considerably when 3 or more components have to be retained to explain a significant proportion of the variability; in those situations, meaningful interpretation of two-dimensional score plots proves very difficult. While HCA is limited to showing the classification of samples in a dendrogram, the SOM component planes reveal useful information for interpreting the results that remains hidden with the traditional approaches. Simple visual inspection of the component planes makes it possible to identify the morphological characteristics shared by the accessions grouped together by the SOM algorithm.
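A component plane is the trained map redrawn for one input variable at a time: each cell holds the weight that the corresponding node assigns to that variable. The short Python sketch below illustrates the idea with an invented 2×2 map and two features; the weights, the feature names chosen, and the function name are hypothetical and are not taken from the trained networks in this study.

```python
def component_plane(weights, rows, cols, feature_index):
    """Grid of one feature's weight across all map nodes."""
    return [[weights[(r, c)][feature_index] for c in range(cols)]
            for r in range(rows)]


# Hypothetical 2 x 2 map with two features per node (values invented for illustration)
feature_names = ["seed weight", "spike length"]
weights = {
    (0, 0): [0.9, 0.2], (0, 1): [0.7, 0.3],
    (1, 0): [0.3, 0.8], (1, 1): [0.1, 0.9],
}

planes = {name: component_plane(weights, 2, 2, k)
          for k, name in enumerate(feature_names)}
for name, plane in planes.items():
    print(name, plane)
```

In this invented example the top row of the map has high seed weight and low spike length, so accessions mapped there would share those characteristics. This is the kind of reading that visual inspection of the planes supports.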

For the hierarchical cluster analysis, the results may be considered weak, since there is an unquestionable discrepancy between the DNA and HCA decisions (Table 8). In ten cases there is complete disagreement (Table 8). In six of these 10 cases the HCA method was not able to identify the origin of the accessions, and 4 accessions classified by the DNA test as Iranian were classified as from some other Middle East countries. In summary, neural networks could be considered a promising technique for developing support tools for the classification and identification of both origin and subspecies of wheat accessions, while HCA does not offer any substantial advantage over either BP-ANN or SOM networks.

Table 8. Performances of hierarchical cluster analysis (HCA) as percentage of correct origin and subspecies identifications using the 19 morphological features.

Accession origin              n    Some Middle East Countries   Iran   Unknown   Correct (%)
Some Middle East Countries    17   4                            10     3         23.5
Iran                          20   7                            8      5         40
Unknown                       18   4                            8      6         44.4

Accession subspecies   n    strangulata   tauschii   Correct (%)
strangulata            19   19            0          100
tauschii               36   7             29         80.5

Voisin et al. (2004) also found that a supervised artificial neural network was superior to cluster analysis for the identification of 5 closely related Corynebacterium species. Almeida et al. (1995) reported that a multilayer perceptron performed better than hierarchical cluster analysis in the discrimination of Mycobacterium tuberculosis on the basis of signature lipid biomarkers. Goodacre et al. (1998) found that supervised ANNs (multilayer perceptrons and radial basis function networks) were more effective than linear discriminant analysis and hierarchical cluster analysis for the identification of bacteria associated with urinary tract infections. A Bayesian network proved to be more successful than conventional multivariate statistical techniques (hierarchical cluster analysis, linear discriminant analysis and classification trees) for the identification of Ec. faecalis, Ec. faecium and S. thermophilus on the basis of simplified RAPD patterns (Moschetti et al., 2001).

In general, we found that although the two subspecies of Ae. tauschii were separated using principal component analysis and hierarchical cluster analysis, these methods could not group accessions well according to origin sites. The performances of both BP-ANN and SOM-ANN were better than those of both the PCA and HCA methods.

References

Aghaei, M.J., J. Mozafari., A. R. Taleei., M. R. Naghavi., M. Omidi. 2008. Distribution and diversity of Aegilops tauschii in Iran. Genet Resour Crop Evol., 55:341–349.

Almeida, J.S., A. Sonesson., D.B. Ringelberg., D.C. White. 1995. Application of artificial neural networks to the detection of Mycobacterium tuberculosis, its antibiotic resistance and prediction of pathogenicity amongst Mycobacterium spp. based on signature lipid biomarkers. Bin. Comput. Microbiol. 7, 159–166.

Basheer, I. A., and M. Hajmeer. 2000. Artificial neural networks: fundamentals, computing, design, and application. Journal of Microbiological Methods 43: 3-31.

Ceca, J.L.G.D., and J. Moro. 1997. Comparing neural networks and Multivariate discriminant analysis in the Selection of new crop varieties. Proceedings of the First European Conference for Information Technology in Agriculture. 15–18 June 1997, Copenhagen, Denmark

Dvorak, J., M.C. Luo., Z.L. Yang., and H.B. Zhang. 1998. The structure of the Aegilops tauschii genepool and the evolution of hexaploid wheat. Theor. Appl. Genet., 97: 657-670.

Eig, A.V., 1929. Monographisch-kritische Ubersicht der Gattung Aegilops. Verlag des Repertoriums, Dahlem bei Berlin.

Fernandez, C., E. Soria., J.D. Martyn., and A.J. Serrano. 2006. Neural networks for animal science applications: two case studies, Expert Syst. Appl. 31, 444–450.

Gill, K.S. and B.S. Gill, 1994. Mapping in the realm of polyploidy: The wheat model. BioEssays, 16:841-846.

Giraudel, J.L., and S. Lek. 2001. A comparison of self-organizing map algorithm and some conventional statistical methods for ecological community ordination, Ecol. Model. 146, 329–339.

Goodacre, R., E.M. Timmins., R. Burton., N. Kaderbhai., A.M. Woodward., D.B.P. Kell., and P. Rooney. 1998. Rapid identification of urinary tract infection bacteria using hyperspectral whole-organism fingerprinting and artificial neural networks. Microbiology 144, 1157–1170.

Hammer, K. 1981. Zur Taxonomie und nomenklatur der gattung Aegilops. Feddes Rept., 91: 225-258.

IBPGR, 1981. Revised descriptors for wheat. IBPGR.


Khazaei, J., M. R. Naghavi, M. R. Jahansouz, and G. Salimi-Khorshidi. 2008. Yield estimation and clustering of chickpea (Cicer arietinum L.) genotypes using soft computing techniques. Agronomy Journal, 100(4) in press.

Kihara, H. 1944. Discovery of the DD-analyser, one of the ancestors of Triticum vulgare (in Japanese). Agric. Hortic., 19:13-14.

Knaggs, P., M.J. Ambrose., S.M. Reader., and T.E. Miller. 2000. Morphological characterisation and evaluation of the subdivision of Aegilops tauschii Coss. Wheat Inform. Service, 91:15-19.

Lagudah, E.S. and G.M. Halloran. 1988. Phylogenetic relationships of Triticum tauschii the D genome donor to hexaploid wheat. 1. Variation in HMW subunits of glutenin and gliadins. Theor. Appl. Genet., 75:592-598.

Lee, K., D. Booth., and P. Alam. 2005. A comparison of supervised and unsupervised neural networks in predicting bankruptcy of Korean firms, Expert Syst. Appl. 29, 1–16.

Leflaive, J., R. Cereghino., M.Danger., G. Lacroix., L. Ten-Hage. 2005. Assessment of self organizing maps to analyze sole-carbon source utilization profiles. J. Microbiol. Methods, 62:89–102.

Legendre, P., and L. Legendre. 1998. Numerical Ecology. Elsevier, Amsterdam.

Lin, G.F., and C.M. Wang. 2006. Performing cluster analysis and discrimination analysis of hydrological factors in one step, Adv. Water Resour. 29, 1573–1585.

Lubbers, E.L., K.S. Gill., T.S. Cox., and B.S. Gill. 1991. Variation of molecular markers among geographically diverse accessions of Triticum tauschii. Genome, 34: 354-361.

Malaki, M., M. R. Naghavi., H. Alizadeh., P. Potki., M. Kazemi., S.M. Pirseyedi., M. Mardi, and Fakhre-Tabatabaei. Study of genetic variation in wild diploid wheat (Triticum boeoticum) from Iran using AFLP markers. Iranian Journal of Biotechnology, 4(4): 269-274.

Mallikarjuna S.B.P., H.D. Upadhyaya., P.V.K. Goudar., B.Y. Kullaiswamy., and S. Singh. 2003. Phenotypic variation for agronomic characteristics in a groundnut core collection for Asia. Field Crop Res., 84:359-371.

Manel, S., J.M. Dias., and S.J. Ormerod. 1999. Comparing discriminant analysis, neural networks and logistic regression for predicting species distributions: a case study with a Himalayan river bird. Ecological Modelling 120: 337-347.

Manjunatha, T., I.S. Bisht., K.V. Bhat., and B.P. Singh. 2007. Genetic diversity in barley (Hordeum vulgare L. ssp. vulgare) landraces from Uttaranchal Himalaya of India. Genetic Resources and Crop Evolution, 54:55–65

Marini, F., F. Balestrieri., R. Bucci., A.D. Magrì., A.L. Magrì., D. Marini. 2004. Supervised pattern recognition to authenticate Italian extra virgin olive oil varieties. Chemometrics and Intelligent Laboratory Systems 73, 85–93.

McFadden, E.S. and E.R. Sears. 1946. The origin of Triticum spelta and its free threshing hexaploid relatives. J. Hered., 37: 81-107.

Moschetti, G., G. Blaiotta., F. Villani., S. Coppola., E. Parente. 2001. A comparison of statistical methods for the identification of Streptococcus thermophilus, Enterococcus faecalis and Enterococcus faecium from RAPD-PCR patterns. Appl. Environ. Microbiol. 67, 2156–2166.

Nour, M.A. and G.R. Madey. 1996. Heuristic and optimization approaches to extending the Kohonen self organizing algorithm. Eur. J. Oper. Res. 93: 428-48.

Park, S.J., C.S. Hwang., and P.L.G. Vlek. 2005. Comparison of adaptive techniques to predict crop yield response under varying soil and land management conditions. Agric. Syst. 85:59–81.


Patnaik, D. and P. Khurana, 2001. Wheat biotechnology: A minireview. Plant Biotech., 4:

Peng, S. L., Q.F. Li., D. Li., Z.F. Wang., and D.P. Wang. 2003. Genetic Diversity of Pinus massoniana Revealed by RAPD Markers. Silvae Genetica 52: 60-63.

Pestsova, E., V. Korzun., N.P. Goncharov., K. Hammer., M.W. Ganal., and M.S. Roder. 2000. Microsatellite analysis of Aegilops tauschii germplasm. Theor. Appl. Genet., 101: 100-106.

Piraino, P., A. Ricciardi., G. Salzano., T. Zotta., E. Parente. 2006. Use of unsupervised and supervised artificial neural networks for the identification of lactic acid bacteria on the basis of SDS-PAGE patterns of whole cell proteins. J. Microbiol. Methods, 66:336–46.

Scholkopf, B., A. Smola., K.R. Muller. 1998. Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10, 1299–1319.

Siripatrawan, U. 2008. Self-organizing algorithm for classification of packaged fresh vegetable potentially contaminated with foodborne pathogens. Sensors and Actuators B 128, 435–441.

Tison, J., Y.S. Park., M. Coste., J.G. Wasson., L. Ector., F. Rimet., et al. 2005. Typology of diatom communities and the influence of hydro-ecoregions: a study on the French hydrosystem scale. Water Res., 39:3177–88.

Ultsch, A. and F. Roske. 2002. Self-organizing feature maps predicting sea levels. Inf. Sci. 144: 91-125.

Van Slageren, M.W. 1994. Wild Wheats: a monograph of Aegilops L. and Amblyopyrum (Jaub. & Spach) Eig (Poaceae). Wageningen Agricultural University pp: 94–7, Wageningen, the Netherlands.

Voisin, S., R. Terreux., F.N.R. Renaud., J. Freney., M. Domard., D. Deruaz. 2004. Pyrolysis patterns of 5 close Corynebacterium species analyzed by artificial neural networks. Antonie van Leeuwenhoek 85, 287–296.

Zaharieva, M., J. Prosperi., and P. Monneveux. 2004. Ecological distribution and species diversity of Aegilops L. genus in Bulgaria. Biodivers. Conserv., 13: 2319-2337.
