
PREDICTING CHEMICAL PARAMETERS OF WATER QUALITY FROM DIATOM ABUNDANCE IN LAKE PRESPA AND ITS TRIBUTARIES

Modelling the relationship between diatom abundances and physico-chemical parameters in Lake Prespa

Andreja Naumoski1, Dragi Kocev2, Nataša Atanasova3, Kosta Mitreski1, Svetislav Krstić4, Sašo Džeroski2

1 Faculty of Electrical Engineering and Information Technology, Skopje, Macedonia

[email protected], [email protected]

2 Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia

[email protected], [email protected]

3 Inst. of Sanitary Engineering, Faculty of Civil Engineering, University of Ljubljana, Slovenia

[email protected]

4 Institute of Biology, Faculty of Natural Sciences and Mathematics, Skopje, Macedonia, [email protected]

Abstract: In this work, we address the problem of predicting the physico-chemical parameters of water quality from bioindicator data (diatom species abundance data). The chemical state of the water (or water class) is defined by the values of the measured physico-chemical parameters. The traditional approach to modelling such data is to learn a separate model for each parameter and then derive a global overview by summarizing over the multiple models. Another approach is to learn a single model that describes all parameters (the multi-target approach). We explore both approaches and apply them to data from Lake Prespa and its tributary rivers. The obtained models reveal interesting connections between the diatom species and the water quality (i.e., the values of the chemical parameters). Some of the models express existing ecological knowledge about the diatoms, while others reveal new knowledge about the lake. The results encourage further investigation in several directions using multi-target data mining techniques, such as reconstructing aspects of the ecosystem's past history, investigating the impact of climate change, or reconstructing the environmental patterns for certain chemical parameters.

Keywords: diatoms, Lake Prespa, decision trees, bioindicator, physico-chemical

1. Introduction

High population densities and the multiplicity of industrial and agricultural activities expose most hydrographical basins close to urban centres to heavy and rising environmental impacts. The usual approaches to water quality evaluation fall into two main categories: one based on physical and chemical methods, and another based on the evaluation of biological communities (Salomoni et al., 2006). Physical and chemical monitoring reflects only instantaneous measurements, restricting the knowledge of water conditions to the moment when the measurements were performed. Biotic parameters, on the other hand, provide a better evaluation of environmental changes, because community development integrates over a period of time and reflects conditions that might no longer be present at the time of sampling and analysis.

Typical biological indicators are groups of organisms that are influenced by the physico-chemical parameters of the environment (as specified in the guidelines of the water quality standard), such as diatoms (a specific group of algae) (McCormick and Cairns, 1994; Lowe and Pan, 1996; Krstic et al., 2007). Diatoms have many of the properties required of organisms that are used as bioindicators of environmental change in aquatic ecosystems (Gold et al., 2002). Because of this, they are widely studied to discover relationships between diatoms and the ecological state of ecosystems. Mainly, they are used to search for a correlation between their abundance and eutrophication levels, as individual species of diatoms are sensitive to changes in nutrient concentrations and supply rates (Tilman, 1977; Tilman et al., 1982; Hall and Smol, 1992; Reavie et al., 1995; Fritz et al., 1993; Bennion, 1994, 1995; Bennion et al., 1996). There are also studies that connect the presence of diatoms with the presence of metals in the ecosystem (Gold et al., 2002).

The Prespa Park (a transboundary park bordering Macedonia, Albania and Greece) is well known for its great biodiversity, natural beauty and populations of rare water birds. However, the ecological integrity of the region is threatened by the increasing exploitation of the natural resources (inappropriate water management, forest destruction leading to erosion, overgrazing), inappropriate land use practices, ecologically unsound irrigation practices, water and soil contamination from uncontrolled use of pesticides, lake siltation and uncontrolled urban development. Monitoring of the state of Lake Prespa is necessary to prevent major catastrophes in the Prespa ecosystem.

An attempt at intensive monitoring was made within the EU-funded project TRABOREMA. The monitoring comprised measurements that reflect the physical, chemical and biological aspects of the lake ecosystem (Levkov et al., 2006; Krstic and Levkov, 2007). These measurements comprise relative abundances of 116 diatom species, and parameters such as temperature, pH, total phosphorus, nitrogen, etc. (see Table 1). The monitoring produced a large amount of data about the lake ecosystem but, unfortunately, it was limited to the duration of the project. Constant monitoring would be needed to keep the system in a good ecological state. As this is not the case for Lake Prespa, we focus our research on constructing models (using machine learning techniques) that reveal the relationships between the abundance of diatoms and the physico-chemical parameters, and consequently the water quality and the overall health of the ecosystem. These models can support decision making and make the management of the lake easier.

In this area, classical statistical modelling approaches, such as PCA and CCA (Stoermer and Smol, 2004), are most commonly applied. Although these techniques provide very useful insights into the data, they are limited in terms of interpretability. Decision trees (CART), on the other hand, are very easy to interpret, fast to induce and non-parametric. They are one of the most widely used machine learning techniques.

Bearing in mind that there are multiple physico-chemical parameters (in statistical terminology, multiple responses or multiple dependent variables), we investigate two scenarios: (1) induce a regression tree (RT) (Breiman et al., 1984) for each physico-chemical parameter separately and (2) induce a multi-target regression tree (MTRT) (Blockeel et al., 1998; Struyf and Dzeroski, 2006) to predict multiple parameters simultaneously. The advantages of the latter approach are that the obtained MTRT is smaller than the sum of the RTs for the individual physico-chemical parameters and that the MTRT is able to capture and make explicit the dependencies between the observed physico-chemical parameters (Struyf and Dzeroski, 2006).

The remainder of this paper is organized as follows. In Section 2, we describe the machine learning methodology that was used (regression trees and multi-target regression trees). Section 3 describes the data and Section 4 explains the experimental design that was employed to analyse the data at hand. In Section 5, we present the obtained water quality models and discuss them. Section 6 gives the main conclusions.

2. Methodology

2.1. Regression Trees

Regression trees are decision trees that are capable of predicting the value of a numeric target variable (Breiman et al. 1984). They are hierarchical structures, where the internal nodes contain tests on the input attributes. Each branch of an internal test corresponds to an outcome of the test, and the predictions for the values of the target attribute are stored in the leaves. Regression tree leaves contain constant values as predictions for the target variable (they represent piece-wise constant functions).

To obtain the prediction of a regression tree for a new data record, the record is sorted down the tree, starting from the root (the top-most node of the tree). For each internal node that is encountered on the path, the test that is stored in the node is applied, and depending on the outcome of the test, the path continues along the corresponding branch (to the corresponding subtree). The procedure is repeated until we end up in a leaf. The resulting prediction of the tree is taken from this leaf.
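The prediction procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the representation used by any particular system: the `Node` and `Leaf` classes, the "greater than threshold" form of the test, and the toy tree (its thresholds and leaf values) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    prediction: float  # constant value predicted for every record reaching this leaf


@dataclass
class Node:
    attribute: str               # input attribute tested in this internal node
    threshold: float             # assumed test form: record[attribute] > threshold
    yes: Union["Node", Leaf]     # subtree followed when the test succeeds
    no: Union["Node", Leaf]      # subtree followed when the test fails


def predict(tree: Union[Node, Leaf], record: Dict[str, float]) -> float:
    """Sort the record down the tree from the root and return the leaf value."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if record[node.attribute] > node.threshold else node.no
    return node.prediction


# Hypothetical toy tree: thresholds and leaf values are invented for illustration.
toy_tree = Node("COCE", 5.0,
                yes=Node("DMAU", 2.0, yes=Leaf(12.0), no=Leaf(20.0)),
                no=Leaf(15.0))
```

For a record with a COCE abundance of 8.0 and a DMAU abundance of 0.0, the first test succeeds and the second fails, so `predict` returns the value stored in the corresponding leaf of the toy tree.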

The tests in the internal nodes can have more than two outcomes (this is usually the case when the test is on a discrete-valued attribute, where a separate branch/subtree is created for each value). Typically, each test has two outcomes: the test has succeeded or the test has failed. The trees in this case are called binary trees.

2.2. Multi-Target Regression Trees

Multi-target regression trees are an instantiation of predictive clustering trees (PCTs) (Blockeel et al. 1998), where a tree is viewed as a hierarchy of clusters. The top-node of a PCT corresponds to a cluster that contains all the data. This cluster is then recursively partitioned into smaller clusters while moving down the tree. The leaves represent the clusters at the lowest level of the hierarchy and each leaf is labeled with its prototype.

Multi-target regression trees (Blockeel et al. 1998; Struyf and Džeroski 2006) are a generalization of regression trees, because they can predict the values of several numeric target attributes simultaneously. Instead of storing a single numeric value, the leaves of a multi-target regression tree store a vector. Each component of this vector is a prediction for one of the target attributes. Examples of multi-target regression trees can be found in Section 5.

A multi-target regression tree (of which a regression tree is a special case) is usually constructed by a recursive partitioning algorithm from a training set of records. The algorithm is known as TDIDT (top-down induction of decision trees). The records include measured values of the descriptive and the target attributes. The tests in the internal nodes of the tree refer to the descriptive attributes, while the predicted values in the leaves refer to the target attributes.

The TDIDT algorithm starts by selecting a test for the root node. Based on this test, the training set is partitioned into subsets according to the test outcome. In the case of binary trees, the training set is split into two subsets: one containing the records for which the test succeeds (typically the left subtree) and the other containing the records for which the test fails (typically the right subtree). This procedure is recursively repeated to construct the subtrees.

The partitioning process stops if a stopping criterion is satisfied (e.g., the number of records in the induced subsets is smaller than some predefined value; the depth/size of the tree exceeds some predefined value etc). In that case, the prediction vector is calculated and stored in a leaf. The components of the prediction vector are the mean values of the target attributes calculated over the records that are sorted into the leaf.
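The recursive partitioning described above can be sketched as follows. This is a minimal illustration, not the CLUS implementation: the dictionary-based tree, the function names, the threshold tests of the form "value > threshold", and the stopping parameters `max_depth` and `min_leaf` are assumptions made for the example, and the sketch omits pruning and the standardization of the targets.

```python
from statistics import mean, pvariance


def variation(rows):
    """N times the summed per-target variance of the target vectors in rows."""
    targets = zip(*(y for _, y in rows))
    return len(rows) * sum(pvariance(t) for t in targets)


def prototype(rows):
    """Prediction vector of a leaf: the mean of each target attribute."""
    return [mean(t) for t in zip(*(y for _, y in rows))]


def build(rows, depth=0, max_depth=4, min_leaf=2):
    """rows is a list of (descriptive_values, target_values) pairs."""
    # Stopping criteria: depth limit reached or too few records to split.
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return {"leaf": prototype(rows)}
    best = None
    for a in range(len(rows[0][0])):
        # Candidate thresholds: every observed value except the largest.
        for thr in sorted({x[a] for x, _ in rows})[:-1]:
            yes = [r for r in rows if r[0][a] > thr]
            no = [r for r in rows if r[0][a] <= thr]
            if len(yes) < min_leaf or len(no) < min_leaf:
                continue
            score = variation(yes) + variation(no)
            if best is None or score < best[0]:
                best = (score, a, thr, yes, no)
    # No admissible test, or no test that reduces the variation: make a leaf.
    if best is None or best[0] >= variation(rows):
        return {"leaf": prototype(rows)}
    _, a, thr, yes, no = best
    return {"attr": a, "thr": thr,
            "yes": build(yes, depth + 1, max_depth, min_leaf),
            "no": build(no, depth + 1, max_depth, min_leaf)}
```

Each leaf stores the vector of target means over the records sorted into it, which is the prediction vector described in the text.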

One of the most important steps in the tree induction algorithm is the test selection procedure. For each node, a test is selected by using a heuristic function computed on the training data. The goal of the heuristic is to guide the algorithm towards small trees with good predictive performance. The multi-target regression trees are implemented in the system CLUS (Blockeel and Struyf 2002), available at http://www.cs.kuleuven.be/~dtai/clus/. The heuristic used in this algorithm for selecting the attribute tests in the internal nodes is the intra-cluster variation summed over the subsets induced by the test. Intra-cluster variation is defined as

$$N \cdot \sum_{t=1}^{T} \mathrm{Var}[y_t]$$

with N the number of examples in the cluster, T the number of target variables, and Var[y_t] the variance of target variable y_t in the cluster. Lower intra-subset variance results in more accurate predictions. The variance function is standardized so that the relative contribution of the different targets to the heuristic score is equal.
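As a rough illustration of the standardized heuristic, the sketch below takes the intra-cluster variation as N times the sum of the per-target variances and standardizes each target by dividing its variance by the variance of the same target over the full training set. Both the function names and this particular standardization scheme are assumptions made for the example; the text above only states that the variance function is standardized.

```python
from statistics import pvariance


def standardized_variation(cluster, reference):
    """Intra-cluster variation with each target scaled by its reference variance.

    cluster and reference are lists of target vectors; reference plays the
    role of the full training set (an assumption of this sketch).
    """
    total = 0.0
    for t in range(len(reference[0])):
        ref_var = pvariance([y[t] for y in reference])
        var = pvariance([y[t] for y in cluster])
        # A target that is constant in the reference set contributes nothing.
        total += var / ref_var if ref_var > 0 else 0.0
    return len(cluster) * total


def split_score(yes, no, reference):
    """Heuristic value of a test: variation summed over the induced subsets."""
    return (standardized_variation(yes, reference)
            + standardized_variation(no, reference))


# Hypothetical targets on very different scales (e.g. mg/dm3 versus ug/dm3).
training = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
score_before = standardized_variation(training, training)
score_after = split_score(training[2:], training[:2], training)
```

Because both toy targets are rescaled by their own variance, they contribute equally to the score even though their raw values differ by two orders of magnitude; a test is preferred when it lowers the score relative to the unsplit cluster.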

3. Data description

Lake Prespa is located at the border intersection of Macedonia, Albania and Greece (see Figure 1). It covers an area of 301 km2 at 850 m above sea level. The whole region that surrounds the lake was recently proclaimed a transboundary park (Prespa Park). The Prespa Park is well known for its great biodiversity, natural beauty and populations of rare water birds. However, the ecological integrity of the region is threatened by the increasing exploitation of the natural resources (inappropriate water management, forest destruction leading to erosion, overgrazing), inappropriate land-use practices, ecologically unsound irrigation practices, water and soil contamination from uncontrolled use of pesticides, lake siltation and uncontrolled urban development. Monitoring of the state of Lake Prespa is necessary to prevent major catastrophes in the Prespa ecosystem (Krstić 2006).

Monitoring of the state of Lake Prespa was performed during the EU project TRABOREMA. The measurements cover a period of one and a half years (from March 2005 to September 2006). Samples for analysis were taken from the surface water of the lake at 14 locations. The lake sampling locations are distributed across the three countries (see Figure 1) as follows: 8 in Macedonia, 3 in Albania and 3 in Greece. The selected sampling locations are representative for determining the eutrophication impact (Krstić 2005).

Fig. 1. Position of Lake Prespa (left) and the sampling locations (right)

From the lake measurements, a total of 218 water samples were collected. Both physico-chemical and biological analyses were performed on these water samples. The physico-chemical properties of the samples provided the environmental variables for the habitat models, while the biological samples provided information on the relative abundance of the studied diatoms. The following physico-chemical properties of the water samples were measured: temperature, dissolved oxygen, Secchi depth, conductivity, acidity/alkalinity (pH), nitrogen compounds (NO2, NO3, NH4, inorganic nitrogen), sulphate ions (SO4), and sodium (Na), potassium (K), magnesium (Mg), copper (Cu), manganese (Mn) and zinc (Zn). The basic statistics for these variables are given in Table 1.

The biological variables were the relative abundances of 116 different diatom taxa (for a complete list of diatom names and acronyms see Table A1 in the Appendix). Diatom cells were collected with a planktonic net or as attached growth on submerged objects (plants, rocks or sand and mud). This is the usual approach in studies for environmental monitoring and screening of diatom abundance. The sample, afterwards, is preserved and the cell content is cleaned. The sample is examined with a microscope, and the diatom taxa and abundance in the samples are obtained by counting 200 cells per sample. The specific taxon abundance is then given as the percent of the total diatom count per sampling site (Levkov et al. 2006).
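The conversion from cell counts to relative abundances described above amounts to a simple percentage calculation, sketched below. The function name and the example counts are hypothetical; only the rule (abundance as a percentage of the total diatom count, with 200 cells counted per sample) comes from the text.

```python
def relative_abundance(counts):
    """Convert per-taxon cell counts into percentages of the total count."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no counted cells")
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Hypothetical sample: 200 counted cells spread over three taxa.
example_counts = {"COCE": 90, "DMAU": 60, "CPLA": 50}
example_abundance = relative_abundance(example_counts)
```

With these invented counts, COCE accounts for 45% of the sample, DMAU for 30% and CPLA for 25%.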

Table 1. Basic statistics for the physico-chemical parameters

Parameter               Minimum   Maximum   Mean value   Standard deviation
Temperature (°C)           2.90     26.80        15.56                 6.61
Saturated O2 (mg/dm3)      6.60    114.19        83.07                18.76
Secchi depth (m)           1.80      5.40         3.09                 0.71
Conductivity (μS/cm)     142.50    318.00       196.23                27.84
pH                         5.50      9.27         8.17                 0.64
NO2 (mg/dm3)               0.00      0.44         0.03                 0.05
NO3 (mg/dm3)               0.00     13.40         2.07                 2.13
NH4 (mg/dm3)               0.01      1.07         0.29                 0.18
Total N (mg/dm3)           0.32      9.21         2.53                 1.28
Organic N (mg/dm3)         0.02      8.41         1.83                 1.10
SO4 (mg/dm3)               2.68    266.10        29.47                22.98
Total P (μg/dm3)           1.15     83.13        18.63                15.31
Na (mg/dm3)                0.75     13.15         4.36                 2.10
K (mg/dm3)                 0.23      4.80         1.50                 0.66
Mg (μg/dm3)                1.11     19.45         5.70                 2.84
Cu (μg/dm3)                1.04     23.30         3.97                 2.79
Mn (μg/dm3)                0.88    230.00         7.88                16.79
Zn (μg/dm3)                0.27     22.70         5.23                 4.42

Of the 10 most dominant diatoms in Lake Prespa (shown in bold in Table A1), Cymbopleura juriljii (CJUR) and Navicula prespanensis (NPRE) are newly described taxa with no record of their ecological preferences in the literature. According to the latest diatom ecology publications (van Dam et al., 1994) and databases (European Diatom Database - http://craticula.ncl.ac.uk/Eddi/jsp/index.jsp), Amphora pediculus (APED) is a eutrophic taxon tolerant of elevated N concentrations, Cavinula scutelloides (CSCU) is a eutrophic taxon and an alkalibiont, Cocconeis placentula (CPLA) is a eutrophic taxon with medium oxygen demand, and Cyclotella ocellata (COCE) is a meso- to eutrophic taxon. Diploneis mauleri (DMAU), Navicula rotunda (NROT) and Navicula subrotundata (NSROT) have no records, while Staurosirella pinnata (STPNN) is a hyper-eutrophic (oligo-eutrophic; indifferent) taxon frequently found in moist habitats.

4. Experimental Design

The goal of the experimental setup is to build models for (predicting) the physico-chemical parameters using the diatom abundances. Specifically, we define the following experimental scenarios:

· Inducing models for each measured physico-chemical parameter (single-target regression tree)

· Inducing models for groups of physico-chemical parameters (multi-target regression tree)

For the second experimental scenario, we induce 3 (three) multi-target regression trees (MTRTs). The first MTRT predicts all measured physico-chemical parameters, the second MTRT predicts the parameters that indicate the eutrophication level (phosphorus, nitrogen, Secchi depth) and the third MTRT predicts the metal contamination (Na, K, Mg, Cu, Mn and Zn).

To prevent over-fitting of the models to the training data, we employed ‘F-test pruning’. This pruning method applies the statistical F-test (Lomax 2007) to check whether a given split reduces the variance significantly at a given significance level. The significance level is a user-defined parameter. We employ internal 10-fold cross-validation to select an optimal value for this parameter from the following set of values: 0.05, 0.1, 0.125, 0.25, 0.5, 0.75, and 1.0. In addition, to obtain even smaller trees, we constrain the trees to a depth of at most 4 levels.
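To make the idea behind the F-test concrete, the sketch below computes a one-way ANOVA F statistic for a binary split of a single target: the reduction in the sum of squares achieved by the split, divided by the residual mean square of the two subsets. This is a textbook formulation used here as an assumption; the exact statistic used by CLUS may differ, and the function names are hypothetical. Turning the statistic into a p-value requires the F distribution with 1 and N - 2 degrees of freedom, which is not in the Python standard library and is therefore left out.

```python
def sum_squares(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def split_f_statistic(left, right):
    """F statistic for splitting the pooled target values into left and right.

    Assumes a single numeric target and more than two records in total.
    """
    n = len(left) + len(right)
    ss_within = sum_squares(left) + sum_squares(right)
    ss_between = sum_squares(left + right) - ss_within
    if ss_within == 0:
        return float("inf")  # the split separates the values perfectly
    return ss_between / (ss_within / (n - 2))


# Hypothetical target values on the two sides of a candidate split.
f_clear = split_f_statistic([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
f_weak = split_f_statistic([1.0, 5.0, 9.0], [2.0, 5.0, 8.0])
```

A split is kept when the p-value derived from its statistic falls below the chosen significance level. A level of 1.0 therefore keeps every split, while 0.05 keeps only splits with strong statistical support and yields smaller trees.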

5. Results

In line with the experimental setup we induced 3 multi-target regression trees and 18 single-target regression trees. In this section, we present only the multi-target trees, while the single-target trees are presented in the Appendix.

5.1. Multi-target regression trees for the physico-chemical parameters

In this subsection, we present, describe and discuss the multi-target regression trees that were induced from the data. Figure 2 presents the model for all physico-chemical parameters, Figure 3 the model for the eutrophication parameters and Figure 4 the model for the metals.

Figure 2. Multi-target regression tree for all physico-chemical parameters

The MTRT in Figure 2 has 6 leaves (its 11 nodes, see Table 3, comprise 5 internal nodes and 6 leaves). Each leaf contains predictions for all 18 measured physico-chemical parameters. The predictions are based on the diatom abundances. The internal nodes contain tests on the most abundant diatom species from Lake Prespa (see Table A1 in the Appendix). The most important diatom species appears to be COCE (Cyclotella ocellata), followed by DMAU (Diploneis mauleri), CJUR (Cyclotella juriljii), AMSS (Achnanthes minutissima) and CPLA (Cocconeis placentula).

Let us examine the most obvious relationships revealed by this model. First, note the variation of the temperature. The highest temperature is encountered in the fifth leaf (counting from left to right), where the COCE diatom is present, while DMAU, AMSS, and CPLA are present with low abundances or not present at all. Second, we notice the low saturation of oxygen in the third leaf. This saturation can be indicated by the presence of COCE, low or no presence of DMAU and higher abundances of AMSS. One can look for further descriptions of the relationships in this system, but must be aware of the complexity of the problem: in this case the focus must be on the entire physico-chemical reality, not just a specific parameter. For instance, the second leaf presents a situation where the concentration of NO3 is higher than in the other leaves and the concentration of Zn is lower than in the other leaves.

The NO2 and NO3 concentrations in the lake could be indicated by a combination of diatoms: the model shows that high abundances of COCE and DMAU, together with a low abundance of CJUR, are suitable for nitrogen monitoring. High concentrations of ammonium and organic nitrogen provide a suitable environment in which AMSS and COCE can exist in high abundance. In combination with the CPLA diatom, these diatoms could indicate lower concentrations of total nitrogen.

The MTRT can also be used to indicate the metal content of the lake ecosystem, not only for a single metal but also for a combination of several metals. Low concentrations of K, Cu, Mn and Zn provide a suitable environment for the DMAU and COCE diatoms, with a low abundance of CJUR. A high abundance of CPLA corresponds to high concentrations of all metals in the dataset except Mg and Mn.

‘Macrophytes and phytobenthos’ is one of the ecosystem components (termed ‘biological quality elements’ in the EU Water Framework Directive, WFD) that are required to be analyzed in assessments of ecological status in freshwaters. However, considering all phototrophic organisms simultaneously is problematic because of the wide range of spatial scales and life-histories encompassed within this term. Diatoms have been widely used to support decision-making in freshwater management over the past two decades, with several indices and transfer functions used to provide information on acidification (Battarbee et al., 1999), nutrients (Kelly & Whitton, 1995; Rott et al., 1999; Potapova et al., 2004) and ‘general water quality’ (Descy, 1979; Coste in CEMAGREF, 1982). These approaches, whilst good at determining the intensity of particular types of pollution, are not suitable for assessing ecological status, as required for the WFD, as they do not compare the observed state of a water body with that expected in the absence of anthropogenic disturbance (Kelly et al., 2007).

On the other hand, using diatoms as biomonitoring organisms for detecting the ecological status encounters several obstacles (Krstic et al., 2007), related to: i) taxonomical problems - diatom taxonomy still suffers from misidentification and from ‘fitting the species into known species’ patterns; ii) lack of long-term environmental data - to be linked with diatom taxa distribution and abundance, and also the problem of genetic plasticity or evolutionary adaptation patterns to a changing environment; and iii) the gap between the autecology and taxonomy of taxa - full long-term studies on diatom taxonomy and ecology are still lacking, especially when new taxa are described from habitats worldwide. The last obstacle is very clear for Lake Prespa as well, since 70 new taxa were recently described (Levkov et al., 2006) and only one publication (Levkov et al., 2007) has addressed the ecology of benthic diatoms.

Figure 3. Multi-target regression tree for the eutrophication parameters

Figure 3 presents a model for the eutrophication parameters. It has 6 leaves, in which 3 parameters (Secchi depth, nitrogen and phosphorus) are modelled. All diatoms in the model, including Navicula prespanensis (NPRE), for which this is the first record of this kind, represent a community typical of elevated eutrophication parameters in the ecosystem. The most eutrophic state of the lake is found when the COCE and UULN diatoms are present: the highest phosphorus concentration, the lowest Secchi depth and high nitrogen concentrations. Similar situations are described in the third and last leaves (counting from left to right). The remaining 3 leaves (the second, fourth and fifth) describe a mesotrophic state of the lake, with lower phosphorus and nitrogen concentrations and higher Secchi depth. Levkov et al. (2007) stated: “according to the CCA results, physico-chemical variables are the main factors controlling diatom ecology in Lake Prespa. The main environmental factors controlling species distribution and abundances were ammonium, metallic cations (Cu, Mn, K) and, secondarily, dissolved oxygen and Secchi depth. Except for NH4 concentration, variables related to water trophic status ([P], [N], [Organic Nitrogen]) seem to have a negligible effect on benthic diatom abundances in Prespa Lake”. Compared with these results, the obtained multi-target regression tree (Fig. 3) offers improved insight into the distribution of the related diatom species in relation to the environmental parameters of the ecosystem.

Figure 4 presents a model relating the abundance of selected diatom taxa to the metal content of the water samples. There are 7 leaves that describe specific metal concentrations that can be encountered in the lake. Generally, when the AVEN diatom has low abundance or is absent, and ACCLB and CELL are abundant (second leaf from left to right), the lowest concentrations of the metals are found. The AVEN diatom can be used as an indicator of high contamination (especially for copper: 13.13 μg/dm3 and zinc: 9.24 μg/dm3), except for manganese (for which the lowest concentration, 1.42 μg/dm3, is encountered).

Figure 4. Multi-target regression tree for the metals

The effects of increased levels of heavy metals on epilithic algal communities have been studied under laboratory and natural conditions (Levkov and Krstic, 2002). In natural communities, effects are investigated at chronic (long-term) exposures, which is “more realistic to what algae will experience in nature” (Genter, 1996). These communities react more completely than filamentous algae or macrophytes (Ivorra et al., 1999). On the other hand, there are very few literature data concerning the accumulation of heavy metals in natural diatom communities (Levkov, 2001). Ivorra (2000) shows that the heavy metal content of algal communities differs greatly between unpolluted and polluted sites, and is mainly linearly correlated with the concentration of the corresponding heavy metal in the ambient water (Absil and van Scheppingen, 1996). This difference is due to the different species composition of the diatom communities (Admiraal et al., 1997). Some diatom species have developed tolerance mechanisms against the cytotoxic effects of heavy metals (Torres et al., 1995, 1997), reducing heavy metal toxicity by producing intracellular and extracellular binding components (Ahner et al., 1995; Ahner and Morel, 1995). There are no literature data on the sensitivity or tolerance of diatom taxa to ambient metal concentrations, or on the response of diatom communities to elevated metal impacts.

For Lake Prespa, Levkov et al. (2007) found that “strong co-linearity exists within some variable sets (e.g. Cu-K-Mn, Mg-SO4-O2). Concentration of Cu was significantly correlated to the relative abundance of 10 of the 17 diatom taxa analyzed, and also affected species diversity (Shannon’s index, Pearson’s R = 0.2, p = 0.01)”. Developing the presented regression tree (Fig. 4) for Lake Prespa is therefore important for determining the hierarchical community response to metal concentrations and for establishing a corresponding monitoring system.

One can also look at the single-target regression trees where each of the physico-chemical parameters is predicted separately. With this approach, we obtain 18 different regression trees (see figure in the Appendix) and they can be interpreted separately. However, these trees are not able to describe the complex physico-chemical situation in the water. The single-target regression trees contain information for only one parameter at a time, while discarding the information for the other parameters.

5.2. Performance of the models

For each of the learned models, we estimate its predictive performance on both the training data and on unseen data (by 10-fold cross-validation). We use two metrics to evaluate the performance: correlation coefficient and root mean squared error (RMSE). In addition, we inspect the selected models in detail and interpret the knowledge contained therein, as described in the previous subsection.
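The two performance metrics can be computed as in the sketch below. The correlation coefficient is taken here to be Pearson's correlation between the measured and the predicted values, which is an assumption, since the text does not name the variant; the function names and the example numbers are hypothetical.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between measured and predicted values."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def correlation(actual, predicted):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        return 0.0  # undefined for constant values; reported as 0 in this sketch
    return cov / sqrt(var_a * var_p)


# Hypothetical measured values and predictions for one parameter.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
example_cc = correlation(measured, predicted)
example_rmse = rmse(measured, predicted)
```

The invented example shows why both metrics are reported: the predictions are perfectly correlated with the measurements, yet the RMSE is far from zero because every prediction is twice the measured value.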

The performance figures for the induced models are listed in Tables 2 and 3. Each of these tables presents the selected significance level for the F-test pruning, the performance (correlation coefficient and RMSE) and the size (total number of nodes, including leaves and internal nodes) of the produced tree. Table 2 presents the performance of the regression trees and Table 3 of the multi-target regression tree.

A quick inspection of the results shows that the prediction problem is very difficult: even on the training data, the performance is low. In order to investigate how much we can improve the predictive performance, we employed ensembles (bagging and random forests) of both regression trees (Breiman 1996, 2001) and multi-target regression trees (Kocev et al. 2007). It is well known that ensemble methods perform better than individual trees and are amongst the top performing methods for predictive modelling (Caruana and Niculescu-Mizil 2006). The results are presented in Tables A.2 and A.3 in the Appendix.

The ensemble models have better predictive performance overall. The best correlation coefficient (on unseen data) is 0.60 (bagging and random forests of regression trees), as compared to 0.38 for the regression trees (the tree for the temperature parameter) and 0.36 for the multi-target regression tree for all parameters (for the total nitrogen parameter). The selected significance levels for the F-test pruning (the F-values in Tables 2 and 3) range from 0.05 to 0.5 for the single-target trees, with 0.05 selected for most parameters, while 0.1 was selected for the multi-target tree for the metals. For the SD parameter, the selected level is 0.1 for the single-target regression tree and 0.05 for the MTRT.

Table 2. Performance of the single-target regression trees - STRT (CC: correlation coefficient; Train: training data; Xval: 10-fold cross-validation; Size: total number of nodes)

Parameter   F-value   CC train   CC xval   RMSE train   RMSE xval   Size
Temp           0.05       0.74      0.38         4.44        6.48     25
SatO           0.05       0.66      0.37        14.11       17.96     19
SD             0.1        0.64      0.14         0.54        0.76     25
Conduc         0.25       0.67      0.32        20.71       28.37     29
pH             0.5        0.61      0.07         0.50        0.71     23
NO2            0.05       0.65      0.18         0.03        0.05     15
NO3            0.05       0.65      0.27         1.62        2.24     13
NH4            0.05       0.52      0.07         0.15        0.19     11
TotalN         0.05       0.68      0.29         0.93        1.33     21
OrgN           0.5        0.58      0.21         0.90        1.14     25
SO4            0.5        0.67     -0.01        17.10       30.27     13
TotalP         0.05       0.62     -0.01        12.03       18.40     17
Na             0.05       0.56      0.12         1.73        2.25     23
K              0.125      0.53      0.17         0.56        0.71     17
Mg             0.1        0.69      0.22         2.04        3.05     29
Cu             0.05       0.59      0.02         2.25        3.27     15
Mn             0.05       0.28      0.10        16.09       16.75      9
Zn             0.05       0.57      0.15         3.62        4.75     19

Table 3. Performance of the multi-target regression trees - MTRT (CC: correlation coefficient; Train: training data; Xval: 10-fold cross-validation; the F-value and the size apply to each tree as a whole)

MTRT for all physico-chemical parameters (F-value 0.05, size 11)

Parameter   CC train   CC xval   RMSE train   RMSE xval
Temp            0.43      0.18         5.96        6.82
SatO            0.46      0.15        16.63       19.08
SD              0.25      0.12         0.68        0.71
Conduc          0.39      0.27        25.55       26.99
pH              0.36     -0.03         0.59        0.69
NO2             0.46      0.09         0.04        0.05
NO3             0.60      0.21         1.71        2.22
NH4             0.18      0.03         0.17        0.18
TotalN          0.48      0.36         1.12        1.19
OrgN            0.31      0.22         1.05        1.08
SO4             0.09      0.01        22.83       23.09
TotalP          0.23      0.07        14.85       15.50
Na              0.30      0.17         2.00        2.08
K               0.37      0.01         0.61        0.72
Mg              0.21      0.09         2.77        2.87
Cu              0.38     -0.03         2.58        3.32
Mn              0.16      0.05        16.52       16.88
Zn              0.27      0.02         4.25        4.63

MTRT for the eutrophication parameters (F-value 0.05, size 11)

Parameter   CC train   CC xval   RMSE train   RMSE xval
SD              0.36      0.10         0.66        0.72
TotalN          0.51      0.36         1.09        1.21
TotalP          0.36      0.06        14.22       17.47

MTRT for the metals (F-value 0.1, size 17)

Parameter   CC train   CC xval   RMSE train   RMSE xval
Na              0.39      0.10         1.93        2.17
K               0.46      0.09         0.59        0.71
Mg              0.47      0.24         2.50        2.81
Cu              0.49      0.01         2.42        3.15
Mn              0.19      0.02        16.43       16.91
Zn              0.33      0.03         4.16        4.65

We can also compare the regression trees and the multi-target regression tree by their size (total number of internal nodes and leafs). The size of the multi-target regression tree is 11, while the size of the regression trees ranges from 9 (for Mn parameter) to 29 (for Conductivity and Mg). The total size of all single-target trees is much larger that the size of the multi-target tree, if we learn a regression tree for each of the 18 physical-chemical parameters. The size of the trees obtain from the eutrophication parameters (SD, Total Nitrogen, Total Phosphorus), is quite different, for regression trees we have total of 63 leafs, but for multi-target trees we have total of 33 leafs all together. The metal set multi target regression trees all together have smaller size than the single target regression trees for each parameter.
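The size comparison can be reproduced directly from the sizes reported in Tables 2 and 3, as in the short sketch below. The variable and function names are hypothetical, and the leaf count assumes that every internal node has exactly two children, as in the binary trees described in Section 2.

```python
# Sizes (total number of nodes) of the single-target trees, from Table 2.
single_target_sizes = {
    "Temp": 25, "SatO": 19, "SD": 25, "Conduc": 29, "pH": 23, "NO2": 15,
    "NO3": 13, "NH4": 11, "TotalN": 21, "OrgN": 25, "SO4": 13, "TotalP": 17,
    "Na": 23, "K": 17, "Mg": 29, "Cu": 15, "Mn": 9, "Zn": 19,
}
# Sizes of the three multi-target trees, from Table 3.
multi_target_sizes = {"all": 11, "eutrophication": 11, "metals": 17}

eutrophication = ["SD", "TotalN", "TotalP"]
metals = ["Na", "K", "Mg", "Cu", "Mn", "Zn"]

total_all = sum(single_target_sizes.values())                          # 348
total_eutrophication = sum(single_target_sizes[p] for p in eutrophication)  # 63
total_metals = sum(single_target_sizes[p] for p in metals)             # 112


def leaves_in_binary_tree(nodes):
    """Number of leaves in a binary tree whose internal nodes all have two children."""
    return (nodes + 1) // 2
```

Under these assumptions, the 18 single-target trees contain 348 nodes in total, compared with 11 nodes for the multi-target tree for all parameters, which corresponds to 6 leaves.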

Multivariate analyses such as principal components analysis (PCA), canonical correlation analysis and cluster analysis were used by McIntire (1973, 1978) to determine the relationship between the distribution of benthic diatom species and gradients in salinity and other physical factors in Yaquina Estuary, Oregon. Similar studies include Main and McIntire (1974), Moore and McIntire (1977), and Whiting and McIntire (1985). In these studies, descriptive multivariate techniques, including Q-mode cluster analysis and PCA, were employed to analyze the data. Juggins (1992) developed a ‘salinity transfer function’, using weighted-averaging methodology, by analyzing surface sediments and living source communities of diatoms. Although these techniques provide very useful insights into the data, they are limited in terms of interpretability. Multi-target regression trees, on the other hand, offer models that are readily interpreted. MTRTs are able to identify clusters of samples that are similar in terms of physico-chemical water quality and to describe them in terms of diatom species. To summarize, multi-target regression trees are easily interpretable models of reasonable size and predictive performance.

6. Conclusion

Summary. In this paper, we applied machine learning methodology, in particular multi-target regression trees, to predict the physico-chemical parameters of the environment from the diatom community in Lake Prespa. We managed to express the relationships between the physico-chemical parameters and the diatom abundances. The obtained trees reveal some diatoms as indicators of specific physico-chemical conditions (i.e., of eutrophication or metal contamination). We first assessed the predictive performance of the obtained models and then interpreted their content. The interpretation was done by a domain expert, a biologist who has studied the diatoms in Lake Prespa and collected and processed the samples (S. Krstić).

A comparison of the models was then performed along two dimensions. First, we compared the performance of the models on both training data and unseen data. Second, we compared the models by their interpretation in terms of structure and content. Regarding performance, in our case MTRTs achieve slightly better correlation coefficients than RTs. The presented methodology of multi-target regression trees has several advantages over the more commonly used approach of single-target regression trees. Namely, the MTRTs provide knowledge about all targets at once and, in our case, identify the diatom species that are present in water samples with specific chemical conditions. With the traditional approach, in contrast, one would have to construct a separate model for each chemical parameter and then summarize over the multiple models, which is not a trivial task.

Predictive power. The predictive power of the models on unseen cases, as estimated with 10-fold cross-validation, is weak. Since we suspected that over-fitting might play an important role in this, we applied ‘F-test pruning’ to prevent it. Despite this, the predictive power remained poor. The performance on the training data, and thus the explanatory power, is much better: the tests in the nodes produce a statistically significant reduction in variance at the given significance level.
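
The evaluation protocol referred to here (correlation coefficient and RMSE, measured on the training data and estimated with 10-fold cross-validation) can be sketched in a few lines. The sketch is illustrative only and is not the software used in this study: the least-squares line `fit_line` is a hypothetical stand-in for a regression tree, the function names are our own, and the data are synthetic.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def pearson_cc(y_true, y_pred):
    """Pearson correlation; returns 0.0 if either vector is constant (a convention assumed here)."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    vt = sum((a - mt) ** 2 for a in y_true)
    vp = sum((b - mp) ** 2 for b in y_pred)
    return cov / math.sqrt(vt * vp) if vt > 0 and vp > 0 else 0.0

def fit_line(xs, ys):
    """Stand-in model (not a tree): least-squares line, returned as (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def predict_line(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]

def cross_validate(xs, ys, k=10, seed=0):
    """k-fold cross-validation: every sample is predicted by a model that never saw it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i, p in zip(fold, predict_line(model, [xs[i] for i in fold])):
            preds[i] = p
    return pearson_cc(ys, preds), rmse(ys, preds)

# Synthetic example: one "abundance" feature and one noisy "parameter" target.
rng = random.Random(42)
abundance = [rng.uniform(0.0, 1.0) for _ in range(100)]
parameter = [2.0 * x + rng.gauss(0.0, 1.0) for x in abundance]

train_pred = predict_line(fit_line(abundance, parameter), abundance)
train_cc, train_rmse = pearson_cc(parameter, train_pred), rmse(parameter, train_pred)
xval_cc, xval_rmse = cross_validate(abundance, parameter, k=10)
print(f"Train: CC={train_cc:.2f} RMSE={train_rmse:.2f}")
print(f"Xval:  CC={xval_cc:.2f} RMSE={xval_rmse:.2f}")
```

The gap between the Train and Xval rows printed by such a script is the quantity that the performance tables report for each parameter.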

To investigate the limits of predictive performance for the data at hand, we also built ensembles of tree-based models. Ensembles are well known for their predictive power and are among the top performers, at the cost of producing models that are not easy to interpret. The ensembles yielded predictive performance that was better than that of a single tree, but still modest (the maximum cross-validated correlation coefficient reached was 0.6).

We can thus conclude that the low predictive performance achieved is not a consequence of using an inappropriate methodology, but rather a consequence of the difficulty of the problem addressed. The modelling problem at hand is very difficult, because the lake is a complex ecosystem and the available data were of limited quantity and quality. In order to obtain models with better predictive power, more measurements are needed. These measurements should include additional locations, a longer period of observation and a wider range of measured environmental parameters.
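
As an illustration of the ensemble idea, the following minimal sketch implements bagging: each base model is fitted on a bootstrap sample of the data and the predictions are averaged. It rests on several assumptions of our own: the base learner is a one-split regression tree (`fit_stump`) on a single feature rather than a full tree, all names are hypothetical, and the data are synthetic. A random forest would additionally restrict each split to a random subset of the features, which is not shown.

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree on one feature; returns (threshold, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # the largest value is skipped so the right branch is never empty
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    if best is None:  # all feature values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(model, x):
    thr, lm, rm = model
    return lm if x <= thr else rm

def fit_bagging(xs, ys, n_trees=25, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return models

def predict_bagging(models, x):
    """Ensemble prediction: the average of the individual tree predictions."""
    return sum(predict_stump(m, x) for m in models) / len(models)

# Synthetic example: the target jumps from about 1 to about 5 at abundance 0.5.
rng = random.Random(1)
abundance = [i / 50 for i in range(50)]
parameter = [(5.0 if x > 0.5 else 1.0) + rng.gauss(0.0, 0.5) for x in abundance]
ensemble = fit_bagging(abundance, parameter)
print(predict_bagging(ensemble, 0.1), predict_bagging(ensemble, 0.9))
```

Averaging over many bootstrap models reduces the variance of the prediction, but the resulting ensemble can no longer be read as a single tree, which is the interpretability cost noted above.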

Model interpretation. Multi-target regression trees are a special case of predictive clustering trees, where the tree is viewed as a hierarchy of clusters. In our study, we focus on the clustering part (how the models describe the training data). The developed models clearly reflect and refine the hitherto known ecological preferences of the diatom species in Lake Prespa. The dominant lake diatom flora is composed of species indicative of increased eutrophication levels, and their abundance is directly related to specific physico-chemical parameters. The models built to predict all 18 physico-chemical parameters, the eutrophication parameters and the metal parameters are used to investigate the diatoms’ ability to reflect ecological changes.

The developed models clearly reflect the factors that strongly influence the abundance of the dominant species and of the entire diatom community in Lake Prespa. Cyclotella ocellata (COCE), the most abundant diatom species, appears in almost all the models, but mostly in the eutrophication ones, which indicates its ability to reflect the environmental changes related to eutrophication.
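
The clustering view can be made concrete with a small sketch of how a candidate test on a diatom abundance might be scored: the variance of a cluster is taken as the sum of the per-target variances, and a split is rated by how much it reduces the size-weighted variance of the resulting sub-clusters. This is a common variance-reduction heuristic and only an assumption about the general form; the exact heuristic in the software used here may differ (in practice the targets would typically be standardised first, a step omitted below). All names and numbers are hypothetical.

```python
def variance(values):
    """Population variance of one target."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def multi_target_variance(rows):
    """Sum of the per-target variances over a cluster of target vectors."""
    n_targets = len(rows[0])
    return sum(variance([row[t] for row in rows]) for t in range(n_targets))

def variance_reduction(abundance, targets, threshold):
    """Reduction in size-weighted multi-target variance for the test `abundance <= threshold`."""
    left = [y for x, y in zip(abundance, targets) if x <= threshold]
    right = [y for x, y in zip(abundance, targets) if x > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(targets)
    within = (len(left) * multi_target_variance(left)
              + len(right) * multi_target_variance(right)) / n
    return multi_target_variance(targets) - within

def best_split(abundance, targets):
    """Try midpoints between consecutive distinct abundances; return (threshold, reduction)."""
    values = sorted(set(abundance))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    scored = ((t, variance_reduction(abundance, targets, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))

# Hypothetical data: abundance of one species in four samples, two targets per sample.
abundance = [0.0, 0.0, 5.0, 5.0]
targets = [(1.0, 10.0), (1.0, 10.0), (3.0, 20.0), (3.0, 20.0)]
print(best_split(abundance, targets))
```

Because one test is scored against all targets simultaneously, the leaves of such a tree are clusters of samples that are similar across the whole set of physico-chemical parameters, not just one of them.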

Conclusion. Using multi-target machine learning methods for prediction, we learned models that contribute to our ecological knowledge about the physico-chemical conditions in the lake, using diatom abundances as bioindicators. The predictive power of the learned models is not high, but they provide useful explanations related to the existing ecological knowledge. Several of the diatoms are indicators of important processes, as confirmed by the biological expert and the known literature. Multi-target regression trees are a good illustration of how several parameters at once influence the diatom community.

Multi-target regression trees have been used so far to investigate terrestrial communities, e.g., soil insects (Demšar et al. 2006); to predict chemical parameters of river water quality from bioindicator data (Blockeel et al. 1999) and to predict the condition/quality of indigenous vegetation (Kocev et al. 2009). However, to our knowledge, this is the first use of multi-target regression trees to predict the environmental variables from the composition of diatom communities in a lake ecosystem.

Future work. In the future, we plan to investigate several research scenarios. One of these is to use the diatom community to reconstruct past ecological changes in the lake's history, exploiting the ability of bioindicator abundances to reveal past physico-chemical conditions. This could lead to a better understanding of the effect of the human population on the lake, as well as of the impact of climate change. In other studies, diatoms are widely used as bioindicators to reconstruct patterns in pH, Total Phosphorus and salinity. Another possibility is to represent the diatom community together with its taxonomic structure. The taxonomic structure of the community could then be predicted with hierarchical multi-label classification approaches (Vens et al. 2008).

References

Absil M.C.P. and Van Scheppingen Y. 1996. Concentration of Selected Heavy Metals in Benthic Diatoms and Sediment in the Westerschelde Estuary. Bulletin of Environmental Contamination and Toxicology, 56: 1008-1015.

Admiraal W., Ivorra N., Jonker M., Bremer S., Barranguet C. and Guasch H., 1997. Distribution of diatom species in metal polluted Belgian-Dutch river: An experimental analysis. p.240-244. In: Use of algae for monitoring rivers III, edited by Prygiel J, Whitton BA, Bukowska.

Ahner B.A., Kong S. and Morel F.M.M., 1995. Phytochelatin production in marine algae. 1. An interspecies comparison. Limnology and Oceanography, 40(4): 649-657.

Ahner B.A. and Morel F.M.M., 1995. Phytochelatin production in marine algae. 2. Induction by various metals. Limnology and Oceanography, 40(4): 658-665.

Battarbee R.W., Charles D.F., Dixit S.S. & Renberg I. (1999) Diatoms as indicators of surface water acidity. In: The Diatoms: Applications for the Environmental and Earth Sciences (Eds E.F. Stoermer & J.P. Smol), pp. 85–127. Cambridge University Press, Cambridge.

Blockeel, H., L. De Raedt, and J. Ramon (1998). Top-down induction of clustering trees. In Proc. Fifteenth International Conference on Machine Learning, p. 55–63. San Mateo, CA, Morgan Kaufmann.

Blockeel, H., and J. Struyf (2002). Efficient algorithms for decision tree cross-validation. Journal of Machine Learning Research 3:621–650.

Breiman, L., J.H. Friedman, R.A. Olshen, C.J. Stone (1984). Classification and Regression Trees. Wadsworth.

CEMAGREF (1982) Etude de Méthodes Biologiques Quantitatives d’Appréciation de la Qualité des Eaux. Rapport Q.E. Lyon-A.F.B. Rhône-Méditerranée-Corse. 218 pp.

Davis R.B. and Norton S.A. (1978) Paleolimnological studies of human impact on lakes in the United States, with emphasis on recent research in New England. Polskie Archiwum Hydrobiologii, 15(1/2), 99-115.

Descy J-P. (1979) A new approach to water quality estimation using diatoms. Nova Hedwigia, 64, 305–323.

Garofalakis, M., D. Hyun, R. Rastogi, and K. Shim (2003). Building decision trees with constraints. Data Mining and Knowledge Discovery 7(2):187–214.

Genter R.B., 1996. Ecotoxicology of Inorganic Chemical Stress to Algae. p.403-468. In: Algal Ecology, edited by Stevenson R.J., Bothwell M.L. and Lowe R.L., Academic Press.

Ivorra N, Hettelaar J., Tubbing G.M.J., Kraak M.H.S., Sabater S. and Admiraal W. (1999). Translocation of Microbenthic Algal Assemblages Used for In Situ Analysis of Metal Pollution in Rivers. Archive of Environmental Contamination and Toxicology, 37: 19-28.

Ivorra N.C., 2000. Metal induced succession in benthic diatom consortia. Doctoral dissertation. Faculty of Sciences, University of Amsterdam, The Netherlands. 161p.

Kelly M.G. (2006) A Comparison of Diatoms with other Phytobenthos as Indicators of Ecological Status in Streams in Northern England. Proceedings of the 18th International Diatom Symposium, pp. 139–151. Poland, September 2004. Biopress, Bristol.

Kelly M.G. & Whitton B.A. (1995) The trophic diatom index: a new index for monitoring eutrophication in rivers. Journal of Applied Phycology, 7, 433–444.

Kelly M.G., Juggins S., Bennion H., Burgess A., Yallop M., Hirst H., King L., Jamieson J., Guthrie R. & Rippey B. (2006) Use of Diatoms for Evaluating Ecological Status in UK Freshwaters. Draft final report to Environment Agency. Bristol. 170 pp.

Kelly M.G., Juggins S., Guthrie R., Pritchard S., Jamison J., Rippey B., Hirst H. and Yallop M. (2008) Assessment of ecological status in UK rivers using diatoms. Freshwater Biology, 53, 403-422.

Krstic S. and Levkov Z. (2007): Saprobiological and trophic models for Lake Prespa (saprographs) for use in similar regions and its application for evaluation of Ecological Quality Ratios (indicators). EC-FP6 project "TRABOREMA", EC-Project Contract No. INCO-CT-2004-509177, Deliverable 3.3., 98 pp.

Krstic S., Svircev Z., Levkov Z. and Nakov T. (2007): Selecting appropriate bioindicator regarding the WFD guidelines for freshwaters - a Macedonian experience. International Journal on Algae 9(1), 41-63.

Levkov Z. and Krstic S. (2002): Use of algae for monitoring of heavy metals in the River Vardar, Macedonia. Mediterranean Marine Science, 3/1, 99-112.

Levkov Z., Krstić S., Metzeltin D. and Nakov T. (2006). Diatoms of Lakes Prespa and Ohrid (Macedonia). Iconographia Diatomologica 16: 603.

Levkov, Z., Saul, B., Krstic, S., Nakov, T. and Ector, L. (2007): Ecology of benthic diatoms from Lake Prespa, Macedonia. Archiv für Hydrobiologie: Supplement/ Algological Studies 124: 71-83.

Lomax, R. G. (2007). Statistical Concepts: A Second Course, Routledge, ISBN 0-8058-5850-4

Lowe RL, Pan Y. Benthic algal communities as biological monitors. In: Stevenson RJ, Bothwell ML, Lowe RL, editors. Algal ecology of freshwater benthic ecosystems, aquatic ecology series. Boston: Academic Press, 1996. p. 705–39.

Main, S.P., & McIntire, C.D. (1974). The distribution of epiphytic diatoms in Yaquina Estuary, Oregon (U.S.A.). Botanica Marina, 17, 88–99.

McCormick PV, Cairns Jr J. Algae as indicators of environmental change. J Appl Phycol 1994;6:509–26.

Moore, W.W., & McIntire, C.D. (1977). Spatial and seasonal distribution of littoral diatoms in Yaquina Estuary, Oregon (U.S.A.). Botanica Marina, 20, 99–109.

Patrick, R., & Reimer, C.W. (1966). The diatoms of the United States, exclusive of Alaska and Hawaii. Volume I: Fragilariaceae, Eunotiaceae, Achnanthaceae, Naviculaceae. Academy of Natural Sciences of Philadelphia, Monograph No. 13, 688 pp.

Potapova M.G., Charles D.F., Ponader K.C. & Winter D.M. (2004) Quantifying species indicator values for trophic diatom indices: a comparison of approaches. Hydrobiologia, 517, 25–41.

Rott E., Pipp E., Pfister P., van Dam H., Ortler K., Binder N. & Pall K. (1999) Indikationslisten fur Aufwuchsalgen in Osterreichischen Fliessgewassern. Teil 2: Trophieindikation. Bundesministerium fuer Land- und Forstwirtschaft, Wien. 248 pp.

Stoermer, E.F., and J. P. Smol (2004). The diatoms: Applications for the Environmental and Earth Sciences, Cambridge University Press.

Salomoni, S., O. Rocha, V. Callegaro and E. Lobo (2006). Epilithic diatoms as indicators of water quality in the Gravataí River, Rio Grande do Sul, Brasil. Hydrobiologia 559: 233-246.

Struyf, J., and S. Džeroski (2006). Constraint based induction of multi-objective regression trees. In Proc. Fourth International Workshop on Knowledge Discovery in Inductive Databases, Revised, Selected and Invited Papers, LNCS 3933: 222–233.

TRABOREMA Project WP3, EC FP6-INCO project no. INCO-CT-2004-509177, 2005-2007

Water Framework Directive (WFD), Water Quality - Sampling - Part 2: Guidance on sampling techniques (ISO 5667-2:1991), 1993.

Whiting, M.C., & McIntire, C.D. (1985). An investigation of distributional patterns in the diatom flora of Netarts Bay, Oregon, by correspondence analysis. Journal of Phycology, 21, 655–61.

APPENDIX

Table A1. The names and acronyms of the 116 diatoms whose abundances were used in the data analysis. The top 10 most abundant are given in boldface.

| Diatom | Acronym | Diatom | Acronym |
|---|---|---|---|
| Amphora aequalis | AAEQ | Gomphonema minutum | GMIN |
| Achnanthes sp. | ACH | Gomphonema olivaceum | GOLIV |
| Achnanthidium clevei var. balcanica | ACCLB | Gomphonema parvulum | GPRV |
| Achnanthidium clevei | ACCL | Gomphonema pumilum | GPUM |
| Amphora copulata | ACOP | Gomphonema olivaceoides | GQDR |
| Amphora fogediana | AFOG | Gomphonema sarcophagus | GSRC |
| Achnanthes lacunarum | ALAC | Gomphonema tergestinum | GTRG |
| Amphora inariensis | AMIN | Gyrosigma macedonicum | GYMAC |
| Achnanthidium minutissimum | AMSS | Hannea arcus | HARC |
| Amphora ovalis | AOVAL | Hantzschia amphioxys | HAYX |
| Amphora pediculus | APED | Hippodonta rostrata | HROS |
| Amphora thumensis | ATHUM | Luticola mutica | LMUT |
| Aulacoseira granulata | AUGR | Meridion circulare var. constrictum | MCCC |
| Amphora veneta | AVEN | Meridion circulare | MCRC |
| Caloneis schumaniana | CSCH | Martyana martyi | MMRT |
| Cavinula scutelloides | CSCU | Melosira varians | MVAR |
| Cocconeis disculus | CDIS | Nitzschia alpina | NALP |
| Cocconeis placentula | CPLA | Navicula antonii | NANT |
| Cocconeis placentula var. euglypta | CPLE | Navicula capitatoradiata | NCPR |
| Cocconeis placentula var. lineata | CPLL | Navicula cryptocephala | NCRPH |
| Cocconeis neothumensis | CNTHUM | Nitzschia dissipata | NDISS |
| Cyclotella ocellata | COCE | Neidium dubium | NDUB |
| Cyclotella meneghiniana | CMHGN | Navicula gregaria | NGRG |
| Cymatopleura elliptica | CELL | Navicula hasta | NHAS |
| Cymbopleura juriljii | CJUR | Navicula krsticii | NKRS |
| Cymbella affiniformis | CAFF | Navicula lanceolata | NLAN |
| Cymbella lanceolata | CLAN | Nupela lapidosa | NLAP |
| Cymbella neocistula | CYNC | Nitzschia linearis | NLIN |
| Diatoma angusticostata | DANG | Navicula praetarita | NPRA |
| Denticula tenuis | DCNT | Navicula prespanensis | NPRE |
| Diadesmis gallica var. perpusilla | DGLPS | Navicula protracta | NPTR |
| Diploneis mauleri | DMAU | Nitzschia recta | NREC |
| Diatoma mesodon | DMES | Navicula reinhardtii | NRERH |
| Diploneis modica | DMOD | Navicula rotunda | NROT |
| Diploneis ovalis | DOVAL | Navicula subhastatula | NSHA |
| Epithemia adnata | EADN | Navicula subrotundata | NSROT |
| Encyonema caespitosum | ECAES | Nitzschia subacicularis | NSUA |
| Encyonema minutum | EMIN | Navicula tripunctata | NTPT |
| Encyonopsis microcephala | ENCYM | Navicula viridulacalcis | NVCAL |
| Encyonema silesiacum | ESLS | Navicula viridula | NVIR |
| Epithemia sorex | ESOR | Orthoseira roseana | OROS |
| Fragilaria capucina var. vaucheriae | FCAPV | Placoneis balcanica | PBAL |
| Fragilaria capucina | FCAPV | Pinnularia borealis | PBOR |
| Fallacia ochridana | FOCH | Placoneis minor | PCLM |
| Fragilaria parasitica | FPAR | Placoneis elginensis | PELG |
| Frustulia vulgaris | FVUL | Planothidium lanceolatum | PLLA |
| Gomphonema clavatum | GCLA | Planothidium rostratum | PLLR |
| Geissleria decussis | GDEC | Placoneis neoexigua | PNEO |
| Gomphonema italicum | GITA | Pseudostaurosira brevistriata | PSBR |

Table A1 (ctd). Diatom names and acronyms. The top 10 most abundant are given in boldface.

| Diatom | Acronym | Diatom | Acronym |
|---|---|---|---|
| Pinnularia subcapitata | PSCP | Staurosira construens var. binodis | STCB |
| Rhoicosphenia abbreviata | RABB | Staurosira construens | STCO |
| Rhopalodia gibba | RHGB | Staurosira construens var. venter | STCV |
| Reimeria sinuata | RSIN | Stauroneis phoenicenteron | STPHN |
| Surirella angusta | SANG | Staurosirella pinnata | STPNN |
| Surirella minuta | SMIN | Stauroneis smithii | STSM |
| Sellaphora perbacilloides | SPBA | Tryblionella angustata | TANG |
| Sellaphora pupula | SPUP | Tabellaria flocculosa | TFLOC |
| Stauroneis gracilis | SRGR | Ulnaria ulna | UULN |

Table A2. Performance (correlation coefficient and RMSE) of the ensembles of single-target regression trees (STRT; Bagging and Random Forest) on training data and estimated with 10-fold cross validation.

| Parameter | Bagging CC Train | Bagging CC Xval | Bagging RMSE Train | Bagging RMSE Xval | Random Forest CC Train | Random Forest CC Xval | Random Forest RMSE Train | Random Forest RMSE Xval |
|---|---|---|---|---|---|---|---|---|
| Temp | 0.91 | 0.60 | 3.01 | 5.28 | 0.90 | 0.36 | 2.81 | 7.15 |
| SatO | 0.78 | 0.41 | 12.07 | 17.29 | 0.80 | 0.33 | 11.33 | 19.53 |
| SD | 0.92 | 0.21 | 0.35 | 0.69 | 0.91 | 0.18 | 0.29 | 0.89 |
| Conduc | 0.84 | 0.42 | 15.71 | 25.41 | 0.85 | 0.31 | 14.45 | 31.53 |
| pH | 0.84 | 0.06 | 0.38 | 0.67 | 0.86 | -0.03 | 0.33 | 0.83 |
| NO2 | 0.87 | 0.24 | 0.03 | 0.05 | 0.79 | 0.21 | 0.03 | 0.05 |
| NO3 | 0.87 | 0.50 | 1.13 | 1.85 | 0.88 | 0.22 | 0.99 | 2.49 |
| NH4 | 0.82 | 0.25 | 0.11 | 0.17 | 0.82 | 0.18 | 0.10 | 0.21 |
| TotalN | 0.85 | 0.40 | 0.71 | 1.18 | 0.86 | 0.20 | 0.65 | 1.53 |
| OrgN | 0.81 | 0.24 | 0.69 | 1.09 | 0.81 | 0.13 | 0.65 | 1.30 |
| SO4 | 0.89 | 0.02 | 13.33 | 26.10 | 0.77 | 0.10 | 14.72 | 29.56 |
| TotalP | 0.87 | 0.17 | 8.52 | 15.68 | 0.86 | -0.04 | 7.81 | 20.97 |
| Na | 0.84 | 0.35 | 1.25 | 1.96 | 0.83 | 0.17 | 1.18 | 2.51 |
| K | 0.88 | 0.21 | 0.36 | 0.66 | 0.83 | 0.16 | 0.36 | 0.82 |
| Mg | 0.89 | 0.43 | 1.45 | 2.55 | 0.91 | 0.23 | 1.19 | 3.33 |
| Cu | 0.78 | 0.25 | 1.83 | 2.75 | 0.77 | 0.06 | 1.78 | 3.47 |
| Mn | 0.38 | 0.08 | 15.54 | 17.05 | 0.37 | 0.12 | 15.54 | 17.06 |
| Zn | 0.82 | 0.23 | 2.75 | 4.34 | 0.84 | 0.16 | 2.42 | 5.31 |

Table A3. Performance (Correlation coefficient and RMSE) of the ensembles of multi-target regression trees (Bagging and Random Forest) on training data and estimated with 10-fold cross validation for all parameters.

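| Parameter | Bagging CC Train | Bagging CC Xval | Bagging RMSE Train | Bagging RMSE Xval | Random Forest CC Train | Random Forest CC Xval | Random Forest RMSE Train | Random Forest RMSE Xval |
|---|---|---|---|---|---|---|---|---|
| Temp | 0.90 | 0.58 | 3.25 | 5.37 | 0.88 | 0.36 | 3.18 | 6.94 |
| SatO | 0.77 | 0.36 | 12.70 | 17.51 | 0.69 | 0.21 | 13.47 | 20.50 |
| SD | 0.90 | 0.12 | 0.42 | 0.70 | 0.76 | 0.13 | 0.46 | 0.81 |
| Conduc | 0.82 | 0.42 | 17.09 | 25.21 | 0.79 | 0.26 | 17.14 | 30.16 |
| pH | 0.82 | 0.07 | 0.42 | 0.66 | 0.76 | 0.04 | 0.41 | 0.81 |
| NO2 | 0.88 | 0.25 | 0.03 | 0.04 | 0.78 | 0.12 | 0.03 | 0.05 |
| NO3 | 0.87 | 0.50 | 1.17 | 1.85 | 0.84 | 0.31 | 1.16 | 2.29 |
| NH4 | 0.82 | 0.18 | 0.12 | 0.17 | 0.70 | 0.07 | 0.13 | 0.21 |
| TotalN | 0.84 | 0.42 | 0.75 | 1.16 | 0.78 | 0.29 | 0.80 | 1.38 |
| OrgN | 0.80 | 0.28 | 0.72 | 1.07 | 0.71 | 0.14 | 0.78 | 1.28 |
| SO4 | 0.91 | 0.01 | 12.96 | 24.75 | 0.81 | 0.07 | 13.45 | 31.43 |
| TotalP | 0.88 | 0.21 | 9.01 | 15.05 | 0.79 | 0.09 | 9.32 | 18.62 |
| Na | 0.82 | 0.36 | 1.34 | 1.96 | 0.78 | 0.26 | 1.32 | 2.27 |
| K | 0.87 | 0.22 | 0.40 | 0.65 | 0.77 | 0.09 | 0.42 | 0.79 |
| Mg | 0.88 | 0.45 | 1.58 | 2.55 | 0.81 | 0.29 | 1.66 | 3.11 |
| Cu | 0.78 | 0.20 | 1.89 | 2.77 | 0.75 | 0.02 | 1.86 | 3.39 |
| Mn | 0.39 | 0.08 | 15.57 | 16.97 | 0.34 | 0.06 | 15.73 | 17.48 |
| Zn | 0.80 | 0.20 | 2.95 | 4.36 | 0.74 | 0.12 | 2.98 | 5.23 |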

Table A4. Performance (Correlation coefficient and RMSE) of the ensembles of multi-target regression trees (Bagging and Random Forest) on training data and estimated with 10-fold cross validation for eutrophication parameters.

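| Parameter | Bagging CC Train | Bagging CC Xval | Bagging RMSE Train | Bagging RMSE Xval | Random Forest CC Train | Random Forest CC Xval | Random Forest RMSE Train | Random Forest RMSE Xval |
|---|---|---|---|---|---|---|---|---|
| SD | 0.92 | 0.15 | 0.37 | 0.70 | 0.88 | 0.04 | 0.34 | 0.93 |
| TotalN | 0.85 | 0.43 | 0.73 | 1.16 | 0.83 | 0.35 | 0.72 | 1.34 |
| TotalP | 0.88 | 0.22 | 8.69 | 15.16 | 0.83 | 0.05 | 8.46 | 20.18 |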

Table A5. Performance (Correlation coefficient and RMSE) of the ensembles of multi-target regression trees (Bagging and Random Forest) on training data and estimated with 10-fold cross validation for metals.

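| Parameter | Bagging CC Train | Bagging CC Xval | Bagging RMSE Train | Bagging RMSE Xval | Random Forest CC Train | Random Forest CC Xval | Random Forest RMSE Train | Random Forest RMSE Xval |
|---|---|---|---|---|---|---|---|---|
| Na | 0.83 | 0.37 | 1.30 | 1.95 | 0.79 | 0.28 | 1.27 | 2.30 |
| K | 0.88 | 0.22 | 0.39 | 0.65 | 0.81 | 0.07 | 0.38 | 0.86 |
| Mg | 0.89 | 0.45 | 1.52 | 2.54 | 0.87 | 0.27 | 1.40 | 3.18 |
| Cu | 0.77 | 0.21 | 1.90 | 2.76 | 0.71 | 0.06 | 1.95 | 3.38 |
| Mn | 0.39 | 0.07 | 15.55 | 17.03 | 0.33 | 0.08 | 15.80 | 17.23 |
| Zn | 0.81 | 0.21 | 2.88 | 4.35 | 0.76 | 0.02 | 2.87 | 5.64 |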

Single Target Regression Tree for the Temperature

STRT for the Saturated Oxygen

STRT for Secchi Disk

STRT for Conductivity

STRT for pH

STRT for NO2

STRT for NO3

STRT for Total Nitrogen

STRT for OrgN

STRT for SO4

STRT for Total Phosphorus

STRT for Na

STRT for K

STRT for Mg

STRT for Cu

STRT for Mn

STRT for Zn

$$N \times \sum_{t=1}^{T} \mathrm{Var}[y_t]$$
