Applying machine learning
techniques to ecological data
Georgios Petkos
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2003
Abstract
This thesis is about modelling carbon flux in forests from meteorological variables
using modern machine learning techniques. The motivation is to better understand the
carbon uptake process of trees and to find its driving factors using fully automated
techniques. Data from two British forests (Griffin and Harwood) were used, but final
results were obtained only with Harwood, because the Griffin data set contained
spurious variables. Both data sets presented significant challenges: missing values,
noise and high dimensionality. The missing value problem was addressed with the
regularized EM algorithm, whereas for filtering out noise, n-point moving averages
were used. A range of different ‘semi-wrapper’ methods and a filter method were used
for dimensionality reduction: forward selection, backward elimination, best ascent hill
climbing, genetic algorithms, evolutionary strategies and correlation-based feature
selection. Modelling was done with Multiple Linear Regression, Multilayer Perceptrons
and Support Vector Regression. The best model found explained at most 83% of the
variance. Support Vector Regression and Multilayer Perceptrons had almost the same
performance and were better than Multiple Linear Regression, since they managed to
capture non-linear details of the process.
Acknowledgements
I would like to thank the Bodossaki Foundation for funding my studies. It would not
be possible for me to be here without the Foundation’s help and I am grateful for the
chance that I was given.
I would also like to thank my supervisor, Dr. John Levine, for his advice and support
when times were hard.
Many thanks to all the zata people: Jordi, Fransisco, Roman, Vicky, Alexandros,
Stathis! But most of all to Ioanna, Christophoros, Nikos, Giorgos. Finally, a big
thanks to Bill Steer, Jeff Walker, Ken Owen, Aaron Stainthorpe, Martin Powell, Andy
Craighan and Dan Swano for inspiration through the years.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Georgios Petkos)
To my mother, who tired herself so much through the years to teach me so many things.
Table of Contents
1 Introduction 1
1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of the chapters . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Related work 3
2.1 Ecological Informatics . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Previous work on flux prediction problems . . . . . . . . . . . . . . . 4
3 The data 7
3.1 The Griffin data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 The Harwood data set . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Data cleaning 13
4.1 The missing value problem . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Methods for handling missing values . . . . . . . . . . . . . . . . . . 16
4.2.1 List-wise deletion . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.2 Substitution with mean value . . . . . . . . . . . . . . . . . . 17
4.2.3 Nearest neighbor method . . . . . . . . . . . . . . . . . . . . 18
4.2.4 Regression methods . . . . . . . . . . . . . . . . . . . . . . 19
4.2.5 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 The EM algorithm for gap filling . . . . . . . . . . . . . . . . . . . . 20
4.4 Suitability of the EM algorithm to our problem . . . . . . . . . . . . 23
4.5 Noise / smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Dimensionality reduction 29
5.1 The dimensionality reduction problem . . . . . . . . . . . . . . . . . 30
5.2 The semi-wrapper methods . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.1 Forward selection . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.2 Backward elimination . . . . . . . . . . . . . . . . . . . . . 34
5.2.3 Best ascent hill climbing . . . . . . . . . . . . . . . . . . . . 35
5.2.4 Genetic search . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.5 Evolution strategies . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 The filter method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.1 Correlation-based feature selection . . . . . . . . . . . . . . . 40
5.4 Results of the feature selection methods . . . . . . . . . . . . . . . . 41
5.4.1 Feature selection results for the Griffin data set . . . . . . . . 41
5.4.2 Feature selection results for the Harwood data set . . . . . . . 43
5.5 Discussion about the results of feature selection . . . . . . . . . . . . 45
6 Modelling 46
6.1 Modelling techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.1 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . 47
6.1.2 Multilayer Perceptrons . . . . . . . . . . . . . . . . . . . . . 48
6.1.3 Support Vector Regression . . . . . . . . . . . . . . . . . . . 51
6.2 Modelling design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2.1 Parameter optimization . . . . . . . . . . . . . . . . . . . . . 56
6.2.2 Time series modelling . . . . . . . . . . . . . . . . . . . . . 57
6.2.3 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 58
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.1 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . 58
6.3.2 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . 59
6.3.3 Support Vector Regression . . . . . . . . . . . . . . . . . . . 60
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7 Conclusions and further work 63
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2.1 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2.2 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . 66
7.2.3 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 General discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Bibliography 68
List of Figures
3.1 Missing values in the Griffin data set . . . . . . . . . . . . . . . . . . 8
3.2 Missing values in the reduced Griffin data set . . . . . . . . . . . . . 10
3.3 Missing values in the Harwood data set . . . . . . . . . . . . . . . . 12
4.1 Filled parts for the Fc and the m Ustr L0 L0 variables of the Griffin
data set. The variables maintain a behaviour in the filled part similar
to the one that is observed in the present part . . . . . . . . . . . . . . 24
4.2 Filled parts for the tFc and the tMeanU variables of the Harwood data
set. The variables maintain a behaviour in the filled part similar to the
one that is observed in the present part . . . . . . . . . . . . . . . . . 25
4.3 Unsmoothed, 3-point, 5-point and 10-point moving averages for the
first 300 points of the variable tmvpd of the Harwood data set . . . . . 27
5.1 MSE as forward selection adds variables for the Griffin data . . . . . 42
5.2 MSE as forward selection adds variables for the unsmoothed and 5-
point smoothed Harwood data . . . . . . . . . . . . . . . . . . . . . 44
6.1 The least squares solution given a set of observations (xi, yi) . . . . . . 47
6.2 A linearly separable problem (left) and a non-linearly separable prob-
lem (right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 The output of a neuron with a sigmoid transfer function. . . . . . . . 49
6.4 A MLP with one input layer, one hidden layer and one output layer . . 50
6.5 (a) Some separating hyperplanes and the maximum margin one (b) The
maximum margin hyperplanes H, H1 and H2 . . . . . . . . . . . . . 52
6.6 Transformation from the input space to the feature space where the
problem is linearly separable . . . . . . . . . . . . . . . . . . . . . . 54
6.7 The epsilon function . . . . . . . . . . . . . . . . . . . . . . . . . . 56
List of Tables
5.1 The forward selection algorithm . . . . . . . . . . . . . . . . . . . . 34
5.2 The backward elimination algorithm . . . . . . . . . . . . . . . . . . 34
5.3 The best ascent hill climbing algorithm . . . . . . . . . . . . . . . . . 35
5.4 A genetic algorithm for feature selection . . . . . . . . . . . . . . . . 37
5.5 Evolutionary strategy for feature selection . . . . . . . . . . . . . . . 39
6.1 The results for MLR . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 The results for MLPs . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 The results for SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 1
Introduction
1.1 The problem
The task of this study is to model carbon flux in a forest based on various physical
measurements. This problem is of crucial importance in ecology, since a better under-
standing of the carbon uptake process of forests could potentially help us face one
of the main ecological threats, the greenhouse effect. Although this problem has been
much studied by ecologists and biologists, the methods used are mainly inflexible
parametric models that yield insufficiently accurate results. Our aim is to investigate
the application of more elaborate machine learning modelling techniques to the car-
bon flux prediction problem. Machine learning methods, and especially the ones that
we plan to use, Support Vector Regression and Multilayer Perceptrons, have the ability
to represent complex relationships between variables and therefore look promising for
our task.
The focus of this study is on producing models without the use of prior knowledge about
the problem, in a totally automated way. That is, we have a collection of data and we
explore it to try to find good models. Thus, the problem is seen from the point of view
of a non-expert in relevant biological issues. We make no modelling decisions based
on knowledge of the process of photosynthesis or transpiration, for example. We just
explore collected data sets looking for models that explain the observed values for car-
bon flux.
Two data sets were used: the Griffin and the Harwood data set. Focus was initially on
Griffin, but we had to shift to Harwood. The challenges that these data sets presented
were many: missing values, noise, spurious variables, selection of relevant variables,
difficulties in training the models. All of these are thoroughly presented in the next
chapters, an overview of which is given below.
1.2 Overview of the chapters
Chapter 2 introduces the general field of Ecological Informatics, in which this project
is situated, and describes some previous work on the application of machine learning
techniques to carbon flux prediction in forests. We also point out some issues about
previous studies and their limitations.
Chapter 3 gives the basic information about the two data sets used, the Griffin and the
Harwood data set. The basic outline of data processing is sketched based on the prop-
erties of the data sets, and we get a clear idea of the obstacles that we have to face in
order to find good models.
Chapter 4 discusses two important steps of preprocessing: handling missing values and
noise removal. Both are part of the data cleaning part of processing. This is necessary
for making modelling possible. It is also an extremely sensitive procedure since it can
alter the real information contained in the data and lead to wrong models.
Chapter 5 discusses the problem of dimensionality reduction and selection of relevant
features from the data. Various methods are presented. Finally we decide about the
features that will be used for the final modelling part.
Chapter 6 describes the machine learning modelling techniques that were used, Sup-
port Vector Regression and Multilayer Perceptrons, along with Multiple Linear Re-
gression that was used for feature selection and for comparative modelling. We also
discuss our modeling design choices and present the results of modelling.
Finally, chapter 7 critically discusses the whole process of data analysis, presents some
final conclusions for our study and sketches out some directions for further research
and improvement of the methods used.
Chapter 2
Related work
2.1 Ecological Informatics
The modelling problem of this project belongs to the novel field of Ecological In-
formatics. A definition of Ecological Informatics appearing on the web site of the
International Society For Ecological Informatics1 is:

    Ecological Informatics is defined as interdisciplinary framework promoting
    the use of advanced computational technology for the elucidation of principles
    of information processing at and between all levels of complexity of ecosystems
    - from genes to ecological networks - and aiding transparent decision-making
    in relation to important issues in ecology such as sustainability, biodiversity
    and global warming.
Therefore, Ecological Informatics concerns the use of modern modelling and com-
puting techniques on problems of ecological interest. In this context, modern artifi-
cial intelligence techniques and in particular machine learning techniques should be
applicable. Indeed, machine learning techniques are becoming a useful tool for peo-
ple working in the field. Researchers have applied artificial neural networks (Lek and
Guegan, 1999), evolutionary algorithms (Whigham and Recknagel, 2001), cellular au-
tomata (Parrot and Kok, 2000) and fuzzy logic (Chen et al., 2000), among others, to
various ecological modelling problems with good results.
Although Ecological Informatics is a novel field and there hasn’t been extensive work
1http://www.waite.adelaide.edu.au/ISEI/
so far, it looks like there is a constantly growing interest in the application of various
machine learning techniques to ecology-related problems. This is reflected in the activ-
ity of the International Society For Ecological Informatics, that has already organized
three conferences (1998, 2000 and 2002). The first conference focused on the applica-
tion of artificial neural networks to ecological modelling problems (Lek and Guegan,
1999). However, in the next two conferences, work with a much wider variety of meth-
ods was presented, indicating the growing interest of researchers of the area in novel
techniques (Recknagel, 2001).
2.2 Previous work on flux prediction problems
The range of ecosystems to which such modelling methods have been successfully
applied is quite broad: lakes, rivers, forests and many more. It seems promising that
machine learning methods can be useful for a large variety of problems. Nevertheless,
there is not much previous work on problems related to flux prediction in forests.
Below we outline the related studies that were found.
Artificial neural networks were used in (Vrugt et al., 2002) to identify the main driving
factors, contained in a small set of variables, that determine the levels of forest floor
water transpiration and total forest water transpiration. The best models for forest floor
water transpiration achieved about 80% explained variance using as inputs global
radiation, air temperature and average water content between 0 and 2 metres below
the surface of the ground, whereas the best models for total forest water transpiration
achieved 84% explained variance using as inputs global radiation, air temperature and
average water content between 0 and 0.5 metres below the surface of the ground.
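The ‘explained variance’ figure quoted in these studies (and later in this thesis) is the usual R² statistic. A minimal sketch of how it is computed (the function and variable names are ours, not from any of the cited studies):

```python
def explained_variance(y_true, y_pred):
    """Fraction of the variance of y_true accounted for by y_pred (R^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# A perfect model explains all of the variance:
print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # -> 1.0
```

A model that always predicts the mean of the targets scores 0, so the 80–90% figures above mean the models capture most, but not all, of the variation.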
Another example of application of a machine learning method to a water transpiration
problem can be found in (Dekker et al., 2001). In this study the residual of a parametric
model for forest transpiration (Single Big Leaf) was modelled using artificial neural
networks. It was found that part of the trend in the residuals was explained by wind
direction and speed, which were not included in the parametric model. It was also
found that a big part of the residuals could be attributed to noise produced by the
measurement method.
A study more closely related to our problem can be found in (van Wijk and Bouten, 1999).
van Wijk models the water and carbon flux of a forest using artificial neural networks. He
is interested in finding site-independent models; this is the ideal goal, which we could
not pursue in our study. The inputs to the artificial neural network he used were:
radiation, temperature, vapour pressure deficit and time of day. He used different
combinations of these, and for the carbon models he also used a ‘Leaf Area Index’
variable in order to make site-independent prediction possible. He also did a lot of
preprocessing, removing data points that would probably be prone to noise, making the task
easier. Water flux predictions were more accurate than carbon flux predictions. The
best model for water flux had 90% explained variance and the best model for carbon
flux had 87% explained variance. It is worth noting, though, that the data used in that
model were limited to a period of 41 days. Getting good results in this case was not
very difficult, since over such a short time there are no significant seasonal changes
and carbon flux probably maintains a regular behaviour. On the other hand, modelling
the process over a longer period of time, as in our experiments, is much more
difficult, since big changes in the behaviour of the forest appear at various times of the
year according to climatic changes.
Finally, (Stubbs, 2002) is an MSc thesis dealing with carbon flux prediction. In this
study we find the first application of Support Vector Regression to a carbon flux
prediction problem. Stubbs also used Multiple Linear Regression for comparison. The
variables used were the same that van Wijk used, except for Leaf Area Index, and var-
ious combinations of them were tried. The data set used was the Harwood data
set, one of the data sets that were also used here, which will be presented in the
next chapter. The best Support Vector Regression model found explained 89% of the
variance, but again it was limited to data that corresponded to observations of a short
period, about one and a half months.
In all those studies, various combinations of only three or four basic variables were
used for prediction. In all of them, there was some prior knowledge about which
factors could be important for modelling the carbon uptake process. The different
approach that our study introduces is that we use no prior knowledge in model building.
We do not choose the input variables of our models using any knowledge about the
process and we just search for the critical variables.
Chapter 3
The data
Two data sets were used, the Griffin data set and the Harwood data set. Our initial
intention was to use the Griffin data set and face the big challenges it presented. Un-
fortunately though, at the end of preprocessing, it was found that it contained spurious
variables. The exact details of this are unknown but at least some of the variables were
found to be just duplicates of others with very little statistical processing. This made
all results obtained thus far invalid. Finding the spurious variables and excluding them
would have been possible but very expensive. Thus, given the short amount of time available
when this was discovered, it was decided to stop working with the Griffin data set. At
that point, we switched to the smaller Harwood data set and repeated the preprocessing
steps. Nevertheless, in this text we will describe work done on both data sets because
various interesting issues appeared for both of them. In this chapter we will discuss
their basic properties and sketch briefly further processing steps for both of them.
3.1 The Griffin data set
The Griffin data set1 comes from Griffin Forest, located at Aberfeldy in Scotland.
This is a forest with 20 year old trees and the dominant species is Sitka spruce (Picea
sitchensis). The collection of the data was part of the EUROFLUX project that collects
carbon flux and related meteorological data from various forests around Europe.
1http://carbodat.ei.jrc.it/data_arch_site_indiv.cfm?db_id=11
It is a large data set with 310 variables and 102645 datapoints. The meanings of the
variables were not known to us, but they were supposed to be half-hourly measurements
spanning a period of six years, plus physical quantities computed from the mea-
sured variables. We only knew the meaning of two variables: one was carbon flux,
and the other was a corrected version of carbon flux that should be excluded from our
analysis. Unfortunately, some of the other variables were also spurious.
Many issues arose when we decided to use the Griffin data set. First of all was its
size: initially it was about 240 MB. Sophisticated algorithms are computationally
very expensive for a data set of this magnitude. In addition, it had many missing val-
ues; about 40% of the values were missing. Considering the fact that the modelling
tools that we intend to use do not work with missing values, an appropriate method for
handling missing values was essential. Figure 3.1 can give an idea about the missing
value problem in the Griffin data set. Every variable is a row and time goes through the
horizontal axis, and a white point represents a missing value.

Figure 3.1: Missing values in the Griffin data set

It is obvious that at the beginning and the end of the six-year period, the data are
rather sparse. Given the size of the data
set and the cost of further processing, it was decided to ignore those big empty parts.
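The decision of which time regions to cut can be guided by a simple computation of the fraction of missing entries per time slice. The following is an illustrative sketch (not the actual code used in this project), assuming the data are held as a time-by-variables NumPy array with NaN marking the gaps:

```python
import numpy as np

def missing_fraction_per_window(data, n_windows):
    """Fraction of NaN entries in each of n_windows equal time slices
    of a (time x variables) array."""
    windows = np.array_split(data, n_windows, axis=0)
    return [float(np.isnan(w).mean()) for w in windows]

# Toy example: the second half of the series is entirely missing.
x = np.ones((100, 3))
x[50:, :] = np.nan
print(missing_fraction_per_window(x, 2))  # -> [0.0, 1.0]
```

Slices whose missing fraction is far above the overall average are the natural candidates for discarding, which is essentially what the visual inspection of Figure 3.1 suggested for the two ends of the record.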
One could argue that this is a bad decision. Doing this, we ignore information that is
present in the data set and this could mean that our models may not be as accurate as
they could be. If we consider it again, though, cutting those two parts looks like a good
idea. This is because the data set describes a phenomenon in time, and the phenomenon
is almost periodic: similar patterns appear daily, and seasonal changes should
affect carbon flux in the same way every year. Considering also the fact that after this
reduction we still have about 60000 datapoints over a period of more than three and
a half years, it looks quite probable that information contained in the ignored part is
also present in the part that is kept. On the other hand, there could be some distinct
difference in the information contained in the part of the data set that was ignored and
the part that was kept. This could be, for example, just by luck or because the trees
get older. In addition, if we plan to do gap filling, we can expect that for datapoints
with very few present values, like the ones that appear at the beginning and the end of
the six-year period, the missing values will probably not be very well estimated. This
could result in erratic models and further justifies the decision to discard the largely
empty parts at the beginning and the end of the six-year period.
Furthermore, it was found that a few variables had zero variance, i.e. they had a constant
value. Although we did not know their meaning, it was obvious that these variables
could not be good predictors of carbon flux, so they were discarded. Also, the variables
that had missing values for a very large part of the datapoints were discarded. This is
because in gap filling we make estimates of the missing values based on the relationship
between the variable that is missing and the others; if a variable is missing most
of the time, we do not have a good sample from which to learn that relationship, and our
estimates will be poor. It is also probable that relevant information contained in those
variables can be extracted from the others, since we already knew that most of the
variables have a strong relation with at least some of the others. We set the threshold
for discarding a variable at 97% missing values. All these reductions look quite drastic,
but there is a lot to be gained (a cleaner and smaller data set) and probably very little
to be lost (perhaps some loss of information).
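The two screening rules just described (discard zero-variance variables and variables with more than 97% missing values) can be sketched as follows. This is an illustration under the same NaN-for-missing convention, not the code actually used:

```python
import numpy as np

def screen_variables(data, max_missing=0.97):
    """Return column indices that survive the two screening rules:
    at most max_missing fraction of missing values and non-zero variance."""
    keep = []
    for j in range(data.shape[1]):
        col = data[:, j]
        present = col[~np.isnan(col)]
        if present.size == 0:
            continue
        if np.isnan(col).mean() > max_missing:
            continue  # too sparse to learn its relation to the other variables
        if np.var(present) == 0.0:
            continue  # constant variable, no predictive value
        keep.append(j)
    return keep

# Column 0 is constant, column 1 is 98% missing, column 2 survives.
rng = np.random.default_rng(0)
data = np.column_stack([
    np.full(100, 5.0),
    np.where(np.arange(100) < 98, np.nan, 1.0),
    rng.normal(size=100),
])
print(screen_variables(data))  # -> [2]
```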
After these reductions, the data set had 281 variables and about 60000 datapoints. Missing
values are still present to quite a large extent: about 15% of the values are missing. This can
be seen in Figure 3.2. Handling missing values is still an important issue and will be
discussed in the next chapter.

Figure 3.2: Missing values in the reduced Griffin data set
Another important issue is the high dimensionality of the data. There are too many
variables to use directly in a complex machine learning algorithm; the computational
cost of training would be very high. Also, it is quite probable that at least some of
the variables are not relevant to the modelling problem and should be excluded from
further analysis. Picking only the necessary variables also alleviates the problem of over-
fitting: the modelling algorithm could be misled by irrelevant variables and end up
modelling a false relationship between the target variable and the irrelevant ones.
Dimensionality reduction will be discussed in a later chapter.
Another issue was noise. Information about the source of noise in the data was not
available but since noise is almost always present in physical measurements, some-
thing should be done about it. This will be discussed later again, after we are done
with the missing value problem.
Concluding, we can say that the size of the Griffin data set, its high dimensionality
and the large number of missing values that appear with no regularity posed a very big
challenge.
3.2 The Harwood data set
Harwood2 was a much smaller data set. It had observations from 3 different sites, all
located in Northumberland, England. The objective of collecting these data was to
measure carbon flux for trees of different ages. The ‘d’ site had mostly weeds, the
‘h’ site had 7 year old trees and the ‘t’ site had 30 year old trees. We had 20 variables
measured for the ‘d’ site, 19 for the ‘h’ site and 18 for the ‘t’ site. This
time there were no spurious variables, which was checked by plotting the variables
together. For each of the sites there were 26581 datapoints, each one corresponding to
a half-hourly measurement. The data set covers a period of about 18 months.
Similarly to the Griffin data set, the Harwood data set had quite a lot of missing values,
as can be seen in Figure 3.3. The ‘d’ site and the ‘h’ site in particular are very sparse.
The data for the ‘t’ site were the most complete: about 80% of the values are present.
Therefore, it was decided that the data for the ‘t’ site would be used. Here we did not
consider reducing the data set directly, because there is no period that is very sparse,
nor any variables that are very sparse or have very low variance. Since this
is a smaller data set, such a reduction could lead to significant loss of information.
Handling the missing values was again essential before proceeding to modelling, and
noise removal was also an issue that had to be considered. In addition, dimen-
sionality reduction, although no longer essential from the computational cost point
of view, should be performed on this data set as well. We believe that not all variables
are important for modelling, and therefore better results will be obtained if we pick the
best of them.
Concluding, the Harwood data set was rather small and had different properties from
Griffin. The problem it posed was easier, but still quite challenging.
2http://www.bgc-jena.mpg.de/public/carboeur/sites/harwood.html
Figure 3.3: Missing values in the Harwood data set (d-site, h-site and t-site)
Chapter 4
Data cleaning
Missing values were a major issue in this study. The Griffin data set originally had
about 40% missing values (Figure 3.1), and after the partial reduction this decreased to
about 15% (Figure 3.2). A complete data set was needed for both the dimensional-
ity reduction and the modelling algorithms, therefore a method for handling missing
values had to be used. Considering the fact that an inappropriate method can bias
the information contained in the data set, so that further processing results in incorrect
models, it is clear that handling the missing values is a very important part of data
preprocessing.
For the Griffin data set, a few approaches were tried (simple linear modelling, artificial
neural networks and a nearest neighbor estimator) but finally a regularized Expectation
Maximization (EM) algorithm (Schneider, 2001) was used. For the Harwood data set,
only the EM algorithm was used, because it had already proved to be a good solution
on the Griffin data set and gave, based on basic statistics and visual inspection with
plots, seemingly good results.
Although we cannot be really sure that a specific method does not distort the actual
information contained in the data set, we have some confidence that the bias introduced
is not significant, since this is considered to be one of the best gap filling methods.
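Schneider’s regularized EM algorithm alternates between estimating the statistics of the data and re-filling the gaps with regularized regression estimates. The following is a heavily simplified, unregularized sketch of that iterate-and-refill idea, for illustration only (the real algorithm uses ridge regularization and mean/covariance estimates; all names here are ours):

```python
import numpy as np

def iterative_regression_impute(data, n_iter=20):
    """Simplified EM-style gap filling: initialise gaps with column means,
    then repeatedly re-estimate each gap by a least-squares regression of
    its variable on all the other variables."""
    x = data.copy()
    mask = np.isnan(x)
    col_means = np.nanmean(data, axis=0)
    x[mask] = np.take(col_means, np.where(mask)[1])  # crude initialisation
    for _ in range(n_iter):
        for j in range(x.shape[1]):
            rows = mask[:, j]
            if not rows.any():
                continue  # nothing missing in this variable
            others = np.delete(x, j, axis=1)
            a = np.column_stack([np.ones(len(x)), others])
            # fit column j on the other columns using only observed rows
            coef, *_ = np.linalg.lstsq(a[~rows], x[~rows, j], rcond=None)
            x[rows, j] = a[rows] @ coef  # update the gap estimates
    return x

# Two perfectly correlated columns: the gap is recovered almost exactly.
d = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
filled = iterative_regression_impute(d)
print(round(float(filled[2, 1]), 3))  # -> 6.0
```

The sketch shows why the method relies on the variables being strongly related: the quality of each fill is exactly the quality of the regression of the gapped variable on the others.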
Another important issue was the presence of noise. Better models can be found if we
filter out noise in a suitable way. Smoothing the data and trying to fit models using the
smoothed data is an option that should be tried.
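An n-point moving average of the kind used for smoothing can be sketched as follows (illustrative code, not the thesis’s actual implementation; here the first n-1 points are averaged over the values available so far):

```python
def moving_average(values, n):
    """Simple n-point moving average over a sequence of numbers."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]  # last n points seen so far
        out.append(sum(window) / len(window))
    return out

# Smoothing damps an isolated noisy spike:
print(moving_average([0.0, 0.0, 9.0, 0.0, 0.0], 3))  # -> [0.0, 0.0, 3.0, 3.0, 3.0]
```

The choice of n trades noise suppression against blurring of real short-term structure, which is why several window sizes (3, 5 and 10 points) are compared later.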
In this chapter we will discuss both issues. Both are data cleaning tasks, quite im-
portant in data preprocessing. First we will talk about the nature of the missing value
problem in general and in the data sets of this study specifically. We will next briefly
examine some commonly used methods for handling missing values and their appli-
cability to our problem. Next we will give a short description of the regularized EM
algorithm that was actually used to get the complete data sets used for the rest of this
project and discuss its suitability to these data sets. Finally we will talk about
filtering out the noise from the data.
4.1 The missing value problem
The presence of missing values is a very important issue in data preprocessing. There
are not many modelling tools that can deal with missing values (decision trees can deal
with missing values for example), therefore handling missing values is an essential
preprocessing task if a model that cannot directly handle missing values is to be used.
Handling missing values is a very tricky problem though. The modeller must have
some confidence that the method used does not significantly distort the information
contained in the observed values, i.e. that it does not add patterns to the completed
data set that are not present in the incomplete data set. We want just to maintain the
information contained in the incomplete data set without adding any artificial
information generated by the method used. If an inappropriate method is used, then
the results obtained from subsequent processing steps will be biased by this method
and they will not be reliable. Obviously this is not a desirable situation, but it turns out
that it is very difficult to be sure about the suitability of a specific method.
When coping with a missing value problem we have to consider some basic things.
First, we have to ask whether the data is missing at random (MAR), i.e. whether there
is a systematic reason for the absence. For example, if in a data set collected from
job applications the previous employer field is left empty, we would probably not
consider this to be missing at random: there is probably some reason for the absence
and it conveys information. If the data is MAR, then there is no information in the
absence itself and we can fill in the missing values with estimates based on the
observed data without losing any information. In our case, the
Chapter 4. Data cleaning 15
data is MAR: there is no hidden information in the absence of the data, since the
reason for the absence is the unavailability of physical measurements. This holds for
both data sets. Thus, we can fill in the missing values with estimates.
At this point, we also have to ask whether the missing variables are related to the
ones that are not missing. If they are not, we cannot predict the missing values from
the others in a consistent way: the information is really missing. If, on the other
hand, they are related, there are methods that can be used, which will be described
in the next section. In our case we believe that there are strong relations between
the variables and we can therefore model the missing values based on the variables
that are present.
The previous point does not hold if the data is periodic, in which case missing values
can be filled based on previous values of the same variable rather than on other
variables. Even then, if the gaps are big and there is some noise, prediction becomes
very difficult and the estimates of missing values are likely to be inaccurate. In both
data sets, although most variables present some periodic patterns, this approach would
not be suitable because we have quite big gaps. Another thing we have to consider
when dealing with a missing value problem is the cost of the possible solution. There
is a trade-off between the time required to compute estimates for the missing values
and the quality of those estimates, and we have to make a compromise between the two.
The cost of the method used is determined by a variety of factors, some of which are
examined below.
First is the size of the data set. This can be very important when choosing a method to
handle missing values. Not all gap filling algorithms scale well to big data sets:
the time needed by some algorithms grows very fast with the size of the data set, and
some may even be totally infeasible for large data sets. The Griffin data set is quite
big, so ideally we would like a method that is not too expensive for a large data set
but still produces sufficiently good results. The situation is different for the
Harwood data set: it is many times smaller than Griffin, so an expensive but accurate
method could be used more easily.
Another issue, related to the cost of the method used, is the number of missing values
and the number of different patterns of missing values appearing in the data set. If the
number of missing values is not that big, we could use, even for a big data set, a
simpler but possibly less accurate method rather than a more sophisticated one; in
such a case, further processing of the data would probably not be largely affected.
Of course, if the data set is small and at the same time there are not many missing
values, a sophisticated but expensive method can probably be used. In the Griffin data
set, initially 40% of the values were missing, and after the partial reduction this
decreased to about 15%, which is still quite a big fraction of the data set. Therefore
we should prefer a method that gives quite good estimates. The case is the same with
the Harwood data set: we have 20% missing values and need quite good estimates for
them. The number of different missing value patterns is also quite important, since it
can increase the complexity of many gap filling algorithms. This was a major problem
in some of the attempts for the Griffin data set, as will be described in the next section.
Concluding, we can say that the missing value problem in this study was not trivial for
the Griffin data set, because of the size of the data set, the number of missing values
and the absence of a regular missing value pattern. The size of the data set directly
makes the problem computationally expensive, the number of missing values requires very
good estimates and probably an expensive method, whereas the absence of a regular
missing value pattern makes many gap-filling methods too complicated. On the other
hand, the missing value problem for the Harwood data set was much simpler: although a
big part of it had missing values, it was much smaller and did not have a very big
number of missing value patterns.
4.2 Methods for handling missing values
Many methods have been used for handling missing values. Each of them has its own
properties and therefore is suitable for some problems and unsuitable for others. Below
we will present some of the most widely used methods for handling missing values and
we will discuss their suitability to our problem.
4.2.1 List-wise deletion
In this method we simply discard the datapoints that contain missing values, keeping
only complete datapoints. This is a very commonly used method, but its main
disadvantage is that it largely biases the data set if a large part of the data is
missing: it is quite probable that the discarded datapoints contain information that
is not present in the remaining ones, which can lead to inaccurate models.
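The idea can be sketched in a few lines of Python with NumPy (the array is a toy example of ours; NaN marks a missing value):

```python
import numpy as np

# Toy data set: rows are datapoints, columns are variables; NaN marks a missing value.
data = np.array([[1.0, 2.0,    3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0,    9.0]])

# List-wise deletion: keep only the rows that contain no missing values.
complete_rows = ~np.isnan(data).any(axis=1)
cleaned = data[complete_rows]
```

Here only the first and third datapoints survive; the whole second datapoint is lost because of a single missing value, which illustrates how quickly information is discarded.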
In some sense we used this method when we discarded the largely empty parts of the
Griffin data set at the beginning and the end of the 6 year period, but for the reasons
mentioned in the previous chapter this was not an unreasonable decision. Applying it to
the reduced data set would not be possible, because there was not a single datapoint in
it with all values available; even if there were some, keeping only those would
probably be disastrous for the final data set. The method is also not applicable to the
Harwood data set: although there are datapoints with all 18 variables present, there
are not enough of them, and the remaining datapoints would probably not carry the
necessary details of the process we want to model.
4.2.2 Substitution with mean value
Here we simply substitute each missing value with the mean value of the specific
variable, computed from the values that are present; in the case of discrete variables,
we fill in the missing value with the most frequent value. This is also a very commonly
used method and it is very fast. Its main disadvantage is that it can introduce a large
bias to the data unless there are few missing values: the method reduces the variance
of the variables and therefore the real information they carry. In a data set with very
few missing values this could be a fast and sufficient solution, but not for our data
sets, which have 15% and 20% missing values.
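Mean substitution is a one-liner with NumPy (toy data of ours; note how the filled column means are unchanged but the variance shrinks):

```python
import numpy as np

# Toy data: NaN marks a missing value.
data = np.array([[1.0,    np.nan],
                 [3.0,    4.0],
                 [np.nan, 8.0]])

# Column means computed from the observed values only.
col_means = np.nanmean(data, axis=0)

# Substitute each missing value with the mean of its variable.
filled = np.where(np.isnan(data), col_means, data)
```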
There are a few variants of this method that give slightly better results. These
attempt to cluster the datapoints according to the variables that are present and fill
each missing value with the mean of the present values in its cluster. This is
considerably better than the simple approach, since it preserves a part of the variance of the variables
but it is not very suitable for the Griffin data set, where there is a very big
number of missing value patterns: it is very difficult to decide which variables to use
for clustering when there is no constant pattern of missing variables. We discuss
this problem below, in the context of the nearest neighbor method. On the
other hand, this modified version could be suitable for the Harwood data set.
4.2.3 Nearest neighbor method
This is another commonly used method for gap filling. When a missing value is
encountered, the algorithm searches the rest of the data for the most similar datapoint
that does not have a missing value for the variable in question and fills in the
missing value with the value of that datapoint. The similarity of two datapoints is
usually measured with the Euclidean distance, although other measures are sometimes
used. Calculation of the distance is based on the variables that are available for
both instances. If we have k features and x_i denotes the ith feature of datapoint x,
then the Euclidean distance between two datapoints x and y is given by:

d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}
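The masked distance and the fill step can be sketched in Python (a toy, O(N²) version written for illustration; the function and array names are ours, and NaN marks a missing value):

```python
import numpy as np

def masked_distance(x, y):
    """Euclidean distance over the variables observed in both datapoints."""
    both = ~np.isnan(x) & ~np.isnan(y)
    return np.sqrt(np.sum((x[both] - y[both]) ** 2))

def nn_fill(data):
    """Fill each missing value with the value of the nearest datapoint
    that has this variable present (illustrative, quadratic cost)."""
    filled = data.copy()
    for i, x in enumerate(data):
        for j in np.where(np.isnan(x))[0]:
            best, best_d = None, np.inf
            for k, y in enumerate(data):
                if k == i or np.isnan(y[j]):
                    continue  # a candidate must have variable j present
                d = masked_distance(x, y)
                if d < best_d:
                    best, best_d = y[j], d
            if best is not None:
                filled[i, j] = best
    return filled

example = np.array([[1.0, 2.0, np.nan],
                    [1.1, 2.1, 5.0],
                    [9.0, 9.0, 0.0]])
result = nn_fill(example)
```

In the toy example the first datapoint is far closer to the second than to the third, so its missing value is filled from the second datapoint.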
This method gives quite good results but is expensive for large data sets.
The approach was tried but could not be applied efficiently to the Griffin data
set because it does not have a regular missing value pattern. Most of the variables are
missing at some points, and it is hard to find a sufficiently big subset of the
datapoints that has the same present variables as the datapoint we want to fill and, at
the same time, has a value for the variable that is missing. Clearly, in most cases, if
this subset of datapoints is quite small, the filled value will be a poor estimate.
Suppose we want to fill the missing values of a datapoint, and denote by xp the
variables with present values and by xm the variables with missing values. We would
have to look for the closest neighbors of the datapoint that have not only xp present
but also some of xm (in the Griffin data set we cannot have all of them, because there
are no complete datapoints). We could consider finding the nearest neighbors based on a
subset of xp, so as to get a bigger subset of comparable datapoints, but it is still
very difficult to figure out which variables to include and which
not. Thus, this approach was abandoned for the Griffin data set when those problems
became apparent in practice. This would not be a very big problem for the Harwood
data set, but it was not tried there because of the short amount of time available: we
decided to use the regularized EM algorithm, which had already given good results for
the Griffin data set (section 4.3).
4.2.4 Regression methods
This is a class of very popular techniques in which we try to model the missing values
xm based on the observed values xp. Various regression methods are used for this
purpose, for example linear regression and artificial neural networks. Both linear
regression and artificial neural networks were tried on the Griffin data set, but they
failed for the same reason that the nearest neighbor approach failed. The difference
here is that we believed the flexibility of the models would allow us to obtain at
least roughly accurate estimates of the missing values; in particular, it was expected
that the neural network, given its large modelling capacity, would give quite good
results. The main disadvantage of these methods is that the model used for inferring
missing values is usually noise free and therefore underestimates the variance in the
data.
Again, it was difficult to find a good subset of the datapoints to use as training
examples for estimating the parameters of the models. In most cases the training
examples were too few, so it was hard to capture the real details of the relationships
between the variables and produce good estimates. With both kinds of models, the first
results contained some totally out of range estimates for the missing values, and this
approach was therefore also abandoned. Due to lack of time, it was not tried on the
Harwood data set, but we expect that it would perform better there, for the same reason
that the nearest neighbor approach would.
4.2.5 Other methods
Some other methods that have been used for dealing with the missing value problem
are: autoassociative neural networks, decision tree estimation and multiple imputation.
For more details see (Pyle, 1999), (Fujikawa, 2001).
4.3 The EM algorithm for gap filling
The EM algorithm (Little and Rubin, 1987) is a generic iterative maximum likelihood
parameter estimation algorithm. It has been used for training mixture models and
Bayesian networks, for estimating probability density function parameters in the
presence of missing values and, of course, for filling in missing data in incomplete
data sets. The algorithm differs slightly between these applications, but in all cases
it has the same basic two-step iterative structure: there is always an Expectation Step
(E-Step) and a Maximization Step (M-Step). We will describe these steps for the gap
filling algorithm specifically.
Finding the maximum likelihood parameters of the distribution of an incomplete data
set is closely related to filling in the missing values. If we assume a probability density
function for the data and we know the maximum likelihood estimates of the parameters
of the function we can fill in the missing values based only on the observed ones. The
filled in value is the conditional expectation value. This means that we fill the missing
values with the expected values according to the estimated distribution and the values
that are observed for each datapoint. Filling the missing values using the estimated
parameters of the probability density function with the conditional expectation values
is the E-Step of the EM algorithm. On the other hand, when we have missing values,
estimating the parameters by simply ignoring the missing values gives very poor, biased
results, because a large part of the information available in the data is ignored. For
a completed data set, though, obtaining parameter estimates is straightforward. This is
done in the M-Step of the algorithm: we get new estimates for
the parameters using the filled data set. We then use those new estimates again to fill
the missing values (we repeat the E-Step) and then again we reestimate the parame-
ters and so on. We stop when either the filled in values or the estimated parameters
don’t change much from one iteration to the next. This iterative procedure will give
good maximum likelihood estimates of the parameters and missing values. One iteration
would not be enough; we have to repeat the steps until the algorithm converges.
Summarising, the EM algorithm for gap filling consists of the following two steps:
1. Fill the missing values with their conditional expectation values based on the
observed values and using estimated distribution parameters (E-Step)
2. Reestimate the distribution parameters based on the filled in values (M-Step)
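The two-step structure can be sketched as a generic loop. For brevity, this toy version of ours assumes an independent Gaussian per variable, so the E-Step fills each missing value with the current estimate of its variable's mean; the full algorithm described below uses the conditional expectation given the observed variables instead:

```python
import numpy as np

def em_fill(data, max_iter=50, tol=1e-6):
    """Toy EM gap filler assuming each variable is an independent Gaussian."""
    mu = np.nanmean(data, axis=0)  # initial guess from the observed values
    for _ in range(max_iter):
        # E-Step: fill missing entries with their expected values (here, the means).
        filled = np.where(np.isnan(data), mu, data)
        # M-Step: re-estimate the parameters from the completed data set.
        new_mu = filled.mean(axis=0)
        if np.max(np.abs(new_mu - mu)) < tol:  # stop when the estimates settle
            break
        mu = new_mu
    return filled, mu
```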
It is important to note that if we had the maximum likelihood estimates of the
parameters of the distribution, it would be sufficient to compute the conditional
expectation values for the missing values of each datapoint once. Since we do not know
these estimates, and simply ignoring the missing values when computing them would give
biased results, we have to go through the iterative procedure of EM to obtain them.
The choice of the probability density function is probably quite arbitrary but usually
the Gaussian is used. This is reasonable for many cases. Of course, other probability
density functions can be used. In this study, the multivariate Gaussian was used:

p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)
The multivariate Gaussian has two parameters, the covariance matrix Σ and the mean
vector µ. If we have a complete data set, the maximum likelihood estimates of the
parameters of the distribution are:

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^T (x_i - \mu)

where N is the number of datapoints.
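For a complete data set these estimators are straightforward to compute; a sketch with NumPy, with datapoints as rows of a toy matrix:

```python
import numpy as np

# Complete toy data set: N datapoints as rows.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])
N = X.shape[0]

mu = X.mean(axis=0)                # maximum likelihood mean vector
centered = X - mu
Sigma = centered.T @ centered / N  # maximum likelihood covariance (1/N, biased)
```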
Having roughly examined the basic idea of the EM algorithm for gap filling and for
distribution parameter estimation with missing data, it is time to see in detail how
the steps of the EM algorithm work. We will examine the case of the Gaussian
distribution.
We start with an initial guess for the covariance matrix and the mean vector. Us-
ing these we go through every datapoint that has missing values and we compute the
conditional expectation values for the missing values. If for a specific datapoint the
present values are x_p and the missing values are x_m, then we look for the values that
maximize p(x_m \mid x_p, \mu, \Sigma). The estimates of the missing values are computed
from:

\hat{x}_m = \mu_m + (x_p - \mu_p) B

where \mu_m is the part of the mean vector corresponding to the variables that are
missing and \mu_p is the part corresponding to the variables that are present. B is the
matrix of estimated regression coefficients:

B = \Sigma_{pp}^{-1} \Sigma_{pm}
Using the formulas above we can fill in the missing values. The next step is to
re-estimate the covariance matrix and the mean vector. Re-estimating the mean vector is
straightforward: we just use the maximum likelihood estimator shown previously. For the
covariance matrix, though, we have to be more careful. If we simply recompute it with
the maximum likelihood estimator, we underestimate it. This is similar to the situation
we encountered when we discussed the regression based gap filling methods: the
regression function is virtually noise free, which can lead to bad estimates of the
real values and hence to underestimation of the covariance matrix.
For this reason, we need an estimate of the residual covariance matrix C:

C = \Sigma_{mm} - \Sigma_{mp} \Sigma_{pp}^{-1} \Sigma_{pm}

where \Sigma_{mm}, \Sigma_{mp}, \Sigma_{pm} and \Sigma_{pp} are the partitions of the
estimated covariance matrix consisting of the rows and columns corresponding to the
missing and present variables of the datapoint. This term has to be included in the new
estimate of the covariance matrix.
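With assumed parameter estimates, the conditional-mean fill and the residual covariance can be checked numerically (all numbers here are made up for illustration; np.ix_ selects the partitions of Σ):

```python
import numpy as np

# Assumed current parameter estimates for a 3-variable Gaussian (illustrative values).
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.4],
                  [0.8, 1.0, 0.2],
                  [0.4, 0.2, 1.5]])

# One datapoint with variable 2 missing; variables 0 and 1 are present.
p = np.array([0, 1])       # indices of present variables
m = np.array([2])          # indices of missing variables
x_p = np.array([1.0, 2.0])

# Regression coefficients B = Sigma_pp^{-1} Sigma_pm.
B = np.linalg.solve(Sigma[np.ix_(p, p)], Sigma[np.ix_(p, m)])

# Conditional expectation of the missing values: x_m = mu_m + (x_p - mu_p) B.
x_m = mu[m] + (x_p - mu[p]) @ B

# Residual covariance C = Sigma_mm - Sigma_mp Sigma_pp^{-1} Sigma_pm.
C = Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, p)] @ B
```

Note that C is smaller than the marginal variance of the missing variable, which is exactly the underestimation the correction term compensates for.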
When we get the new estimates of the covariance matrix and the mean vector, we repeat
the E-Step and then the M-Step again, until between two subsequent iterations neither
the parameters nor the filled-in values change much.
In this project, a variation of the simple EM algorithm for missing data was used. That
is the regularized EM algorithm. The only difference between this and the simple al-
gorithm is that here the regression coefficients are computed in a slightly different way.
More specifically, the regression coefficients are now:

B = \left(\Sigma_{pp} + h^2 \,\mathrm{Diag}(\Sigma_{pp})\right)^{-1} \Sigma_{pm}

That is, instead of using \Sigma_{pp}^{-1} we use a regularized version of the same matrix. This
method is known as ridge regression and can result in models with better generalization
performance. The regularization parameter h2 controls how smooth the function will be;
given that an appropriate value is found, we get better predictions and better
generalization. The regularization parameter is determined using generalized
cross-validation. More details about generalized cross validation and the regularized
EM algorithm can be found in (Schneider, 2001). Further information about the EM
algorithm and its use in gap filling problems can be found in (Little and Rubin, 1987).
For running the regularized EM algorithm, some Matlab functions written by the au-
thor of (Schneider, 2001) were used.
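The only change the regularization introduces is in how B is computed; a sketch of ours, where h2 is assumed fixed, whereas in the actual algorithm it is chosen by generalized cross-validation:

```python
import numpy as np

def regression_coefficients(Sigma_pp, Sigma_pm, h2=0.0):
    """Regression coefficients of the missing on the present variables.

    h2 = 0 gives the plain EM update B = Sigma_pp^{-1} Sigma_pm;
    h2 > 0 gives the ridge-regularized version of the regularized EM.
    """
    ridge = Sigma_pp + h2 * np.diag(np.diag(Sigma_pp))
    return np.linalg.solve(ridge, Sigma_pm)
```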
4.4 Suitability of the EM algorithm to our problem
As stated earlier, it is not easy to be sure about the results of a method for handling the
missing values. We have examined a few methods, we discussed some arguments for
and against each of them and we finally decided to use the regularized EM algorithm
that was described in the previous section. The EM algorithm as we have described it
works very well when not too large a part of the data is missing, the data is MAR
and it is reasonable to assume that the data follow a simple Gaussian distribution. In
our case the fraction of missing values is indeed not too big and the data is MAR, but
we are not really sure that the data can be adequately described by a Gaussian
distribution. Even so, if the data cannot be described very well by a Gaussian, we can
expect that the estimated values will not be totally irrelevant and that they will at
least maintain some part of the information about the relationships between the variables.
We can gain some confidence that EM is adequate for our problem if the basic statistics
of the filled-in variables are not drastically altered compared to those prior to gap
filling, no totally out of range values are introduced, and the behaviour of each
variable through time is not largely affected. For both of the data sets used in
Figure 4.1: Filled parts for the Fc and the mUstr variables of the Griffin data set.
The variables maintain a behaviour in the filled part similar to the one that is
observed in the present part.
this study, the filled data sets seemed to be quite consistent with the ones prior to
gap filling. The basic statistics were not largely affected and there are no out of
range values. The variables also seem to maintain a regular behaviour through time, as
can be seen in figure 4.1, where the continuous line shows the observed values of the
variable and the dotted line the filled values.
Similar things can be seen in Figure 4.2 for two variables of the Harwood data set. It
is worth noting, though, that in a few cases similar plots showed fillings that were
too smooth, or fillings where the variable had somewhat smaller variance than indicated
by the rest of the plot. In general, though, the variables have maintained some
regularity and no out of range values seem to appear, so we can consider that both data
sets have been filled in a quite consistent way.
Figure 4.2: Filled parts for the tFc and the tMeanU variables of the Harwood data set.
The variables maintain a behaviour in the filled part similar to the one that is observed
in the present part
4.5 Noise / smoothing
Having finished with the missing value problem, we will now discuss the problem of
noise. We argued earlier that the data contain noise, because they come from physical
measurements that are prone to it. Since we want to model the clean data generating
mechanism and not the noise, it is desirable to reduce the presence of noise.
A quick and easy method that seems to work well in most cases is moving averages.
Given a sequence of N points, the n-step moving average of this sequence is another
sequence of N − n + 1 points, each computed as the mean of n consecutive points in the
original sequence. If a_j denotes the jth point of the original series and s_i denotes
the ith point of the new series, then:

s_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j
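The n-step moving average is a one-liner with NumPy's convolution (a sketch; 'valid' mode returns exactly the N − n + 1 points of the definition above):

```python
import numpy as np

def moving_average(a, n):
    """n-step moving average: each output point is the mean of n consecutive inputs."""
    return np.convolve(a, np.ones(n) / n, mode='valid')
```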
This is obviously a very simple method, yet it can be very effective at filtering out
noise. The choice of n is a little arbitrary though: we want smoothed variables, but we
also want to maintain their general trend. If n is too small, noise is probably not
filtered out very well; if n is too big, the general trend of the variables is lost.
The only way to determine n is to experiment.
Let’s check this in practice for the case of the Harwood data set. In Figure 4.3 we can
see the variable tmvpd with no smoothing, 3 points, 5 points and 10 points averaging.
We could certainly say that the unsmoothed variable looks a little rough, and this
roughness could easily be interpreted as noise on the general shape of the variable,
which is what we would ideally like to maintain. The smoothed versions look much
better. Three-point averaging still looks a bit rough, five-point moving averages look
quite good, and ten-point moving averages look a little too smooth: some details of the
general shape of the variable seem to be lost. Similar results were obtained for the
other variables examined, in both data sets.
Unfortunately, quantifying the level of noise is very difficult, so determining n, or
choosing a specific method for filtering noise, is very hard. We cannot be sure about
possible loss of information; that would only be possible if we had a sample of the
clean data, so that we could separate the real trend from the noise.
Figure 4.3: Not smoothed, 3 points, 5 points and 10 points moving average for the first
300 points of the variable tmvpd of the Harwood data set
In any case, considering that we ultimately want to unveil the underlying process and
not to model noise, we will use smoothed data, mainly the five-point averaged data,
since this seems to preserve most of the essential information. On the other hand, it
is possible that the smoothing process not only filtered out noise but also caused real
loss of information, so we will also keep using the unsmoothed data for comparison.
Chapter 5
Dimensionality reduction
The main issue with the Griffin data set was its size, which made it very expensive to
process. It was not only the roughly 60000 datapoints (initially 100000) it contained,
but also its high dimensionality, the large number of variables it had. As described
earlier, there were originally 310 variables, of which it was finally decided to use
281: 280 input variables and 1 target variable. This made difficult not only the
missing value problem, as described in the previous chapter, but also further use of
the data set for modelling. The data set is very big for a sophisticated machine
learning algorithm: trying, for instance, to train an MLP with 60000 examples and 280
input variables would be very expensive, and running the model many times to optimize
its parameters, for example the number of hidden units, would not be feasible.
In addition, the complete set of 280 input variables would probably be a poor set of
predictors: if there are irrelevant features in it, it is quite possible that they will
add noise to the final prediction. The learning algorithm might find false
regularities, leading to poor models; this is an overfitting effect and is not
desirable. We also want models that are as simple as possible, so that they are easily
interpretable. For these reasons, some dimensionality reduction techniques had to be
used. A few methods were tried, under a few assumptions that will be described later,
and different subsets of features were obtained. For the Harwood data set the
motivation for dimensionality reduction was different: it was not mainly the cost of
the machine learning algorithm, but the need to find the best set of predictors.
In this chapter we will first briefly discuss the dimensionality reduction and feature
selection problem. Next we will examine the methods that were used in this study and,
finally, we will present the results obtained with them.
5.1 The dimensionality reduction problem
When dealing with high-dimensional data sets, some dimensionality reduction technique
usually has to be used. The reason is that a big data set can be very expensive to use
with a sophisticated modelling technique; in addition, it is quite probable that at
least some variables of the initial set are not relevant to the problem and should be
ignored.
Many methods have been used for reducing the number of features of high dimensional
data sets. These can be divided in two main categories: feature extraction and feature
selection methods. Feature extraction methods create new features from the existing
ones. Principal Components Analysis (PCA) and Locally Linear Embedding (LLE)
are examples of feature extraction methods. These methods make transformations and
recombinations of the original features to produce new ones. For example, PCA takes a
linear combination of the original variables. The number of the new features is smaller
than the original but hopefully they preserve the necessary information for subsequent
processing. The disadvantage of those methods is that the meaning of the original
features is usually lost, and models built from automatically extracted features are
hard to interpret. It is also worth noting that feature extraction methods are not
always able to cope with the problem of the relevance of features. In this study,
losing the meaning of the features is not desirable, because our original aim is to
gain insight into the carbon uptake process and how various factors affect it; with
extracted features it would be difficult to make a statement such as 'Vapour Pressure
Deficit has an important effect on the process'.
The other main category of dimensionality reduction methods, feature selection methods,
choose a subset of the available variables. The meaning of the variables is not lost,
so the resulting models can be interpreted more easily. We want to find the subset of
features with the best predictive accuracy for the model that will be used; by
predictive accuracy we mean the generalization accuracy, the performance of the model
on unseen data. As discussed earlier, the best set of features for predicting the
target variable is probably not the full set: the full set probably contains variables
irrelevant to the task, which add noise to the final prediction, and we need only the
variables that really affect the target variable. Choosing a subset of the variables is
not an easy task though. In a data set with n dimensions there are 2^n − 1 possible
subsets of features. Supposing that we have defined a measure of the predictive
accuracy of a subset of features, it is obvious that with a big number of features it
will be very difficult to give a score to every possible subset and then pick the best.
Therefore,
feature selection can be seen as a big search problem. We want to find the subset that
has the best generalization score but we can’t consider all the combinations. Thus, for
high dimensional data sets, where exhaustive search is not possible, various heuristic
search techniques are used.
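As an illustration of one such heuristic, greedy forward selection can be sketched as follows (a sketch of ours; the score function, some estimate of the predictive accuracy of a feature subset, is assumed to be given):

```python
def forward_selection(n_features, score):
    """Greedily add, at each step, the feature that most improves the subset score."""
    selected = []
    best_score = float('-inf')
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        new_score = score(selected + [f_best])
        if new_score <= best_score:
            break  # no remaining candidate improves the score: stop
        selected.append(f_best)
        best_score = new_score
    return selected
```

Instead of scoring all 2^n − 1 subsets, this examines at most n subsets per step, at the price of possibly missing the globally best subset.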
Independently of the search algorithm used, feature selection methods can be divided
in two groups. One is the wrapper approach and the other is the filter approach. In the
wrapper approach we evaluate each subset based on a direct estimate of its predictive
accuracy using the model that we ultimately intend to use. The estimate is usually
obtained with k-fold cross validation: we split the data set into k parts and then, for
each of the k parts in turn, train the model on the other k − 1 parts and test it on
the remaining part. This gives a good estimate of the real error of the tried model,
and it can help in avoiding overfitting and ultimately in detecting irrelevant
features. Therefore, a wrapper
algorithm takes into account this information and guided from a search procedure it
looks for the subset that has the best generalization error estimate. Apparently, this
can be very expensive, depending on the size of the data set and the complexity of
the model that is to be optimized. Wrapper methods are usually suitable for not too
big problems or when a relatively large computational cost is not very important. For
larger data sets, usually a filter algorithm is prefered. Filter algorithms for feature se-
lection choose subsets of features based on general statistics of the data and therefore
don’t take into account the model that will be actually used. Apparently, we can expect
better results from a wrapper algorithm, whereas a filter algorithm is cheaper but will
probably not give an optimal solution.
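The k-fold estimate just described is easy to sketch. The thesis work was done in Matlab; the following Python fragment is only an illustrative re-implementation, where `fit` and `predict` are placeholders for whatever model is being wrapped:

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=10, seed=0):
    """Estimate the generalization MSE of a model by k-fold cross validation.

    fit(Xtr, ytr) returns a fitted model; predict(model, Xte) returns
    predictions. Both are placeholders for the model being evaluated.
    """
    idx = np.random.RandomState(seed).permutation(len(y))
    folds = np.array_split(idx, k)          # split the data set into k parts
    errors = []
    for i in range(k):
        test = folds[i]                     # one part for testing ...
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])     # ... the other k-1 for training
        pred = predict(model, X[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))
```

A wrapper method would call this estimate once per candidate subset, which is what makes it expensive.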
More details about the subject of irrelevant features and the feature selection problem
Chapter 5. Dimensionality reduction 32
can be found in (John et al., 1994) and (Langley, 1994). In this study, some variants
of a semi-wrapper approach and a filter algorithm were tried.
5.2 The semi-wrapper methods
As we said, the wrapper approach can be quite expensive, depending on the size of the
data set and the complexity of the learning algorithm that will be used. Considering
that we initially had to deal with the Griffin data set and that we wanted to use
ANNs and SVR for modelling, the direct wrapper approach looked rather infeasible.
Therefore we decided to use a slightly different method: we use a small and cheap
model to estimate the predictive capacity of a subset of features, in the hope that the
cross-validation error of a simple model will indicate the suitability of a set of features
for modelling with a more sophisticated and expensive model. The cheaper model was
Multiple Linear Regression (MLR), multivariate least squares fitting, which can be
performed very fast (section 6.1.1). This could be considered a semi-wrapper approach,
since it uses actual model fitting but not the model that will ultimately be applied to
the data. One could argue that this is not a suitable solution to the problem, and in
some sense this is right: it is very likely that the best model is non-linear, and we
cannot know the nature of this non-linearity without exploring the model space to a
great extent. The semi-wrapper approach is not the optimal solution; that is clearly the
pure wrapper approach, but when we face computational challenges as big as the ones
posed by the Griffin data set, something feasible and reasonable for the problem has to
be found. This solution is in fact suggested in (Bishop, 1995) and turns out to be a
good idea. We can hope that the target variable depends on the best set of predictors
in an approximately linear way, so that a linear model gives a sufficiently good
approximation. Furthermore, by later applying the non-linear and more complex model
to the selected subset of features, we would expect an even better fit.
The difference between this approach and the pure wrapper approach lies only in the
way that we evaluate the subsets; the rest is the same, and we still need a search
procedure. Next we describe the search procedures that we used. These are simply
heuristic ways to avoid exhaustive search. The first three are greedy search algorithms,
whereas the last two are stochastic. The greedy methods may find a sub-optimal
solution, since they can stop at a local optimum of the search space; the stochastic
methods can avoid local optima, but they also cannot guarantee a sufficiently good
solution. All of them were implemented in Matlab, and in all of them 10-fold cross
validation with MLR was used to assess the generalization error. In what follows, it
will be handy to use binary strings to represent possible subsets, where a bit in the
kth position denotes the presence or absence of the kth feature in the subset that the
string represents.
5.2.1 Forward selection
Forward selection is one of the simplest and most commonly used search methods for
feature selection with the wrapper approach (and it is equally suitable for our semi-
wrapper approach). We start with an empty set of features and iteratively add the one
that most decreases the estimated generalization error. We can stop either when a
predetermined maximum number of features has been selected, or when no variable can
be added to the subset that improves it; alternatively, we could stop when the error
cannot decrease enough by adding any variable. In our experiments, having no prior
knowledge about how many features are really important predictors of the target
variable, we simply stopped when no feature could be added to the set of selected
features that decreased the estimate of the generalization error. Using the binary
string representation, the forward selection algorithm is described in Table 5.1.
The main disadvantage of forward selection is that it is unable to detect strong de-
pendencies between sets of variables. For example, it is possible that two variables are
highly predictive together even though neither is a good predictor alone; forward
selection will have difficulty detecting such a situation.
1. Initialize the current solution string to contain only zeros and set the current
solution error to a large number.
2. For every zero bit in the current solution string, generate a new string, turn-
ing that bit to one.
3. Calculate the cross-validation error for the subsets that the generated strings
represent.
4. If the termination criterion is satisfied then stop else update the current so-
lution string with the best of the generated strings, set the current solution
error to the estimate of the error of this string and go back to 2.
Table 5.1: The forward selection algorithm
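The loop of Table 5.1 might be sketched in Python as follows. This is only an illustrative re-implementation (the thesis code was in Matlab); the helper `cv_mse` stands in for the 10-fold cross-validation score, here computed with an ordinary least-squares linear model as in our semi-wrapper setting:

```python
import numpy as np

def cv_mse(X, y, k=10):
    """10-fold cross-validation MSE of an ordinary least-squares linear model."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        A = np.c_[np.ones(len(tr)), X[tr]]                  # bias column
        w = np.linalg.lstsq(A, y[tr], rcond=None)[0]
        errs.append(np.mean((np.c_[np.ones(len(te)), X[te]] @ w - y[te]) ** 2))
    return float(np.mean(errs))

def forward_selection(X, y):
    """Iteratively add the feature that most decreases the CV error."""
    selected, best_err = [], np.inf
    while True:
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        if not candidates:
            break
        scores = {f: cv_mse(X[:, selected + [f]], y) for f in candidates}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best_err:      # no addition improves the error
            break
        selected.append(f_best)
        best_err = scores[f_best]
    return selected, best_err
```

The termination test corresponds to the stopping rule we used: stop as soon as no single addition lowers the estimated generalization error.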
5.2.2 Backward elimination
This is very similar to forward selection but it works the opposite way. It starts with
the full set of features and successively removes the feature whose absence results in
the best performance. Again we stop when no improvement is possible or when we
reach a specific number of variables. The algorithm is described in Table 5.2.
1. Initialize the current solution string to contain only ones and set the current
solution error to a large number.
2. For every one bit in the current solution string, generate a new string from
the previous one, turning that bit to zero.
3. Calculate the cross-validation error for the subsets that the generated strings
represent.
4. If the termination criterion is satisfied then stop else update the current so-
lution string with the best of the generated strings, set the current solution
error to the estimate of the error of this string and go back to 2.
Table 5.2: The backward elimination algorithm
1. Initialize the current solution string (with zeros, ones or randomly), evaluate
it and set the current solution error to this.
2. For every bit in the current solution string, generate a new string, flipping
that bit.
3. Calculate the cross-validation error for the subsets that the generated strings
represent.
4. If the termination criterion is satisfied then stop else update the current so-
lution string with the best of the generated strings, set the current solution
error to the estimate of the error of this string and go back to 2.
Table 5.3: The best ascent hill climbing algorithm
Backward elimination does not suffer from forward selection's problem to the same
extent. It is, however, more expensive, because it starts from the full set of features
and therefore begins by examining models with a large number of input variables,
which are harder to train. Although this is very serious in the wrapper approach, in
the semi-wrapper approach it is not a big problem if the data set is not too large.
5.2.3 Best ascent hill climbing
This is a combination of the previous two. At every step it adds or removes the attribute
whose addition or removal results in the smallest error. It is more flexible and searches
the space of solutions a little better than the previous two, which are very strict. The
initial point of the search can be a random subset of the variables, the empty set (so
that we can compare it with forward selection) or the full set (so that we can compare
it with backward elimination). The termination criteria can be the same as for forward
selection or backward elimination. The algorithm is presented in Table 5.3.
This is a little more expensive than forward selection but not necessarily as expensive
as backward elimination. It is also worth noting that in highly complex error spaces,
when we start from different random initial strings, we may get different solutions.
This doesn’t happen with forward selection or backward elimination because the initial
point is always the same. This shows how difficult it really is to do optimal feature
selection in some cases.
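The best ascent procedure of Table 5.3 can be sketched compactly. Again this is illustrative Python rather than the Matlab code actually used; `evaluate` is any function returning the cross-validation error of the subset encoded by a boolean mask:

```python
import numpy as np

def hill_climb(evaluate, d, start=None, seed=0):
    """Best ascent hill climbing over binary feature strings.

    evaluate(mask) returns the error of the subset encoded by the boolean
    mask. At each step the single bit flip that lowers the error most is
    taken; the search stops at a local optimum.
    """
    rng = np.random.RandomState(seed)
    mask = (rng.rand(d) < 0.5) if start is None else np.asarray(start, bool)
    best = evaluate(mask)
    while True:
        flip, improved = None, False
        for bit in range(d):                 # try flipping every bit
            trial = mask.copy()
            trial[bit] = ~trial[bit]
            err = evaluate(trial)
            if err < best:
                best, flip, improved = err, bit, True
        if not improved:                     # local optimum reached
            return mask, best
        mask[flip] = ~mask[flip]             # take the best single flip
```

Starting from the all-zeros or all-ones mask reproduces the forward-selection-like and backward-elimination-like variants mentioned above.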
5.2.4 Genetic search
Genetic algorithms are a population-based technique with the ability to search ef-
ficiently in very large spaces. Considering that feature selection is basically a big
search problem, genetic algorithms looked like an interesting approach. They maintain
a population of candidate solutions that evolve through crossover and mutation,
imitating natural evolution. There are many varieties of genetic algorithms and we will
not go into explicit detail on how they work; we will just discuss their application to
our specific problem, from which the basic idea should become clear. The outline of
the genetic algorithm used in this study can be seen in Table 5.4.
The basic structure should be quite clear. The population evolves using mutation and
crossover, while the selection process reinforces good chromosomes. Hopefully in the
end the population contains a quite good solution.
We should explain some of our choices for the genetic algorithm. First of all, we
limited the length of the segment exchanged between the binary chromosomes. This
was done to avoid the potentially large disruptive effect of crossover: good combi-
nations of features exist that can be recombined into even better sets, and without a
limit on the length of the exchanged segment, crossover would probably destroy such
combinations. With the limit, we can hope that disruption will be less frequent while
the exchange of useful chromosome bits remains possible.
Another important issue is the choice of fitness function and the selection mechanism.
Here as the fitness function we used just the Mean Squared Error (MSE) obtained
from k-fold cross validation. The selection mechanism was tournament selection. This
results in maintaining the variety in the population. If some other selection mechanism
was used, it would be possible to have much less variety in the chromosomes and this
would reduce the explorative effect of crossover. On the other hand, we believe that
1. Generate a random population of binary strings of size n.
2. Evaluate the population, calculate the generalization error of the chromo-
somes (fitness function).
3. Pass the best k of them in the next population immediately.
4. Generate an intermediate population of size n − k by selecting strings based
on their generalization error, using tournament selection.
5. Apply two point crossover to the strings of the intermediate population: ex-
change parts from pairs of chromosomes. The length of the exchanged part
is limited.
6. Apply mutation to the strings of the intermediate population: randomly flip
some of the bits of the intermediate population.
7. Pass the strings of the intermediate population to the next population.
8. If the maximum number of iterations has been reached then terminate, else
go to step 2
Table 5.4: A genetic algorithm for feature selection
there are a few very good combinations of features, and we should therefore actively
reinforce solutions close to them. For this reason we use elitism, i.e. we always keep
the best solutions from generation to generation, which helps in that direction. Given
that we want a reasonable balance between exploitation of good solutions already
found and exploration of new ones, this looks like a quite reasonable decision. Of
course, we could opt for more exploitation (possibly using a more aggressive selection
mechanism), on the grounds that there are not many good combinations of features
and we should mainly explore solutions close to them. The ideal appears to be a
balance between searching close to good solutions already found and searching for
new ones; this is probably worth experimenting with.
For a better description of basic concepts mentioned here see (Mitchell, 1996).
The genetic algorithm approach is clearly more expensive than the previous methods
but it can avoid local minima and therefore can give solutions that are very hard to find
with a greedy method.
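The outline of Table 5.4 could be sketched as follows. This is an illustrative Python re-implementation, not the Matlab code used for the experiments; the default parameter values (population size, tournament size, segment length, mutation rate) are examples rather than the exact settings used:

```python
import numpy as np

def genetic_search(evaluate, d, n=40, elite=2, gens=50, seg_max=4,
                   p_mut=0.02, tour=3, seed=0):
    """Genetic algorithm over binary feature strings (minimises evaluate)."""
    rng = np.random.RandomState(seed)
    pop = rng.rand(n, d) < 0.5                       # random initial population
    for _ in range(gens):
        errs = np.array([evaluate(m) for m in pop])
        order = np.argsort(errs)
        nxt = [pop[i].copy() for i in order[:elite]] # elitism: keep the best
        while len(nxt) < n:
            parents = []
            for _ in range(2):                       # tournament selection
                rivals = rng.choice(n, tour, replace=False)
                parents.append(pop[rivals[np.argmin(errs[rivals])]].copy())
            a, b = parents
            start = rng.randint(d)                   # limited-length two-point
            end = min(start + rng.randint(1, seg_max + 1), d)
            a[start:end], b[start:end] = b[start:end].copy(), a[start:end].copy()
            for child in (a, b):
                flips = rng.rand(d) < p_mut          # bit-flip mutation
                child[flips] = ~child[flips]
                if len(nxt) < n:
                    nxt.append(child)
        pop = np.array(nxt)
    errs = np.array([evaluate(m) for m in pop])
    i = int(np.argmin(errs))
    return pop[i], float(errs[i])
```

The capped `seg_max` implements the limited-length two-point crossover discussed above, and the `elite` copies implement elitism.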
There are also a couple of different approaches to using genetic algorithms for feature
selection that were not tried here. In one of them, the problem is treated as a dual-
criteria optimization problem: one criterion is the fit of the model to the data,
measured by the explained variance, and the other is the size of the model (number of
variables). It uses a (µ, λ) population management method, which will also be described
in the next section: when choosing the µ-sized population it uses one criterion, and
when it chooses the λ-sized population it uses the other. More details can be found in
(Wallet et al., 1996).
Another approach can be found in (Yang and Honavar, 1999). This is quite similar to
our method, but it has a different fitness function, uses a different selection method
and does not use limited-length two-point crossover. Its fitness function involves an
estimate of the error, obtained using a fast neural network.
5.2.5 Evolution strategies
Evolution strategies are closely related to genetic algorithms; both are evolutionary,
population-based techniques. The difference is that evolution strategies do not use
crossover. The basic algorithm
1. Generate a random population of binary strings of size λ.
2. Evaluate population chromosomes, calculate their generalization error.
3. Shrink the population, keeping only the µ fittest chromosomes.
4. Expand the population, creating random mutations of chromosomes chosen from
the µ-sized population based on their fitness, to obtain a new λ-sized
population.
5. If the maximum number of iterations has been reached then terminate else go to
step 2.
Table 5.5: Evolutionary strategy for feature selection
is presented in Table 5.5.
Something important to note here is that we select the chromosomes of the µ popu-
lation from the whole λ population, not only from the children of the previous µ
population. This is called (µ + λ) selection and has an effect similar to that of
elitism in genetic algorithms. Considering what we said earlier, when discussing
genetic algorithms, about exploiting good solutions more, this looks like a reasonable
choice. The alternative would be (µ, λ) selection, in which the members of the µ
population are selected only from the children of the λ population.
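A minimal sketch of the (µ + λ) strategy of Table 5.5, again as illustrative Python rather than the Matlab implementation, with example parameter values; for simplicity this sketch draws parents uniformly from the µ fittest:

```python
import numpy as np

def evolution_strategy(evaluate, d, mu=10, lam=40, gens=50, p_mut=0.05, seed=0):
    """A (mu + lambda) evolution strategy on binary strings (no crossover)."""
    rng = np.random.RandomState(seed)
    pop = rng.rand(lam, d) < 0.5
    for _ in range(gens):
        errs = np.array([evaluate(m) for m in pop])
        parents = pop[np.argsort(errs)[:mu]]     # shrink: keep the mu fittest
        children = []
        while len(children) < lam - mu:
            child = parents[rng.randint(mu)].copy()
            flips = rng.rand(d) < p_mut          # expand by random bit flips
            child[flips] = ~child[flips]
            children.append(child)
        # (mu + lambda): parents survive alongside their children
        pop = np.vstack([parents, np.array(children)])
    errs = np.array([evaluate(m) for m in pop])
    i = int(np.argmin(errs))
    return pop[i], float(errs[i])
```

Because the µ parents are carried over into the next population, the best solution found so far can never be lost, which is the elitism-like effect noted above.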
Evolution strategies could be a suitable method for our problem. Given the fact that we
expect some good combinations of features, it would be reasonable to try random ex-
ploration around such combinations to find the real optimum. The cost of this method
is more or less the same as the cost of the genetic algorithm.
5.3 The filter method
The difference between the wrapper methods (or the semi-wrapper methods described in
the previous section) and filter methods is that wrapper methods evaluate a subset of
features using actual model fitting, whereas filters use only statistical information
about the features. Many different filter methods exist, based on information-theoretic
criteria, variance criteria, correlation criteria and much more. We tried only one
filter method, Correlation-based feature selection, which is described next.
5.3.1 Correlation-based feature selection
Correlation-based feature selection (CFS) is described in (Hall, 1999). The CFS
algorithm looks for a subset of features that are highly correlated with the target but
not highly correlated with each other. In other words, it looks for a set of features
that carry a lot of information about the value of the target but little redundant
information. For this, a basic heuristic measure to be maximized has been defined:

Merit_s = (k * r_tf) / sqrt(k + k(k − 1) * r_ff)

where Merit_s is the desirability of the subset of features s, k is the number of
features in s, r_tf is the average correlation between the subset's features and the
target variable, and r_ff is the average correlation between the subset's features
themselves. This measure increases for subsets whose features have high average
correlation with the target and low correlation with each other. For continuous
attributes, like the ones we have here, the correlation between features is computed
using Pearson's correlation:

r_ij = C(i, j) / sqrt(C(i, i) * C(j, j))

where r_ij is the correlation between features i and j, and C(i, j), C(i, i) and
C(j, j) are elements of the covariance matrix of the data: C(i, j) is the covariance of
features i and j, while C(i, i) and C(j, j) are the variances of features i and j
respectively.
In a high-dimensional data set, an evaluation of all possible subsets is still not
possible, although it now involves much simpler and cheaper calculations than the
wrapper or semi-wrapper approaches. Therefore, here too we can use any search
mechanism we think is suitable. In (Hall, 1999) best first search is proposed, but
other search strategies would be possible.
It is also worth noting that the Merit_s heuristic is very aggressive: it tends to
select only a few features. For this reason, r_ff is usually multiplied by a number
between 0 and 1. This discount factor makes the algorithm less aggressive and forces
it to select more variables.
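The merit computation, including the discount factor, can be sketched as follows (illustrative Python; the search over subsets that surrounds it is omitted):

```python
import numpy as np

def cfs_merit(X, y, subset, discount=1.0):
    """CFS merit of a feature subset: reward high feature-target correlation,
    penalise high feature-feature correlation."""
    k = len(subset)
    r_tf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for a, i in enumerate(subset) for j in subset[a + 1:]])
    else:
        r_ff = 0.0
    r_ff *= discount          # the discount factor softens the redundancy term
    return k * r_tf / np.sqrt(k + k * (k - 1) * r_ff)
```

A pair of independent informative features scores higher than a pair containing a redundant duplicate, which is exactly the behaviour the heuristic is designed to produce.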
This is clearly not the best possible solution: we argued before that filter methods
are not as good as wrapper (or semi-wrapper) methods, but they are suitable for high-
dimensional data sets. Consequently, this method was quite useful for the Griffin data
set.
5.4 Results of the feature selection methods
Here we discuss the results and behavior of the previously described feature selection
methods, for both the Griffin data set and the Harwood data set.
5.4.1 Feature selection results for the Griffin data set
Selecting the best subset of features in the Griffin data set was initially considered
a very big challenge. Most of the previously described algorithms were motivated by
the dimensionality of the Griffin data set and its huge number of possible subsets of
features. Unfortunately, it was at that point that the spurious variables in the data
set were discovered. Therefore, we will only briefly describe the general behavior of
the feature selection algorithms on the Griffin data set. We remind the reader that
the initial number of input variables was 280 and that the error is measured using
10-fold cross validation and a linear model.
Forward selection chose a subset of 95 variables, with a final MSE of only 0.053. The
error dropped fast with the addition of the first few variables and then decreased
very little (Figure 5.1). With the addition of the first variable alone, the error was
already only 0.1590. That variable was m PPFDa P13 L1, one of the spurious variables;
when plotted together with the target variable, the two were found to be almost
identical. Forward selection, even though we were using the semi-wrapper approach,
was quite expensive. We got the first few features quite fast, but the algorithm
terminated after approximately one day. The first few variables were added after trying very
Figure 5.1: MSE as forward selection adds variables for the Griffin data
simple and inexpensive models, but as more variables were added, the complexity
increased. If we had fixed the number of features to select to a small number in
advance, the cost would not have been that big.
Backward elimination was not possible for the Griffin data set. It was left running
for about twenty days and still did not finish. This is because it starts searching from
very big models (279 variables) and 10-fold cross validation is very expensive for such
models.
Best ascent hill climbing gave results quite similar to forward selection, just a
little better, and it also chose slightly smaller subsets. It was tried starting either
from a random subset of features or from the empty subset. In both cases, the same
spurious variable was chosen first (when, in the case of a random initial subset, it
was not already included). Hill climbing was also quite expensive, running for a
couple of days.
With the genetic algorithm things were a little different. It usually also found one
of the spurious variables quite fast, showing that it could obtain a sufficiently good
solution quickly. Here it became apparent that not just one but at least three
variables were almost duplicates of the target variable. It was observed that when any
of those three variables happened to be present in one of the chromosomes, further
significant improvement was very difficult and the error curve decreased very slowly
after that. Depending on the number of generations (50 in most experiments), the
number of chromosomes (usually 40) and the initial population, it could take from
about 6 hours to a couple of days.
Evolution strategies behaved quite similarly to the genetic algorithm. Again, depend-
ing on the number of generations and size of populations, it could take quite a while.
Finally, CFS selected from 1 to 7 variables, depending on the discount factor.
Contrary to the previous methods, it was not very time consuming, taking about 5
minutes to run. Without the discount factor it selected only the same m PPFDa P13 L1
variable; when we decreased the discount factor, that variable was always among those
selected.
5.4.2 Feature selection results for the Harwood data set
The feature selection problem in the Harwood data set was quite different: the search
space was much smaller, since here we had only 17 candidate input variables, and the
algorithms therefore behaved differently. We also have to note that we tried the
feature selection algorithms on data with different smoothings, so we will discuss
both cases.
Something very remarkable happened for the Harwood data set: all the algorithms
selected the same subset of features for the smoothed data, and likewise all selected
the same subset for the unsmoothed data (the two subsets themselves differ). This
makes clear that the search space is not that big or complex, so that many different
algorithms find the same solution. Let us check some details of the different
algorithms.
First of all, forward selection stopped after selecting 14 variables for the
unsmoothed data and 15 variables for the 5-point averaged data. The error curves for
these cases can be seen in Figure 5.2.
It is clear that the smoothed data can be modelled much better than the unsmoothed
data: the unsmoothed data reaches an MSE of 10.7976 and the smoothed data an MSE of
6.7951. This holds in all of our subsequent experiments as well, which is quite
reasonable, since the smoothed variable is much easier to model. Forward selection was
much faster here than on the Griffin data set, taking only a few minutes. It could be
even faster if we stopped after the first few variables, having decided that we want
only a specific number of variables.
Figure 5.2: MSE as forward selection adds variables for the not smoothed and 5 point
smoothed Harwood data
Backward elimination was possible here; the cost was not that big, and it took less
than 10 minutes. As we said, the results for both the smoothed and the unsmoothed
data were the same as for forward selection.
Hill climbing was tried a few times from random initial subsets of features, and in
all cases it again found the same solution. This suggests that the error space
probably has only one local minimum, which is the one that all the algorithms found.
The genetic algorithm was, of course, also much faster than on the Griffin data. It
was left running for about 50 generations, but after about 10-15 it had already found
the same solution as the other algorithms.
Evolution strategies worked like the genetic algorithm as well. It was left running for
50 generations but it had found the same solution after about 20 generations.
Finally, CFS behaved very differently from the previous methods. On both smoothed and
unsmoothed data (smoothing does not make a big difference to the correlation matrix
anyway), and with the discount factor ranging from 0.01 to 1, it always selected only
one variable: trh (relative humidity).
Apparently, in this data set there was only one local minimum of the cross-validation
error over the different subsets of features, and that was the global minimum. Since
the search space behaves so well, the use of so many different algorithms does not
make much sense; of course, we could not have known that from the beginning. In a
more complex space with many stationary points, however, as we expected for the
Griffin data set, the use of many different algorithms does make sense. It is possible
that there is a clearly optimal point that cannot easily be reached by some of the
algorithms. In our problem, we could expect a subset of features with significantly
better performance than the others; this subset will not necessarily be reachable for
some of the algorithms, but it is worth finding if it is sufficiently better than the
ones the other algorithms find.
5.5 Discussion about the results of feature selection
There is no point in discussing the results for the Griffin data set further. On the
other hand, since we will use the Harwood data set for further modelling, we have to
decide which variables we will finally use. Clearly, we would prefer a smaller number
of features: the selection procedure discarded only 2 and 3 features for the smoothed
and unsmoothed data respectively. Looking at Figure 5.2, we see that the improvement
from the first features to the last is not that big, so we could discard many of them.
It was decided to use the first 4 features; considering that a lot of time would be
spent adjusting the parameters of the models, this looked like a reasonable decision.
The four variables for the unsmoothed data were:
1. tPARin (incoming radiation)
2. trh (relative humidity)
3. tLEc (latent heat flux)
4. tTs5 (soil temperature at 5 cm depth)
For the smoothed data the variables were:
1. tPARin (incoming radiation)
2. tLEc (latent heat flux)
3. tMeanQ (absolute humidity)
4. twdir (wind direction)
Chapter 6
Modelling
Having finished preprocessing, we will now continue with model fitting. Here we will
only use the Harwood data set, since modelling could not be carried out with the
spurious variables: we would be trying to predict a value using almost the same value
as input.
Two popular machine learning techniques have been used: Support Vector Regression
(SVR) and Multilayer Perceptrons (MLPs) that are a kind of Artificial Neural Net-
works (ANNs). Both are supervised learning techniques. That is, they are given some
examples of inputs and the corresponding outputs and are expected to learn the rela-
tionship between them, predicting the correct output for previously unseen input. Both
methods’ flexibility to model non-linear data has made them quite popular and they
have been applied to a huge variety of applications.
A library of machine learning routines written in C++, Torch 3 (Collobert et al.,
2002), available at http://www.torch.ch/, was used for both MLP and SVR modelling.
This was very handy, since much of the functionality needed was already implemented
and we just had to put the necessary routines together.
In this chapter we will first introduce MLPs and SVR, along with MLR, which was used
for feature selection and also appears in the results for comparison. Next we will
talk about modelling and experimental design, and finally we will present the results
obtained.
Chapter 6. Modelling 47
Figure 6.1: The least squares solution given a set of observations (xi, yi).
6.1 Modelling techniques
6.1.1 Multiple Linear Regression
Multiple Linear Regression (MLR) is the well known least squares fitting. The solution
of MLR is a simple line that best interpolates the data; this can be seen for one-
dimensional input in Figure 6.1. If we have a d-dimensional input x, the output of MLR
is:

y(x) = Σ_{k=1..d} w_k x_k + b    (6.1)

But how is best interpolation defined, and how do we compute the coefficients w_k and
b? The best line is the one that minimizes the Sum Squared Error (SSE):

SSE = Σ_{i=1..n} (y_i − y(x_i))^2    (6.2)

The solution for the w_k and b can be found by solving a (d + 1) × (d + 1) linear
system. It is convenient to express everything in matrix notation. We define the
vector w = (b, w_1, ..., w_d)^T and the augmented input vector x = (1, x_1, ..., x_d)^T.
Then (6.1) becomes:

y(x) = w^T x    (6.3)

Now, defining X as the matrix that has the augmented input vectors x^T as its rows and
Y as the vector of the corresponding target values, the SSE (6.2) is computed as:

SSE = (Y − Xw)^T (Y − Xw)    (6.4)

Since we want to minimise this, we differentiate with respect to w and equate to zero.
Differentiation gives:

X^T Y = X^T X w    (6.5)

The solution is:

w = (X^T X)^{-1} X^T Y    (6.6)

Instead of computing the matrix inverse in (6.6), we usually solve the linear system
that (6.5) represents. Once we have w, we have found the best interpolating line.
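The normal equations can be solved directly; an illustrative Python sketch (not the Matlab/Torch code used in the thesis):

```python
import numpy as np

def mlr_fit(X, y):
    """Fit MLR by solving the normal equations X^T X w = X^T y,
    where w = (b, w1, ..., wd)."""
    A = np.c_[np.ones(len(y)), X]              # augment each input with a 1
    return np.linalg.solve(A.T @ A, A.T @ y)   # solve, rather than invert

def mlr_predict(w, X):
    return np.c_[np.ones(len(X)), X] @ w
```

Solving the linear system is numerically preferable to forming the explicit inverse, which is exactly the point made above.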
6.1.2 Multilayer Perceptrons
Multilayer Perceptrons (MLPs) belong to the family of Artificial Neural Networks
(ANNs). ANNs are inspired from natural neural systems where information processing
occurs in stages: the result of one stage is introduced to the next and so on.
The basic computational unit of an ANN is the neuron, and the simplest model of a
neuron is the perceptron. A perceptron computes a weighted sum of its inputs and
applies a non-linear function to it:

y = g( Σ_{j=1..n} w_j x_j + b )

where the w_j are the weights of the inputs to the neuron, the x_j are the inputs and
b is a bias parameter. The non-linear function g is called the transfer function; some
usual choices for it are the sigmoid, the hyperbolic tangent and the Gaussian.
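A single perceptron is essentially a one-liner; an illustrative sketch with a sigmoid transfer function:

```python
import numpy as np

def perceptron(x, w, b):
    """Output of a single perceptron with a sigmoid transfer function:
    y = g(w . x + b) with g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```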
A simple perceptron is not very powerful. Its limitations are well known in the lit-
erature (Minsky and Papert, 1969): for classification it can model only linearly
separable problems, and similar inflexibilities appear for regression. The separability
problem is illustrated in Figure 6.2. The decision boundary of a perceptron used for
classification is just a straight line for two-dimensional input, or an (n − 1)-
dimensional hyperplane for n-dimensional input. The problem on the left is linearly
separable, but in the problem on the right there is no line that can separate the two
classes. Similarly, in Figure 6.3 we can see that the output of a neuron varies only
in one direction, showing that a simple perceptron cannot do much for complicated
regression
Figure 6.2: A linearly separable problem (left) and a non-linearly separable problem
(right)
Figure 6.3: The output of a neuron with a sigmoid transfer function.
problems either. This neuron uses a sigmoid transfer function, but similar things happen
for the other transfer functions as well. The problem lies in the linear combination of
the inputs.
The solution is to use the idea of sequential processing described earlier. If we add
hidden layers of neurons to the network we can represent any continuous function. A
MLP is exactly that, a sequence of layers of neurons. It is a fully connected feedfor-
ward ANN, i.e. all neurons in one layer are fully connected with all neurons in the
next layer and connectivity flows only in one direction. Various other connectivities
and architectures of ANNs exist but MLPs are probably the most popular. A simple
MLP with one input layer, one hidden layer and one output layer is represented graphically
in Figure 6.4.

Figure 6.4: A MLP with one input layer, one hidden layer and one output layer

The input layer has k units, the hidden layer has l and the output layer
has m. Each box is a neuron with the functionality described.
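The forward pass of such a k–l–m network can be sketched as follows (NumPy; the layer sizes, random weights and sigmoid hidden layer are illustrative assumptions, not the networks actually trained in this study):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: each layer computes g(Wx + b) and feeds the next."""
    h = sigmoid(W1 @ x + b1)   # hidden layer, l units
    return W2 @ h + b2         # linear output layer, m units (regression)

k, l, m = 4, 3, 1
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(l, k)), np.zeros(l)
W2, b2 = rng.normal(size=(m, l)), np.zeros(m)
y = mlp_forward(rng.normal(size=k), W1, b1, W2, b2)  # shape (m,)
```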
But how can we learn the weights and the biases of the network? Given that we want
to minimize the error, we can learn the parameters using a non-linear optimization al-
gorithm like gradient descent. This is usually implemented with a method called Back
Propagation (BP). We will not cover it thoroughly here; in short, BP propagates the
training error from the output back to the previous layers and adjusts the weights
accordingly. Details can be found in (Bishop, 1995).
An important issue with MLPs is that we have to decide about their architecture: the
number of hidden layers and the number of units. Any function that can be modelled
with a network with any number of hidden layers can be modelled from a network with
one hidden layer and enough units in it. Therefore, we decided to use only one hidden
layer. The important question, though, is the number of units in the hidden layer. If we
don’t have enough hidden units, the network will not have the capacity to model the
data well. On the other hand, if we have too many hidden units, the network will overfit
the data and we will not have good generalization performance.
Another issue related to overfitting is when we stop training. If we leave BP
or gradient descent running even when the training error is very low, then the MLP will
overfit the data. If we stop training early, then it will not capture the real relationship
between the input and target variables. More details about overfitting in
neural networks and the issues just described can be found in (Lawrence et al.,
1996). Consequently, although MLPs are a very powerful class of models, their very
high capacity for non-linear modelling is counterbalanced by the fact that
they are difficult to train.
6.1.3 Support Vector Regression
Support Vector Machines (SVMs) are a relatively new technique. SVMs emerged from
the work of Vapnik on computational learning theory (Vapnik, 1995), (Vapnik, 1998).
Below we briefly introduce the basic ideas of SVMs for classification, and after that
we talk about Support Vector Regression (SVR). A more detailed description of the
ideas presented here can be found in (Burges, 1998), (Christianini and Shawe-Taylor,
2000), (Cortes and Vapnik, 1995) and (Smola and Schölkopf, 1998).
The first main concept of SVMs is maximum separability. When we talked earlier
about perceptrons, we introduced the idea of linear separability (Figure 6.2). The per-
ceptron chooses for the decision boundary any separating hyperplane. On the other
hand, the SVM solution is unique: it is the hyperplane that results in the maximum
separability between classes (Figure 6.5(a)). Intuitively this looks like the safest choice
for the decision boundary of classification.
Let t_i denote the target value, which will be +1 for the one class and −1 for the other.
The separating hyperplane is w · x + b = 0, chosen so that w · x + b > 0 for data points
belonging to class +1 and w · x + b < 0 for data points belonging to class −1. Let d₊
be the distance between the separating hyperplane and the closest point belonging to
class +1, and d₋ the distance between the separating hyperplane and the closest point
belonging to class −1. The margin of the solution is the minimum of d₊ and d₋. This
minimum is largest when d₊ = d₋, in which case margin = d₊ = d₋.
Now, let’s define the hyperplanes H₊ and H₋ that are parallel to the separating hy-
perplane and lie at a distance equal to the margin from it. At least one data point from
class +1 lies on H₊ and one data point from class −1 lies on H₋. The distance between
H₊ and H₋ is twice the margin. If x₊ is a data point lying on H₊ and x₋ is a data point
lying on H₋, then the distance between H₊ and the origin of the axes is w · x₊ / ‖w‖
and the distance between H₋ and the origin of the axes is
Figure 6.5: (a) Some separating hyperplanes and the maximum margin one (b) The
maximum margin hyperplanes H, H₊ and H₋

w · x₋ / ‖w‖. Therefore, the distance between H₊ and H₋ is (w · x₊ − w · x₋) / ‖w‖,
and what we get is:

2 · margin = (w · x₊ − w · x₋) / ‖w‖ (6.7)
Furthermore, we can see that we can scale w and b and still get essentially the same
separating hyperplane, since cw · x + cb = 0 defines the same plane. We can therefore
choose w and b so that for all data points:

w · x_i + b ≥ +1 for t_i = +1 (6.8)
w · x_i + b ≤ −1 for t_i = −1 (6.9)

It is clear that for x₊ (6.8) holds with equality and for x₋ (6.9) holds with equality,
that is:

w · x₊ + b = +1 (6.10)
w · x₋ + b = −1 (6.11)
If we subtract the second from the first, we get:

w · x₊ − w · x₋ = 2 (6.12)

If we substitute the left hand side of this into (6.7) we get:

margin = 1 / ‖w‖ (6.13)

We now have an expression for the margin. Given that we want to maximize the
margin, the problem now is to minimize ‖w‖ subject to constraints (6.8) and (6.9). The
solution to this problem is given by a quadratic programming problem using Lagrange
multipliers and has the form:

w = ∑_{i=1}^{n} a_i t_i x_i (6.14)

In this formula, the a_i are coefficients that determine the contribution of each vector to
the solution. For most of the data points, a_i is zero. Only the data points that lie on H₊
and H₋ have non-zero a_i. Those data points are the support vectors. Having found the
hyperplane with the maximum margin we can classify a new example x with:
t_x = +1 if ∑_{i=1}^{n} a_i t_i (x_i · x) + b > 0 (6.15)

t_x = −1 if ∑_{i=1}^{n} a_i t_i (x_i · x) + b < 0 (6.16)
But what happens when the data is not linearly separable? Then a positive slack vari-
able ξ_i is introduced for every data point and instead of (6.8) and (6.9) we now have
the following constraints:

w · x_i + b ≥ +1 − ξ_i for t_i = +1 (6.17)
w · x_i + b ≤ −1 + ξ_i for t_i = −1 (6.18)

Doing this, we allow violation of the hard margin that the previous constraints implied.
We now have a soft margin that is allowed to break, but with a penalty equal to ξ_i. We
want to minimize both ‖w‖² and the ξ_i. The new function to be optimized is:

J = ‖w‖² + C ∑_{i=1}^{n} ξ_i (6.19)
Figure 6.6: Transformation from the input space to the feature space where the problem
is linearly separable
C is a parameter that determines how important the minimization of ‖w‖² is versus the
minimization of the slack variables. If we have a large C we get a higher penalty for
errors and we enforce more accurate solutions. The form of the solution is again the
same (equation 6.14).
The second ingredient of SVMs is the kernel trick. It is well known that in a very
high-dimensional space, data points are more likely to be linearly separable. It would
therefore be handy to be able to map the input data to a higher dimensional space
where it is linearly separable. This higher dimensional space is called the feature
space and the transformation from the input space to the feature space where the data
is linearly separable can be seen in Figure 6.6. Let’s suppose that we achieve this with
a function Φ(x). Now, having mapped the input space to the feature space, we can find
the maximum separating hyperplane in the feature space. However, the feature space
is used only in terms of inner products between vectors, like Φ(x_i) · Φ(x_j). Fortunately,
we can avoid the very expensive computation of Φ(x_i), Φ(x_j) and Φ(x_i) · Φ(x_j) in the
higher (even infinite) dimensional space. This is achieved with a kernel function k
such that k(x_i, x_j) = Φ(x_i) · Φ(x_j). We can use the kernel function instead of computing
the inner products to train the SVM and find the best separating hyperplane in the
higher dimensional space. Nothing else changes in the solution, we just substitute the
inner product x_i · x_j with Φ(x_i) · Φ(x_j). A widely used kernel is the gaussian:

k(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ) (6.20)

This actually corresponds to an implicit computation in an infinite dimensional feature
space. This is the kernel used in our experiments also.
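A sketch of this kernel (NumPy; σ is the free bandwidth parameter):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma):
    """k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)), equation (6.20)."""
    d2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
k_same = gaussian_kernel(a, a, sigma=1.0)   # identical points give 1.0
k_diff = gaussian_kernel(a, b, sigma=1.0)   # ||a - b||^2 = 2, so exp(-1)
```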
Having examined the basic concepts of SVMs, it is time to talk about SVR. First, let’s
remember that in classification we tried to keep the data outside the band between the
hyperplanes H₊ and H₋. For the regression task something like that is not applicable;
instead, we want to have the data as close as possible to the predicted value, but not
necessarily exactly on it: we allow some margin. For this reason we use the
ε-insensitive error function:

E_ε(z) = |z| − ε if |z| > ε, and E_ε(z) = 0 otherwise
The ε-insensitive error function is plotted in Figure 6.7, where we can see the ε-tube
that surrounds the prediction line. The ε-insensitive error function tolerates errors
smaller than ε. In addition, there is again a soft margin: the target value is allowed to
differ from the predicted value by more than the quantity ε, but this again incurs a
penalty. The penalty is again expressed using slack variables, but this time we have
two, one for overpredicting and one for underpredicting, ξ_i and ξ̂_i.
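The ε-insensitive error function itself is a one-liner (a sketch; ε is the tolerated error):

```python
def eps_insensitive(z, eps):
    """E_eps(z) = |z| - eps if |z| > eps, else 0: errors inside the tube cost nothing."""
    return max(abs(z) - eps, 0.0)

inside = eps_insensitive(0.02, eps=0.03)   # within the tube: no penalty
outside = eps_insensitive(0.10, eps=0.03)  # only the excess over eps is penalised
```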
The function that we want to minimize to get the solution is now:

J = ‖w‖² + C ∑_{i=1}^{n} (ξ_i + ξ̂_i) (6.21)

under the constraints:

y(x_i) − t_i ≤ ε + ξ_i (6.22)
t_i − y(x_i) ≤ ε + ξ̂_i (6.23)

and the form of the solution is:

w = ∑_{i=1}^{n} β_i x_i (6.24)

Again, many of the β_i are zero. The data points that are at a distance smaller than ε
from the prediction, i.e. inside the tube, have β_i zero. The rest are the support vectors.
Then we compute the prediction
Figure 6.7: The epsilon function
for a new input x with:

y(x) = ∑_{i=1}^{n} β_i (x_i · x) + b (6.25)
Again this can be kernelized so that the output is a non-linear function of the inputs.
As in the classification problem, nothing else changes; we just substitute the inner
products in the input space with inner products in the feature space.
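Given a set of support vectors, coefficients β_i and a bias b, the kernelized form of (6.25) can be sketched as follows (NumPy; the β values shown are arbitrary illustrations, not a trained model):

```python
import numpy as np

def rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, betas, b, kernel=rbf):
    """y(x) = sum_i beta_i k(x_i, x) + b: equation (6.25) with kernelized inner products."""
    return sum(beta * kernel(sv, x) for sv, beta in zip(support_vectors, betas)) + b

svs = [np.array([0.0]), np.array([1.0])]
betas = [0.5, -0.25]   # illustrative coefficients, not fitted
# k(sv1, x) = 1, k(sv2, x) = exp(-0.5), so y = 0.5*1 - 0.25*exp(-0.5) + 0.1
y = svr_predict(np.array([0.0]), svs, betas, b=0.1)
```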
6.2 Modelling design
6.2.1 Parameter optimization
We have described two classes of machine learning models. Indeed, we did not de-
scribe specific models but huge families of models. To do modelling we have to
determine their parameters; that is, we have to choose a specific model from the huge
space of models. There is clearly at least one setting of parameters for a given class
of models such that performance is maximized. In practice, though, finding the
specific parameter values that maximize performance is a very difficult, and in many
cases practically impossible, task.
For the MLP we have to determine the number of hidden neurons and the training stop
criterion, i.e. the required accuracy of the model on the training set. For SVR we have
to determine the size of the ε-tube, the tradeoff parameter C and the width σ of the
gaussian kernel.
Unfortunately, we cannot guarantee that optimizing one of the parameters while
keeping the others constant, and then doing the same with the others, will give the best
results. Since optimizing the parameters is a huge search problem, it is obvious that we
need to compromise between the cost of finding a good solution and the accuracy of
the final model.
Since we cannot consider all possible settings for the parameters (they are infinite), we
simply tried to explore and find the best setting we could, using values around the usual
ones for similar problems. In practice, we set up a small program that tries many
different parameter settings within a range of reasonable values. We used 3-fold cross-
validation to estimate the suitability of an SVR model and 10-fold cross-validation for
MLPs; even 3-fold cross-validation was very expensive for SVR. To have directly
comparable results we would have preferred 10-fold cross-validation for SVR as well,
but this would have been too expensive.
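The cross-validation loop can be sketched as follows (NumPy; `fit` and `predict` are placeholders for whichever model is being evaluated, and the mean predictor below is purely illustrative):

```python
import numpy as np

def kfold_mse(X, y, k, fit, predict):
    """Average held-out MSE over k folds; each fold is used once as the test set."""
    idx = np.arange(len(y))
    errors = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        errors.append(np.mean((pred - y[test]) ** 2))
    return np.mean(errors)

# illustrative use with a trivial mean predictor
X = np.arange(30.0).reshape(-1, 1)
y = np.arange(30.0)
err = kfold_mse(X, y, k=3,
                fit=lambda Xt, yt: yt.mean(),
                predict=lambda m, Xs: np.full(len(Xs), m))
```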
6.2.2 Time series modelling
We must not forget that the Harwood data set consists of half-hourly measurements.
Therefore, it could be modelled using a time-series framework. Indeed we can expect
that this would give better results. The reason for that is that plants react almost im-
mediately to changes in their environment but not instantly. Usually, some time passes
until the results are apparent. Of course, to check this properly we would also have to
optimize the number of time steps used for modelling, or include variables from pre-
vious time points in the feature selection process, which we did not do. We will, how-
ever, try 3-step and 5-step time-series modelling. Therefore, considering that we will
also try smoothed and unsmoothed data, we now have six combinations for each model:
1. Not smoothed, no time series modelling
2. 5 point smoothing, no time series modelling
3. Not smoothed, 3-step time series modelling
4. 5 point smoothing, 3-step time series modelling
5. Not smoothed, 5-step time series modelling
6. 5 point smoothing, 5-step time series modelling
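Building the lagged inputs for the p-step variants amounts to stacking shifted copies of the design matrix. A sketch (NumPy; the function name is ours):

```python
import numpy as np

def lagged_design(X, steps):
    """Concatenate X at times t, t-1, ..., t-steps+1 so each row sees `steps` time points."""
    n = len(X) - steps + 1
    return np.hstack([X[i:i + n] for i in range(steps - 1, -1, -1)])

X = np.arange(12.0).reshape(6, 2)   # 6 half-hourly rows, 2 variables
X3 = lagged_design(X, steps=3)      # 4 rows, 6 columns: current values plus 2 lags
```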
6.2.3 Performance evaluation
The Mean Squared Error that was used for choosing between models during feature
selection can also be used here for picking the best model. It does not, though, give a
good idea of how good the fit is. Therefore, we will also use R², the explained variance,
for which a high value is desired. In addition, the bias-to-variance ratio is presented;
this is equal to 1 − R², and a low value is desired.
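Computed on a held-out fold, the two measures can be sketched as (NumPy; the example targets and predictions are illustrative):

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; the bias-to-variance ratio is then 1 - R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
r2 = explained_variance(y_true, y_pred)   # 0.98 for this example
bvratio = 1.0 - r2                        # low values are better
```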
It is worth noting that we compute this always on the test part of the cross validation
procedure so that we get a real estimate of the performance of the model. After com-
puting the measure of the performance for each fold we average them to get an overall
estimate of the performance. This way we can be quite sure that the performance of
the models on new data is reflected in that result.
6.3 Results
For comparison purposes, we first give the results of modelling using MLR. This will
help us see whether the more complicated, non-linear machine learning models really
improve the prediction, and will give an indication of whether the idea of doing feature
selection on the linear model and then using a non-linear model for more precision was
right. Next we present the results for MLPs and SVR; for those we only present the
performance measures for the best models found.
6.3.1 Multiple Linear Regression
MLR modelling achieves rather good performance. The best model used the smoothed
variables and 3-step time series, achieving 78% explained variance. This improves a
little on non-time-series modelling with smoothed data, which had almost 76%
explained variance. In every case, modelling with smoothed data is more accurate than
with unsmoothed data. Also, adding 2 more steps in time-series modelling does not
increase the performance. The results can be seen in Table 6.1.
No smoothing                            Best R² = 0.6614   MSE = 11.5091   bvratio = 0.3386
5-point smoothing                       Best R² = 0.7595   MSE = 7.1977    bvratio = 0.2405
No smoothing, 3-step time series        Best R² = 0.7110   MSE = 9.8158    bvratio = 0.2890
5-point smoothing, 3-step time series   Best R² = 0.7818   MSE = 9.8158    bvratio = 0.2182
No smoothing, 5-step time series        Best R² = 0.7147   MSE = 9.6888    bvratio = 0.2853
5-point smoothing, 5-step time series   Best R² = 0.7817   MSE = 6.5287    bvratio = 0.2183

Table 6.1: The results for MLR
6.3.2 Multilayer Perceptron
MLPs improved a little on the performance of MLR, though probably not as much as
we expected. Again the smoothed data give more accurate models, but here time-series
modelling has only a small impact on performance. The results for MLPs are
presented in Table 6.2.
No smoothing                            Best R² = 0.7456   NHU = 12   Stop MSE = 0.01    MSE = 8.6599   bvratio = 0.2544
5-point smoothing                       Best R² = 0.8257   NHU = 29   Stop MSE = 0.05    MSE = 5.2196   bvratio = 0.1743
No smoothing, 3-step time series        Best R² = 0.7620   NHU = 9    Stop MSE = 0.001   MSE = 8.0949   bvratio = 0.2380
5-point smoothing, 3-step time series   Best R² = 0.8260   NHU = 17   Stop MSE = 0.05    MSE = 5.2144   bvratio = 0.1740
No smoothing, 5-step time series        Best R² = 0.7641   NHU = 12   Stop MSE = 0.001   MSE = 8.0168   bvratio = 0.2359
5-point smoothing, 5-step time series   Best R² = 0.8276   NHU = 16   Stop MSE = 0.001   MSE = 5.1575   bvratio = 0.1724

Table 6.2: The results for MLPs
6.3.3 Support Vector Regression
SVR is also distinctly better than MLR and in most cases at least almost as good
as MLPs. SVR also gave the most accurate model: 83% explained variance with
smoothed data and no time-series modelling. As with MLPs, time-series modelling
had little effect on performance and smoothed data gave more accurate models. It is
worth noting that finding appropriate values for the parameters σ, ε and C was very
difficult: much more difficult than finding a good number of hidden neurons or the
stop accuracy for the MLP. SVR was also much slower than both previous methods: a
3-fold run took a few hours and a 10-fold run took even more time. This could explain
why five-step time-series modelling did not do very well. It is also quite probable that
for the other experiments we did not find the optimal parameter settings. However, we
expected better models from SVR. The results for SVR can be seen in Table 6.3.
No smoothing                            Best R² = 0.7541   σ = 110   ε = 0.03   C = 10   MSE = 8.3649   bvratio = 0.2459
5-point smoothing                       Best R² = 0.8299   σ = 140   ε = 0.03   C = 10   MSE = 5.0893   bvratio = 0.1701
No smoothing, 3-step time series        Best R² = 0.7465   σ = 110   ε = 0.03   C = 10   MSE = 8.6076   bvratio = 0.2535
5-point smoothing, 3-step time series   Best R² = 0.8147   σ = 100   ε = 0.03   C = 5    MSE = 5.5435   bvratio = 0.1853
No smoothing, 5-step time series        Best R² = 0.7067   σ = 120   ε = 0.03   C = 10   MSE = 9.9766   bvratio = 0.2933
5-point smoothing, 5-step time series   Best R² = 0.8212   σ = 120   ε = 0.03   C = 10   MSE = 5.3523   bvratio = 0.1788

Table 6.3: The results for SVR
6.4 Summary
Three modelling methods were tried: MLPs, SVR and MLR. MLR was used for com-
parison with the other two and to see whether the semi-wrapper approach was reason-
able. Indeed, SVR and MLPs improved the performance compared to MLR, but not
as much as we expected. MLPs were slightly better than SVR, but not in every case.
The best model achieved almost 83% explained variance; this was an SVR model
and used 5-point smoothed data and no time-series modelling. It was expected that
SVR would clearly outperform MLPs, but it did not. The reason for this was that SVR
was very slow and very difficult to configure: we had to choose three different
continuous-valued parameters.
Chapter 7
Conclusions and further work
Having presented the final modelling and the results, it is necessary to discuss whether
the goal of this project was fulfilled, reexamine our decisions throughout this study, see
how we could possibly improve things, and propose ideas for further work and research
on this problem.

In this chapter we first discuss our final conclusions about what has been done and
what has been achieved, then we talk about future work, and finally we discuss the
automated model extraction approach that we have used.
7.1 Conclusions
First we have to ask whether the goal of the project was fulfilled. Considering that our
initial goal was to investigate the use of automated machine learning techniques on the
carbon flux prediction problem, our goal was indeed achieved. We have found models
that represent the problem, if not perfectly (we cannot be sure of something like that
anyway), at least adequately well. MLP and SVR models have been found with per-
formance almost as good as other work (Stubbs, 2002; van Wijk and Bouten, 1999).
Of course the results are not really comparable, since Stubbs and van Wijk modelled
the process over a shorter period of time and used hand-picked variables. If we also
take into account that we used no prior knowledge at all, for example in choosing the
variables or in using some known relationship between input variables to fill the miss-
ing values, and that we built our models on a period of about 18 months of observa-
tions, this study has been successful.
It will be useful, though, to go through all the processing steps again to see how they
affected the results and whether our choices were ultimately justified. This discussion
will be limited to the Harwood data set.
First of all, let’s talk about gap filling. The method used was quite powerful: under
the assumptions discussed in chapter 4 it appears to have done quite well. Of course
we cannot be sure that the data was not biased by the gap filling process, but as ex-
plained earlier this is something we cannot do much about. We think that the EM
algorithm is adequate for this study.
As far as noise removal is concerned, our choice was clearly not the best possible.
In particular, we should have smoothed each variable independently rather than de-
riving the number of averaging points for all variables from just a few of them, since
some variables are less noisy than others. Given the short amount of time available
this is acceptable, but it should be avoided in future studies.
For dimensionality reduction, the idea of using the linear model to evaluate the rel-
evance of features is justified as far as computational cost is concerned, but as we
said it is clearly not the optimal solution. Given the size of the real problem, though,
it was a reasonable decision. It could well be that the direct wrapper approach gives
significantly better results than the semi-wrapper approach, and it is therefore worth
trying despite the cost. Nevertheless, since our approach is used by other people with
good results, we accept that sacrificing a probably better solution for decreased com-
putational cost is worthwhile. Let us remember, anyway, that our goal was to explore
for good models that represent the process sufficiently well, not necessarily the best
one, which is also very hard to find.
In addition, the decision to ultimately use the first four variables selected by forward
selection is a little arbitrary, although it was based on inspection of the error curve
(Figure 5.2). It could be that the first five variables are indeed much better for the
non-linear problem, i.e. the seemingly small improvement from adding more variables
could mean that much more information essential to the modelling task is contained
in the set of input variables. In that case, the non-linear method could probably find a
much better model. Thus, the important question is whether all the information nec-
essary for modelling the problem is contained in the selected variables. The answer is
probably yes, since we found adequately good models based on this set of variables.
Considering also that it should be possible to find even better models with better pa-
rameters (especially for SVR), it looks like the necessary information is probably there.
Nevertheless, we cannot be sure of this either. An investigation of the residuals of our
best models could shed some light on it.
For the modelling part, as we just said, it is indeed very hard to find the optimal model
given a set of input and output variables. We searched the space of models as thor-
oughly as we could given the time constraints, trying various parameter settings in a
plausible range of values. Nevertheless, it is almost certain that we have not found the
best possible models given the input and target variables.

All this can be summarised by saying that we feel confident that the process was
modelled adequately well, but there is still much room for improvement. We propose
some ways to achieve this in the next section.
7.2 Future work
The most important direction worth pursuing is to apply the methods that have been
described to a bigger data set, like Griffin, but of course one without spurious vari-
ables; this would give much more useful conclusions. Below, however, we discuss
possible improvements to the various parts of the analysis.
7.2.1 Data cleaning
For both data cleaning tasks that were performed, we could try other approaches that
might give better results.
First of all, for gap filling using the EM algorithm there are two options that should
enhance the quality of the filled values. One is taking temporal covariability into ac-
count, meaning that the (much larger) covariance matrix and mean vector would be
computed using relationships between variables at different time points as well. More
information would thus be incorporated into the estimation, so we could expect more
accurate estimates; this is described in (Schneider, 2001). The other is to relax the
rather strong assumption that the data follows a simple gaussian distribution. It would
be more realistic to assume that the data follows a mixture of distributions, i.e. to sup-
pose a mixture model for the data. This is quite reasonable, since the variables prob-
ably behave quite differently across seasons, i.e. seasons form clusters, each with a
different distribution. The way to do this is described in (Ghahramani and Jordan,
1994b) and (Ghahramani and Jordan, 1994a).
For noise removal, too, there are many methods that could give more accurate results,
and they have already been mentioned. The most important point, as already said, is
to consider each variable separately.
7.2.2 Dimensionality reduction
It would be worth considering a feature extraction method like PCA or LLE. It is
possible that these methods would retain the variability and information important for
modelling. Although models using extracted features would be even harder to interpret
than the ones we already have, this is certainly an approach we could consider.

Also, since we ultimately wanted to try time-series modelling, the feature selection
algorithms should take this into account, i.e. we could let the algorithms also select
from variables corresponding to previous time steps.

It would also be reasonable to try a larger number (or the whole set) of the features
selected by the algorithms, to see whether any information missing from our models
exists there.
7.2.3 Modelling
As far as modelling is concerned, apart from the obvious choice of searching for better
parameters, a useful improvement would be to optimise the number of steps in time-
series modelling. It would also be useful to try models that include a ‘time of day’
variable.

Furthermore, we could try different methods that look appropriate for our problem,
such as Radial Basis Function networks (Orr, 1996) or Hidden Markov Models (Ra-
biner and Juang, 1986), and we could also explore different design options for the
methods that we used: in MLPs we could try different transfer functions and in SVR
different kernels.
Finally, it would be interesting to combine our methods with some existing parametric
model by modelling the residual of the parametric model with one of our methods, as
in (Dekker et al., 2001). Of course, we could also model the residuals of our own
methods, to see whether they contain any information that we have not taken into
account when building the initial models.
7.3 General discussion
We have seen that totally automated extraction of models from features is feasible.
Indeed, the models were quite good although we used no prior knowledge about the
problem. The question is whether a totally automated approach like this is always
applicable, and when it is better to use prior knowledge. Usually prior knowledge
about the problem gives better results. Nevertheless, this is not always true: our
knowledge can be uncertain or incomplete (it may well be so in our problem), and
then we should probably explore the model space without using it. It is important to
note, though, that in this case modelling is more difficult, since there are more choices
to consider. The final conclusion is that in every case a very good understanding of
the problem is required, and the modeller has to think very seriously about it.
We have to point out, however, that modern machine learning techniques make mod-
elling without the use of prior knowledge more effective. Their flexibility and wide
applicability make them an ideal tool for modelling without prior knowledge. They
can help to discover new knowledge from raw data, on the condition that the data
contains no inconsistencies (such as those in the Griffin data set).
Bibliography
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University
Press.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition.
Data Mining and Knowledge Discovery, 2(2):121–167.
Chen, D., Hargreaves, N., Ware, D., and Liu, Y. (2000). A fuzzy logic model with ge-
netic algorithms for analyzing fish stock-recruitment relationship. Canadian Journal
of Fisheries and Aquatic Sciences.
Christianini, N. and Shawe-Taylor, J. (2000). An introduction to support vector ma-
chines. Cambridge University Press.
Collobert, R., Bengio, S., and Mariethoz, J. (2002). Torch: a modular machine learning
software library. Technical report, IDIAP.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning,
20(3):273–297.
Dekker, S., Bouten, W., and Schaap, M. (2001). Analysing forest transpiration model
errors with artificial neural networks. Journal of Hydrology, (246):197–208.
Fujikawa, Y. (2001). Efficient algorithms for dealing with missing values in knowledge
discovery. Master’s thesis, School of Knowledge Science, Japan Advanced Institute
of Science and Technology.
Ghahramani, Z. and Jordan, M. I. (1994a). Learning from incomplete data. Technical
Report AIM-1509.
Ghahramani, Z. and Jordan, M. I. (1994b). Supervised learning from incomplete data
via an EM approach. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Ad-
vances in Neural Information Processing Systems, volume 6, pages 120–127. Mor-
gan Kaufmann Publishers, Inc.
Hall, M. (1999). Feature selection for discrete and numeric class machine learning.
Technical report, Department of computer science, University of Waikato.
John, G., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset se-
lection problem. In Machine Learning: Proceedings of the Eleventh International
Conference.
Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings
of the AAAI Fall symposium on relevance.
Lawrence, S., Giles, C., and Tsoi, A. (1996). What size neural network gives opti-
mal generalization? Convergence properties of backpropagation. Technical report,
Department of Electrical and Computer Engineering, University of Queensland.
Lek, S. and Guegan, J. (1999). Artificial neural networks as a tool in ecological modelling, an introduction. Ecological Modelling, 120:65–73.
Little, R. J. A. and Rubin, D. B. (1987). Statistical analysis with missing data. Wiley.
Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational
Geometry. MIT Press, Cambridge.
Mitchell, M. (1996). An introduction to genetic algorithms. MIT Press.
Orr, M. J. (1996). Introduction to radial basis function networks. Technical report, Centre for Cognitive Science, University of Edinburgh.
Parrot, L. and Kok, R. (2000). Incorporating complexity in ecosystem modelling. Complexity International.
Pyle, D. (1999). Data preparation for data mining. Morgan Kaufmann.
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, pages 4–15.
Recknagel, F. (2001). Applications of machine learning to ecological modelling. Ecological Modelling, 146:303–310.
Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate.
Smola, A. J. and Schölkopf, B. (1998). A tutorial on support vector regression.
Stubbs, A. (2002). Modelling ecological data using machine learning. Master’s thesis,
School of Informatics, University of Edinburgh.
van Wijk, M. and Bouten, W. (1999). Water and carbon fluxes above European coniferous forests modelled with artificial neural networks. Ecological Modelling, 120:181–197.
Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley.
Vrugt, J., Bouten, W., Dekker, S., and Musters, P. (2002). Transpiration dynamics of an Austrian pine stand and its forest floor: identifying controlling conditions using artificial neural networks. Advances in Water Resources, 25:293–303.
Wallet, B. C. et al. (1996). A genetic algorithm for best subset selection in linear regression.
Whigham, P. and Recknagel, F. (2001). An inductive approach to ecological time series modelling by evolutionary computation. Ecological Modelling, 146:275–287.
Yang, J. and Honavar, V. (1999). Feature subset selection using a genetic algorithm.
Artificial Intelligence Group, Iowa State University.