Applying machine learning
techniques to ecological data
Georgios Petkos
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2003
Abstract
This thesis is about modelling carbon flux in forests from meteorological variables
using modern machine learning techniques. The motivation is to better understand the
carbon uptake process of trees and to find its driving factors using fully automated
techniques. Data from two British forests (Griffin and Harwood) were used, but final
results were obtained only with Harwood, because the Griffin data set contained
spurious variables. Both data sets presented significant challenges: missing values,
noise and high dimensionality. The missing value problem was addressed with the
regularized EM algorithm, whereas for filtering out noise, n-point moving averages
were used. A range of different ‘semi-wrapper’ methods and a filter method were used
for dimensionality reduction: forward selection, backward elimination, best ascent hill
climbing, genetic algorithms, evolutionary strategies and correlation-based feature
selection. Modelling was done with Multiple Linear Regression, Multilayer Perceptrons
and Support Vector Regression. The best model found explained at most 83% of the
variance. Support Vector Regression and Multilayer Perceptrons had almost the same
performance and were better than Multiple Linear Regression, since they managed to
capture non-linear details of the process.
Acknowledgements
I would like to thank the Bodossaki Foundation for funding my studies. It would not
be possible for me to be here without the Foundation’s help and I am grateful for the
chance that I was given.
I would also like to thank my supervisor, Dr. John Levine, for his advice and support
when times were hard.
Many thanks to all the zata people: Jordi, Fransisco, Roman, Vicky, Alexandros,
Stathis! But most of all to Ioanna, Christophoros, Nikos, Giorgos. Finally, a big
thanks to Bill Steer, Jeff Walker, Ken Owen, Aaron Stainthorpe, Martin Powell, Andy
Craighan and Dan Swano for inspiration through the years.
Declaration
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Georgios Petkos)
To my mother, who tired herself so much through the years to teach me so many things.
Table of Contents
1 Introduction 1
1.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of the chapters . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Related work 3
2.1 Ecological Informatics . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Previous work on flux prediction problems . . . . . . . . . . . . . . . 4
3 The data 7
3.1 The Griffin data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 The Harwood data set . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Data cleaning 13
4.1 The missing value problem . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 Methods for handling missing values . . . . . . . . . . . . . . . . . . 16
4.2.1 List-wise deletion . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.2 Substitution with mean value . . . . . . . . . . . . . . . . . . 17
4.2.3 Nearest neighbor method . . . . . . . . . . . . . . . . . . . . 18
4.2.4 Regression methods . . . . . . . . . . . . . . . . . . . . . . 19
4.2.5 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 The EM algorithm for gap filling . . . . . . . . . . . . . . . . . . . . 20
4.4 Suitability of the EM algorithm to our problem . . . . . . . . . . . . 23
4.5 Noise / smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5 Dimensionality reduction 29
5.1 The dimensionality reduction problem . . . . . . . . . . . . . . . . . 30
5.2 The semi-wrapper methods . . . . . . . . . . . . . . . . . . . . . . . 32
5.2.1 Forward selection . . . . . . . . . . . . . . . . . . . . . . . . 33
5.2.2 Backward elimination . . . . . . . . . . . . . . . . . . . . . 34
5.2.3 Best ascent hill climbing . . . . . . . . . . . . . . . . . . . . 35
5.2.4 Genetic search . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2.5 Evolution strategies . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 The filter method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.1 Correlation-based feature selection . . . . . . . . . . . . . . . 40
5.4 Results of the feature selection methods . . . . . . . . . . . . . . . . 41
5.4.1 Feature selection results for the Griffin data set . . . . . . . . 41
5.4.2 Feature selection results for the Harwood data set . . . . . . . 43
5.5 Discussion about the results of feature selection . . . . . . . . . . . . 45
6 Modelling 46
6.1 Modelling techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.1.1 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . 47
6.1.2 Multilayer Perceptrons . . . . . . . . . . . . . . . . . . . . . 48
6.1.3 Support Vector Regression . . . . . . . . . . . . . . . . . . . 51
6.2 Modelling design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.2.1 Parameter optimization . . . . . . . . . . . . . . . . . . . . . 56
6.2.2 Time series modelling . . . . . . . . . . . . . . . . . . . . . 57
6.2.3 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 58
6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.3.1 Multiple Linear Regression . . . . . . . . . . . . . . . . . . . 58
6.3.2 Multilayer Perceptron . . . . . . . . . . . . . . . . . . . . . 59
6.3.3 Support Vector Regression . . . . . . . . . . . . . . . . . . . 60
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7 Conclusions and further work 63
7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2.1 Data cleaning . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.2.2 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . 66
7.2.3 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 General discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Bibliography 68
List of Figures
3.1 Missing values in the Griffin data set . . . . . . . . . . . . . . . . . . 8
3.2 Missing values in the reduced Griffin data set . . . . . . . . . . . . . 10
3.3 Missing values in the Harwood data set . . . . . . . . . . . . . . . . 12
4.1 Filled parts for the Fc and the m Ustr L0 L0 variables of the Griffin
data set. The variables maintain a behaviour in the filled part similar
to the one that is observed in the present part . . . . . . . . . . . . . . 24
4.2 Filled parts for the tFc and the tMeanU variables of the Harwood data
set. The variables maintain a behaviour in the filled part similar to the
one that is observed in the present part . . . . . . . . . . . . . . . . . 25
4.3 Unsmoothed, 3-point, 5-point and 10-point moving averages for the
first 300 points of the variable tmvpd of the Harwood data set . . . . . 27
5.1 MSE as forward selection adds variables for the Griffin data . . . . . 42
5.2 MSE as forward selection adds variables for the unsmoothed and 5-
point smoothed Harwood data . . . . . . . . . . . . . . . . . . . . . 44
6.1 The least squares solution given a set of observations (xi, yi) . . . . . . 47
6.2 A linearly separable problem (left) and a non-linearly separable prob-
lem (right) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.3 The output of a neuron with a sigmoid transfer function. . . . . . . . 49
6.4 A MLP with one input layer, one hidden layer and one output layer . . 50
6.5 (a) Some separating hyperplanes and the maximum margin one (b) The
maximum margin hyperplanes H, H1 and H2 . . . . . . . . . . . . . 52
6.6 Transformation from the input space to the feature space where the
problem is linearly separable . . . . . . . . . . . . . . . . . . . . . . 54
6.7 The epsilon function . . . . . . . . . . . . . . . . . . . . . . . . . . 56
List of Tables
5.1 The forward selection algorithm . . . . . . . . . . . . . . . . . . . . 34
5.2 The backward elimination algorithm . . . . . . . . . . . . . . . . . . 34
5.3 The best ascent hill climbing algorithm . . . . . . . . . . . . . . . . . 35
5.4 A genetic algorithm for feature selection . . . . . . . . . . . . . . . . 37
5.5 Evolutionary strategy for feature selection . . . . . . . . . . . . . . . 39
6.1 The results for MLR . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 The results for MLPs . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.3 The results for SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Chapter 1
Introduction
1.1 The problem
The task of this study is to model carbon flux in a forest based on various physical
measurements. This problem is of crucial importance in ecology, since a better under-
standing of the carbon uptake process of forests could potentially help us face one
of the main ecological threats, the greenhouse effect. Although this problem has been
much studied by ecologists and biologists, the methods used are mainly inflexible
parametric models that yield insufficiently accurate results. Our aim is to investigate
the application of more elaborate machine learning modelling techniques to the car-
bon flux prediction problem. Machine learning methods, and especially the ones that
we plan to use, Support Vector Regression and Multilayer Perceptrons, have the ability
to represent complex relationships between variables and therefore look promising for
our task.
The focus of this study is on producing models without the use of prior knowledge about
the problem, in a totally automated way. That is, we have a collection of data and we
explore it to try to find good models. Thus, the problem is seen from the point of view
of a non-expert in relevant biological issues. We make no modelling decisions based
on knowledge of the process of photosynthesis or transpiration, for example. We just
explore collected data sets looking for models that explain the observed values for car-
bon flux.
Two data sets were used: the Griffin and the Harwood data set. Focus was initially on
Griffin, but we had to shift to Harwood. The challenges that these data sets presented
were many: missing values, noise, spurious variables, selection of relevant variables,
difficulties in training the models. All of these are thoroughly presented in the next
chapters, an overview of which is given below.
1.2 Overview of the chapters
Chapter 2 introduces the general field of Ecological Informatics, in which this project
is situated, and describes some previous work on the application of machine learning
techniques to carbon flux prediction in forests. We also point out some issues about
previous studies and their limitations.
Chapter 3 gives the basic information about the two data sets used, the Griffin and the
Harwood data set. The basic outline of data processing is sketched based on the prop-
erties of the data sets, and we get a clear idea of the obstacles that we have to face in
order to find good models.
Chapter 4 discusses two important steps of preprocessing: handling missing values and
noise removal. Both are part of the data cleaning part of processing. This is necessary
for making modelling possible. It is also an extremely sensitive procedure since it can
alter the real information contained in the data and lead to wrong models.
Chapter 5 discusses the problem of dimensionality reduction and selection of relevant
features from the data. Various methods are presented. Finally we decide about the
features that will be used for the final modelling part.
Chapter 6 describes the machine learning modelling techniques that were used, Sup-
port Vector Regression and Multilayer Perceptrons, along with Multiple Linear Re-
gression that was used for feature selection and for comparative modelling. We also
discuss our modeling design choices and present the results of modelling.
Finally, chapter 7 critically discusses the whole process of data analysis, presents some
final conclusions for our study and sketches out some directions for further research
and improvement of the methods used.
Chapter 2
Related work
2.1 Ecological Informatics
The modelling problem of this project belongs to the novel field of Ecological In-
formatics. A definition of Ecological Informatics appearing on the web site of the
International Society For Ecological Informatics1 is:

    Ecological Informatics is defined as interdisciplinary framework promoting
    the use of advanced computational technology for the elucidation of principles
    of information processing at and between all levels of complexity of ecosystems
    - from genes to ecological networks - and aiding transparent decision-making
    in relation to important issues in ecology such as sustainability, biodiversity
    and global warming.
Therefore, Ecological Informatics concerns the use of modern modelling and com-
puting techniques on problems of ecological interest. In this context, modern artifi-
cial intelligence techniques and in particular machine learning techniques should be
applicable. Indeed, machine learning techniques are becoming a useful tool for peo-
ple working in the field. Researchers have applied artificial neural networks (Lek and
Guegan, 1999), evolutionary algorithms (Whigham and Recknagel, 2001), cellular au-
tomata (Parrot and Kok, 2000) and fuzzy logic (Chen et al., 2000), among others, to
various ecological modelling problems with good results.
Although Ecological Informatics is a novel field and there hasn’t been extensive work
1http://www.waite.adelaide.edu.au/ISEI/
so far, it looks like there is a constantly growing interest in the application of various
machine learning techniques to ecology-related problems. This is reflected in the activ-
ity of the International Society For Ecological Informatics, that has already organized
three conferences (1998, 2000 and 2002). The first conference focused on the applica-
tion of artificial neural networks to ecological modelling problems (Lek and Guegan,
1999). However, in the next two conferences, work with a much wider variety of meth-
ods was presented, indicating the growing interest of researchers of the area in novel
techniques (Recknagel, 2001).
2.2 Previous work on flux prediction problems
The range of ecosystems to which such modelling methods have been successfully
applied is quite broad: lakes, rivers, forests and many more. It seems promising that
machine learning methods can be useful for a large variety of problems. Nevertheless,
there is not much previous work on problems related to flux prediction in forests.
Below we outline the related studies that were found.
Artificial neural networks were used in (Vrugt et al., 2002) to identify the main driving
factors, contained in a small set of variables, that determine the levels of forest floor
water transpiration and total forest water transpiration. The best models for forest floor
water transpiration achieved about 80% explained variance using as inputs global
radiation, air temperature and average water content between 0 and 2 metres below
the surface of the ground, whereas the best models for total forest water transpiration
achieved 84% explained variance using as inputs global radiation, air temperature and
average water content between 0 and 0.5 metres below the surface of the ground.
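The ‘explained variance’ figure quoted in these studies (and later in this thesis) is the usual R² statistic. A minimal sketch of how it is computed (the function and variable names are ours, not from any of the cited studies):

```python
def explained_variance(y_true, y_pred):
    """Fraction of the variance of y_true accounted for by y_pred (R^2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# A perfect model explains all of the variance:
print(explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # -> 1.0
```

A model that always predicts the mean of the targets scores 0, so the 80–90% figures above mean the models capture most, but not all, of the variation.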
Another example of application of a machine learning method to a water transpiration
problem can be found in (Dekker et al., 2001). In this study the residual of a parametric
model for forest transpiration (Single Big Leaf) was modelled using artificial neural
networks. It was found that part of the trend in the residuals was explained by wind
direction and speed, which were not included in the parametric model. It was also
found that a big part of the residuals could be attributed to noise produced by the
measurement method.
A study more closely related to our problem can be found in (van Wijk and Bouten, 1999).
van Wijk models the water and carbon flux of a forest using artificial neural networks. He
is interested in finding site-independent models; this is the ideal goal, which we could
not pursue in our study. The inputs to the artificial neural network he used were:
radiation, temperature, vapour pressure deficit and time of day. He used different
combinations of these, and for the carbon models he also used a ‘Leaf Area Index’
variable in order to make site-independent prediction possible. He also did a lot of
preprocessing, removing data points that would probably be prone to noise, making the task
easier. Water flux predictions were more accurate than carbon flux predictions. The
best model for water flux had 90% explained variance and the best model for carbon
flux had 87% explained variance. It is worth noting, though, that the data used in that
model were limited to a period of 41 days. Getting good results in this case was not
very difficult, since over such a short time there are no significant seasonal changes
and carbon flux probably maintains a regular behaviour. On the other hand, modelling
the process over a longer period of time, as in our experiments, is much more
difficult, since big changes in the behaviour of the forest appear at various times of the
year according to climatic changes.
Finally, (Stubbs, 2002) is an MSc thesis dealing with carbon flux prediction. In this
study we find the first application of Support Vector Regression to a carbon flux
prediction problem. Stubbs also used Multiple Linear Regression for comparison. The
variables used were the same that van Wijk used, except for Leaf Area Index, and var-
ious combinations of them were tried. The data set used was the Harwood data
set, one of the data sets that were also used here, which will be presented in the
next chapter. The best Support Vector Regression model found explained 89% of the
variance, but again it was limited to data that corresponded to observations of a short
period, about one and a half months.
In all those studies, various combinations of only three or four basic variables were
used for prediction. In all of them, there was some prior knowledge about which
factors could be important for modelling the carbon uptake process. The different
approach that our study introduces is that we use no prior knowledge in model building.
We do not choose the input variables of our models using any knowledge about the
process and we just search for the critical variables.
Chapter 3
The data
Two data sets were used, the Griffin data set and the Harwood data set. Our initial
intention was to use the Griffin data set and face the big challenges it presented. Un-
fortunately though, at the end of preprocessing, it was found that it contained spurious
variables. The exact details of this are unknown but at least some of the variables were
found to be just duplicates of others with very little statistical processing. This made
all results obtained thus far invalid. Finding the spurious variables and excluding them
would have been possible but very expensive. Thus, given the short amount of time available
when this was discovered, it was decided to stop working with the Griffin data set. At
that point, we switched to the smaller Harwood data set and repeated the preprocessing
steps. Nevertheless, in this text we will describe work done on both data sets because
various interesting issues appeared for both of them. In this chapter we will discuss
their basic properties and sketch briefly further processing steps for both of them.
3.1 The Griffin data set
The Griffin data set1 comes from Griffin Forest, located at Aberfeldy in Scotland.
This is a forest with 20 year old trees and the dominant species is Sitka spruce (Picea
sitchensis). The collection of the data was part of the EUROFLUX project that collects
carbon flux and related meteorological data from various forests around Europe.
1http://carbodat.ei.jrc.it/data_arch_site_indiv.cfm?db_id=11
It is a large data set with 310 variables and 102645 datapoints. The meanings of the
variables were not known to us, but they were supposed to be half-hourly measurements
spanning a period of six years, plus physical quantities computed from the mea-
sured variables. We only knew the meaning of two variables: one was carbon flux,
and the other was a corrected version of carbon flux that should be excluded from our
analysis. Unfortunately, some of the other variables were also spurious.
Many issues arose when we decided to use the Griffin data set. First of all was its
size: initially it was about 240 MB. Sophisticated algorithms are computationally
very expensive for a data set of this magnitude. In addition, it had many missing val-
ues; about 40% of the values were missing. Considering the fact that the modelling
tools that we intend to use do not work with missing values, an appropriate method for
handling missing values was essential. Figure 3.1 can give an idea about the missing
value problem in the Griffin data set. Every variable is a row and time goes through the
horizontal axis, and a white point represents a missing value.

Figure 3.1: Missing values in the Griffin data set

It is obvious that at the beginning and the end of the six-year period, the data are
rather sparse. Given the size of the data
set and the cost of further processing, it was decided to ignore those big empty parts.
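The decision of which time regions to cut can be guided by a simple computation of the fraction of missing entries per time slice. The following is an illustrative sketch (not the actual code used in this project), assuming the data are held as a time-by-variables NumPy array with NaN marking the gaps:

```python
import numpy as np

def missing_fraction_per_window(data, n_windows):
    """Fraction of NaN entries in each of n_windows equal time slices
    of a (time x variables) array."""
    windows = np.array_split(data, n_windows, axis=0)
    return [float(np.isnan(w).mean()) for w in windows]

# Toy example: the second half of the series is entirely missing.
x = np.ones((100, 3))
x[50:, :] = np.nan
print(missing_fraction_per_window(x, 2))  # -> [0.0, 1.0]
```

Slices whose missing fraction is far above the overall average are the natural candidates for discarding, which is essentially what the visual inspection of Figure 3.1 suggested for the two ends of the record.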
One could argue that this is a bad decision. Doing this, we ignore information that is
present in the data set and this could mean that our models may not be as accurate as
they could be. If we consider it again, though, cutting those two parts looks like a good
idea. This is because the data set describes a phenomenon in time, and the phenomenon
is almost periodic: similar patterns appear daily, and seasonal changes should
affect carbon flux in the same way every year. Considering also the fact that after this
reduction we still have about 60000 datapoints over a period of more than three and
a half years, it looks quite probable that information contained in the ignored part is
also present in the part that is kept. On the other hand, there could be some distinct
difference in the information contained in the part of the data set that was ignored and
the part that was kept. This could be, for example, just by luck or because the trees
get older. In addition, if we plan to do gap filling, we can expect that for datapoints
with very few present values, like the ones that appear at the beginning and the end of
the six-year period, the missing values will probably not be very well estimated. This
could result in erratic models and further justifies the decision to discard the largely
empty parts at the beginning and the end of the six-year period.
Furthermore, it was found that a few variables had zero variance, i.e. they had a constant
value. Although we did not know their meaning, it was obvious that these variables
could not be good predictors of carbon flux, so they were discarded. Also, the variables
that had missing values for a very large part of the datapoints were discarded. This is
because in gap filling we make estimates of the missing values based on the relationship
between the variable that is missing and the others; if a variable is missing most
of the time, we do not have a good sample from which to learn that relationship, and our
estimates will be poor. It is also probable that relevant information contained in those
variables can be extracted from the others, since we already knew that most of the
variables have a strong relation with at least some of the others. We set the threshold
for discarding a variable at 97% missing values. All these reductions look quite drastic,
but there is a lot to be gained (a cleaner and smaller data set) and probably very little
to be lost (perhaps some loss of information).
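The two screening rules just described (discard zero-variance variables and variables with more than 97% missing values) can be sketched as follows. This is an illustration under the same NaN-for-missing convention, not the code actually used:

```python
import numpy as np

def screen_variables(data, max_missing=0.97):
    """Return column indices that survive the two screening rules:
    at most max_missing fraction of missing values and non-zero variance."""
    keep = []
    for j in range(data.shape[1]):
        col = data[:, j]
        present = col[~np.isnan(col)]
        if present.size == 0:
            continue
        if np.isnan(col).mean() > max_missing:
            continue  # too sparse to learn its relation to the other variables
        if np.var(present) == 0.0:
            continue  # constant variable, no predictive value
        keep.append(j)
    return keep

# Column 0 is constant, column 1 is 98% missing, column 2 survives.
rng = np.random.default_rng(0)
data = np.column_stack([
    np.full(100, 5.0),
    np.where(np.arange(100) < 98, np.nan, 1.0),
    rng.normal(size=100),
])
print(screen_variables(data))  # -> [2]
```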
After these reductions, the data set had 281 variables and about 60000 datapoints. Missing
values are still present to quite a large extent: about 15% of the values are missing. This can
be seen in Figure 3.2. Handling missing values is still an important issue and will be
discussed in the next chapter.

Figure 3.2: Missing values in the reduced Griffin data set
Another important issue is the high dimensionality of the data. There are too many
variables to use directly in a complex machine learning algorithm; the computational
cost of training would be very high. Also, it is quite probable that at least some of
the variables are not relevant to the modelling problem and should be excluded from
further analysis. Picking only the necessary variables also alleviates the problem of over-
fitting: the modelling algorithm could be misled by irrelevant variables and end up
modelling a false relationship between the target variable and the irrelevant ones.
Dimensionality reduction will be discussed in a later chapter.
Another issue was noise. Information about the source of noise in the data was not
available but since noise is almost always present in physical measurements, some-
thing should be done about it. This will be discussed later again, after we are done
with the missing value problem.
Concluding, we can say that the size of the Griffin data set, its high dimensionality
and the large number of missing values that appear with no regularity posed a very big
challenge.
3.2 The Harwood data set
Harwood2 was a much smaller data set. It had observations from 3 different sites, all
located in Northumberland, England. The objective of collecting these data was to
measure carbon flux for trees of different ages. The ‘d’ site had mostly weeds, the
‘h’ site had 7 year old trees and the ‘t’ site had 30 year old trees. We had 20 variables
measured for the ‘d’ site, 19 for the ‘h’ site and 18 for the ‘t’ site. This
time there were no spurious variables, which was checked by plotting the variables
together. For each of the sites there were 26581 datapoints, each one corresponding to
a half-hourly measurement. The data set covers a period of about 18 months.
Similarly to the Griffin data set, the Harwood data set had quite a lot of missing values,
as can be seen in Figure 3.3. The ‘d’ site and the ‘h’ site in particular are very sparse.
The data for the ‘t’ site were the most complete: about 80% of the values are present.
Therefore, it was decided that the data for the ‘t’ site would be used. Here we did not
consider reducing the data set directly, because there is no period that is very sparse,
nor any variables that are very sparse or have very low variance. Since this
is a smaller data set, such a reduction could lead to significant loss of information.
Handling the missing values was again essential before proceeding to modelling, and
noise removal was also an issue that had to be considered. In addition, dimen-
sionality reduction, although no longer essential from the computational cost point
of view, should be performed on this data set as well. We believe that not all variables
are important for modelling, and therefore better results will be obtained if we pick the
best of them.
Concluding, the Harwood data set was rather small and had different properties from
Griffin. The problem it posed was easier, but still quite challenging.
2http://www.bgc-jena.mpg.de/public/carboeur/sites/harwood.html
Figure 3.3: Missing values in the Harwood data set (d-site, h-site and t-site)
Chapter 4
Data cleaning
Missing values were a major issue in this study. The Griffin data set originally had
about 40% missing values (Figure 3.1), and after the partial reduction this decreased to
about 15% (Figure 3.2). A complete data set was needed for both the dimensional-
ity reduction and the modelling algorithms, therefore a method for handling missing
values had to be used. Considering the fact that an inappropriate method can bias
the information contained in the data set, so that further processing results in incorrect
models, it is clear that handling the missing values is a very important part of data
preprocessing.
For the Griffin data set, a few approaches were tried (simple linear modelling, artificial
neural networks and a nearest neighbor estimator) but finally a regularized Expectation
Maximization (EM) algorithm (Schneider, 2001) was used. For the Harwood data set,
only the EM algorithm was used, because it had already proved to be a good solution
on the Griffin data set and gave, based on basic statistics and visual inspection with
plots, seemingly good results.
Although we cannot be really sure that a specific method does not distort the actual
information contained in the data set, we have some confidence that the bias introduced
is not significant, since this is considered to be one of the best gap filling methods.
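Schneider’s regularized EM algorithm alternates between estimating the statistics of the data and re-filling the gaps with regularized regression estimates. The following is a heavily simplified, unregularized sketch of that iterate-and-refill idea, for illustration only (the real algorithm uses ridge regularization and mean/covariance estimates; all names here are ours):

```python
import numpy as np

def iterative_regression_impute(data, n_iter=20):
    """Simplified EM-style gap filling: initialise gaps with column means,
    then repeatedly re-estimate each gap by a least-squares regression of
    its variable on all the other variables."""
    x = data.copy()
    mask = np.isnan(x)
    col_means = np.nanmean(data, axis=0)
    x[mask] = np.take(col_means, np.where(mask)[1])  # crude initialisation
    for _ in range(n_iter):
        for j in range(x.shape[1]):
            rows = mask[:, j]
            if not rows.any():
                continue  # nothing missing in this variable
            others = np.delete(x, j, axis=1)
            a = np.column_stack([np.ones(len(x)), others])
            # fit column j on the other columns using only observed rows
            coef, *_ = np.linalg.lstsq(a[~rows], x[~rows, j], rcond=None)
            x[rows, j] = a[rows] @ coef  # update the gap estimates
    return x

# Two perfectly correlated columns: the gap is recovered almost exactly.
d = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, np.nan], [4.0, 8.0]])
filled = iterative_regression_impute(d)
print(round(float(filled[2, 1]), 3))  # -> 6.0
```

The sketch shows why the method relies on the variables being strongly related: the quality of each fill is exactly the quality of the regression of the gapped variable on the others.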
Another important issue was the presence of noise. Better models can be found if we
filter out noise in a suitable way. Smoothing the data and trying to fit models using the
smoothed data is an option that should be tried.
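An n-point moving average of the kind used for smoothing can be sketched as follows (illustrative code, not the thesis’s actual implementation; here the first n-1 points are averaged over the values available so far):

```python
def moving_average(values, n):
    """Simple n-point moving average over a sequence of numbers."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]  # last n points seen so far
        out.append(sum(window) / len(window))
    return out

# Smoothing damps an isolated noisy spike:
print(moving_average([0.0, 0.0, 9.0, 0.0, 0.0], 3))  # -> [0.0, 0.0, 3.0, 3.0, 3.0]
```

The choice of n trades noise suppression against blurring of real short-term structure, which is why several window sizes (3, 5 and 10 points) are compared later.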
In this chapter we will discuss both issues. Both are data cleaning tasks, quite im-
portant in data preprocessing. First we will talk about the nature of the missing value
problem in general and in the data sets of this study specifically. We will next briefly
examine some commonly used methods for handling missing values and their appli-
cability to our problem. Next we will give a short description of the regularized EM
algorithm that was actually used to get the complete data sets used for the rest of this
project and discuss its suitability to these data sets. Finally we will talk about
filtering out the noise from the data.
4.1 The missing value problem
The presence of missing values is a very important issue in data preprocessing. There
are not many modelling tools that can deal with missing values (decision trees can deal
with missing values for example), therefore handling missing values is an essential
preprocessing task if a model that cannot directly handle missing values is to be used.
Handling missing values is a very tricky problem though. The modeller must have
some confidence that the method used does not significantly distort the information
contained in the observed values, i.e. that it does not add patterns to the completed
data set that are not present in the incomplete data set. We want just to maintain the
information contained in the incomplete data set without adding any artificial
information generated by the method used. If an inappropriate method is used, then
the results obtained from subsequent processing steps will be biased by this method
and they will not be reliable. Obviously this is not a desirable situation, but it turns out
that it is very difficult to be sure about the suitability of a specific method.
When coping with a missing value problem we have to consider some basic things.
First, we have to ask whether the data is missing at random (MAR), i.e. whether there
is a systematic reason for the absence. For example, if in a data set collected from
job applications the previous employer field is left empty, we would probably not
consider this to be missing at random: there is probably some reason for the absence
and it conveys information. If the data is MAR, then there is no information in the
absence itself and we can fill in the missing values with estimates based on the
observed data without losing any information. In our case, the
Chapter 4. Data cleaning 15
data is MAR: there is no hidden information in the absence of the data, since the
reason for the absence is the unavailability of physical measurements. This holds for
both data sets. Thus, we can fill in the missing values with estimates.
At this point, we also have to ask whether the missing variables are related to the
ones that are not missing. If they are not, we cannot predict the missing values from
the others in a consistent way: the information is really missing. If, on the other
hand, they are related, there are methods that can be used, which will be described
in the next section. In our case we believe that there are strong relations between
the variables and we can therefore model the missing values based on the variables
that are present.
The previous point does not hold if the data is periodic, in which case missing values
can be filled based on previous values of the same variable rather than on other
variables. Even then, if the gaps are big and there is some noise, prediction becomes
very difficult and the estimates of missing values are likely to be inaccurate. In both
data sets, although most variables present some periodic patterns, this approach would
not be suitable because we have quite big gaps. Another thing we have to consider
when dealing with a missing value problem is the cost of the possible solution. There
is a trade-off between the time required to compute estimates for the missing values
and the quality of those estimates, and we have to make a compromise between the two.
The cost of the method used is determined by a variety of factors, some of which are
examined below.
First is the size of the data set. This can be very important when choosing a method to
handle missing values. Not all gap filling algorithms scale well to big data sets:
the time needed by some algorithms grows very fast with the size of the data set, and
some may even be totally infeasible for large data sets. The Griffin data set is quite
big, so ideally we would like a method that is not too expensive for a large data set
but still produces sufficiently good results. The situation is different for the
Harwood data set: it is many times smaller than Griffin, so an expensive but accurate
method could be used more easily.
Another issue, related to the cost of the method used, is the number of missing values
and the number of different patterns of missing values appearing in the data set. If the
number of missing values is not that big, we could use, even for a big data set, a
simpler but possibly less accurate method rather than a more sophisticated one; in
such a case, further processing of the data would probably not be largely affected.
Of course, if the data set is small and at the same time there are not many missing
values, a sophisticated but expensive method can probably be used. In the Griffin data
set, initially 40% of the values were missing, and after the partial reduction this
decreased to about 15%, which is still quite a big fraction of the data set. Therefore
we should prefer a method that gives quite good estimates. The case is the same with
the Harwood data set: we have 20% missing values and need quite good estimates for
them. The number of different missing value patterns is also quite important, since it
can increase the complexity of many gap filling algorithms. This was a major problem
in some of the attempts for the Griffin data set, as will be described in the next section.
Concluding, we can say that the missing value problem in this study was not trivial for
the Griffin data set, because of the size of the data set, the number of missing values
and the absence of a regular missing value pattern. The size of the data set directly
makes the problem computationally expensive, the number of missing values requires very
good estimates and probably an expensive method, whereas the absence of a regular
missing value pattern makes many gap-filling methods too complicated. On the other
hand, the missing value problem for the Harwood data set was much simpler: although a
big part of it had missing values, it was much smaller and did not have a very big
number of missing value patterns.
4.2 Methods for handling missing values
Many methods have been used for handling missing values. Each of them has its own
properties and therefore is suitable for some problems and unsuitable for others. Below
we will present some of the most widely used methods for handling missing values and
we will discuss their suitability to our problem.
4.2.1 List-wise deletion
In this method we simply discard the datapoints that contain missing values, keeping
only complete datapoints. This is a very commonly used method, but its main
disadvantage is that it largely biases the data set if a large part of the data is
missing: it is quite probable that the discarded datapoints contain information that
is not present in the remaining ones, which can lead to inaccurate models.
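The idea can be sketched in a few lines of Python with NumPy (the array is a toy example of ours; NaN marks a missing value):

```python
import numpy as np

# Toy data set: rows are datapoints, columns are variables; NaN marks a missing value.
data = np.array([[1.0, 2.0,    3.0],
                 [4.0, np.nan, 6.0],
                 [7.0, 8.0,    9.0]])

# List-wise deletion: keep only the rows that contain no missing values.
complete_rows = ~np.isnan(data).any(axis=1)
cleaned = data[complete_rows]
```

Here only the first and third datapoints survive; the whole second datapoint is lost because of a single missing value, which illustrates how quickly information is discarded.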
In some sense we used this method when we discarded the largely empty parts of the
Griffin data set at the beginning and the end of the 6 year period, but for the reasons
mentioned in the previous chapter this was not an unreasonable decision. Applying it to
the reduced data set would not be possible, because there was not a single datapoint in
it with all values available; even if there were some, keeping only those would
probably be disastrous for the final data set. The method is also not applicable to the
Harwood data set: although there are datapoints with all 18 variables present, there
are not enough of them, and the remaining datapoints would probably not carry the
necessary details of the process we want to model.
4.2.2 Substitution with mean value
Here we simply substitute each missing value with the mean value of the specific
variable, computed from the values that are present; in the case of discrete variables,
we fill in the missing value with the most frequent value. This is also a very commonly
used method and it is very fast. Its main disadvantage is that it can introduce a large
bias to the data unless there are few missing values: the method reduces the variance
of the variables and therefore the real information they carry. In a data set with very
few missing values this could be a fast and sufficient solution, but not for our data
sets, which have 15% and 20% missing values.
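Mean substitution is a one-liner with NumPy (toy data of ours; note how the filled column means are unchanged but the variance shrinks):

```python
import numpy as np

# Toy data: NaN marks a missing value.
data = np.array([[1.0,    np.nan],
                 [3.0,    4.0],
                 [np.nan, 8.0]])

# Column means computed from the observed values only.
col_means = np.nanmean(data, axis=0)

# Substitute each missing value with the mean of its variable.
filled = np.where(np.isnan(data), col_means, data)
```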
There are a few variants of this method that give slightly better results. These
attempt to cluster the datapoints according to the variables that are present and fill
each missing value with the mean of the present values in its cluster. This is
considerably better than the simple approach, since it preserves a part of the variance of the variables
but it is not very suitable for the Griffin data set, where there is a very big
number of missing value patterns: it is very difficult to decide which variables to use
for clustering when there is no constant pattern of missing variables. We discuss
this problem below, in the context of the nearest neighbor method. On the
other hand, this modified version could be suitable for the Harwood data set.
4.2.3 Nearest neighbor method
This is another commonly used method for gap filling. When a missing value is
encountered, the algorithm searches the rest of the data for the most similar datapoint
that does not have a missing value for the variable in question and fills in the
missing value with the value of that datapoint. The similarity of two datapoints is
usually measured with the Euclidean distance, although other measures are sometimes
used. Calculation of the distance is based on the variables that are available for
both instances. If we have k features and x_i denotes the ith feature of datapoint x,
then the Euclidean distance between two datapoints x and y is given by:

d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}
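The masked distance and the fill step can be sketched in Python (a toy, O(N²) version written for illustration; the function and array names are ours, and NaN marks a missing value):

```python
import numpy as np

def masked_distance(x, y):
    """Euclidean distance over the variables observed in both datapoints."""
    both = ~np.isnan(x) & ~np.isnan(y)
    return np.sqrt(np.sum((x[both] - y[both]) ** 2))

def nn_fill(data):
    """Fill each missing value with the value of the nearest datapoint
    that has this variable present (illustrative, quadratic cost)."""
    filled = data.copy()
    for i, x in enumerate(data):
        for j in np.where(np.isnan(x))[0]:
            best, best_d = None, np.inf
            for k, y in enumerate(data):
                if k == i or np.isnan(y[j]):
                    continue  # a candidate must have variable j present
                d = masked_distance(x, y)
                if d < best_d:
                    best, best_d = y[j], d
            if best is not None:
                filled[i, j] = best
    return filled

example = np.array([[1.0, 2.0, np.nan],
                    [1.1, 2.1, 5.0],
                    [9.0, 9.0, 0.0]])
result = nn_fill(example)
```

In the toy example the first datapoint is far closer to the second than to the third, so its missing value is filled from the second datapoint.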
This method gives quite good results but is expensive for large data sets.
The approach was tried but could not be applied efficiently to the Griffin data
set because it does not have a regular missing value pattern. Most of the variables are
missing at some points, and it is hard to find a sufficiently big subset of the
datapoints that has the same present variables as the datapoint we want to fill and, at
the same time, has a value for the variable that is missing. Clearly, in most cases, if
this subset of datapoints is quite small, the filled value will be a poor estimate.
Suppose we want to fill the missing values of a datapoint, and denote by xp the
variables with present values and by xm the variables with missing values. We would
have to look for the closest neighbors of the datapoint that have not only xp present
but also some of xm (in the Griffin data set we cannot have all of them, because there
are no complete datapoints). We could consider finding the nearest neighbors based on a
subset of xp, so as to get a bigger subset of comparable datapoints, but it is still
very difficult to figure out which variables to include and which
not. Thus, this approach was abandoned for the Griffin data set when those problems
became apparent in practice. This would not be a very big problem for the Harwood
data set, but it was not tried there because of the short amount of time available: we
decided to use the regularized EM algorithm, which had already given good results for
the Griffin data set (section 4.3).
4.2.4 Regression methods
This is a class of very popular techniques in which we try to model the missing values
xm based on the observed values xp. Various regression methods are used for this
purpose, for example linear regression and artificial neural networks. Both linear
regression and artificial neural networks were tried on the Griffin data set, but they
failed for the same reason that the nearest neighbor approach failed. The difference
here is that we believed the flexibility of the models would allow us to obtain at
least roughly accurate estimates of the missing values; in particular, it was expected
that the neural network, given its large modelling capacity, would give quite good
results. The main disadvantage of these methods is that the model used for inferring
missing values is usually noise free and therefore underestimates the variance in the
data.
Again, it was difficult to find a good subset of the datapoints to use as training
examples for estimating the parameters of the models. In most cases the training
examples were too few, so it was hard to capture the real details of the relationships
between the variables and produce good estimates. With both kinds of models, the first
results contained some totally out of range estimates for the missing values, and this
approach was therefore also abandoned. Due to lack of time, it was not tried on the
Harwood data set, but we expect that it would perform better there, for the same reason
that the nearest neighbor approach would.
4.2.5 Other methods
Some other methods that have been used for dealing with the missing value problem
are: autoassociative neural networks, decision tree estimation and multiple imputation.
For more details see (Pyle, 1999), (Fujikawa, 2001).
4.3 The EM algorithm for gap filling
The EM algorithm (Little and Rubin, 1987) is a generic iterative maximum likelihood
parameter estimation algorithm. It has been used for training mixture models and
Bayesian networks, for estimating probability density function parameters in the
presence of missing values and, of course, for filling in missing data in incomplete
data sets. The algorithm differs slightly between these applications, but in all cases
it has the same basic two-step iterative structure: there is always an Expectation Step
(E-Step) and a Maximization Step (M-Step). We will describe these steps for the gap
filling algorithm specifically.
Finding the maximum likelihood parameters of the distribution of an incomplete data
set is closely related to filling in the missing values. If we assume a probability density
function for the data and we know the maximum likelihood estimates of the parameters
of the function we can fill in the missing values based only on the observed ones. The
filled in value is the conditional expectation value. This means that we fill the missing
values with the expected values according to the estimated distribution and the values
that are observed for each datapoint. Filling the missing values using the estimated
parameters of the probability density function with the conditional expectation values
is the E-Step of the EM algorithm. On the other hand, when we have missing values,
estimating the parameters by simply ignoring the missing values gives very poor, biased
results, because a large part of the information available in the data is ignored. For
a completed data set, though, obtaining parameter estimates is straightforward. This is
done in the M-Step of the algorithm: we get new estimates for
the parameters using the filled data set. We then use those new estimates again to fill
the missing values (we repeat the E-Step) and then again we reestimate the parame-
ters and so on. We stop when either the filled in values or the estimated parameters
don’t change much from one iteration to the next. This iterative procedure will give
good maximum likelihood estimates of the parameters and missing values. One iteration
would not be enough; we have to repeat the steps until the algorithm converges.
Summarising, the EM algorithm for gap filling consists of the following two steps:
1. Fill the missing values with their conditional expectation values based on the
observed values and using estimated distribution parameters (E-Step)
2. Reestimate the distribution parameters based on the filled in values (M-Step)
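The two-step structure can be sketched as a generic loop. For brevity, this toy version of ours assumes an independent Gaussian per variable, so the E-Step fills each missing value with the current estimate of its variable's mean; the full algorithm described below uses the conditional expectation given the observed variables instead:

```python
import numpy as np

def em_fill(data, max_iter=50, tol=1e-6):
    """Toy EM gap filler assuming each variable is an independent Gaussian."""
    mu = np.nanmean(data, axis=0)  # initial guess from the observed values
    for _ in range(max_iter):
        # E-Step: fill missing entries with their expected values (here, the means).
        filled = np.where(np.isnan(data), mu, data)
        # M-Step: re-estimate the parameters from the completed data set.
        new_mu = filled.mean(axis=0)
        if np.max(np.abs(new_mu - mu)) < tol:  # stop when the estimates settle
            break
        mu = new_mu
    return filled, mu
```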
It is important to note that if we had the maximum likelihood estimates of the
parameters of the distribution, it would be sufficient to compute the conditional
expectation values for the missing values of each datapoint once. Since we do not know
these estimates, and simply ignoring the missing values when computing them would give
biased results, we have to go through the iterative procedure of EM to obtain them.
The choice of the probability density function is probably quite arbitrary but usually
the Gaussian is used. This is reasonable for many cases. Of course, other probability
density functions can be used. In this study, the multivariate Gaussian was used:

p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)
The multivariate Gaussian has two parameters, the covariance matrix Σ and the mean
vector µ. If we have a complete data set, the maximum likelihood estimates of the
parameters of the distribution are:

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^T (x_i - \mu)

where N is the number of datapoints.
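For a complete data set these estimators are straightforward to compute; a sketch with NumPy, with datapoints as rows of a toy matrix:

```python
import numpy as np

# Complete toy data set: N datapoints as rows.
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])
N = X.shape[0]

mu = X.mean(axis=0)                # maximum likelihood mean vector
centered = X - mu
Sigma = centered.T @ centered / N  # maximum likelihood covariance (1/N, biased)
```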
Having roughly examined the basic idea of the EM algorithm for gap filling and for
distribution parameter estimation with missing data, it is time to see in detail how
the steps of the EM algorithm work. We will examine the case of the Gaussian
distribution.
We start with an initial guess for the covariance matrix and the mean vector. Us-
ing these we go through every datapoint that has missing values and we compute the
conditional expectation values for the missing values. If for a specific datapoint the
present values are x_p and the missing values are x_m, then we look for the values that
maximize p(x_m \mid x_p, \mu, \Sigma). The estimates of the missing values are computed
from:

\hat{x}_m = \mu_m + (x_p - \mu_p) B

where \mu_m is the part of the mean vector corresponding to the variables that are
missing and \mu_p is the part corresponding to the variables that are present. B is the
matrix of estimated regression coefficients:

B = \Sigma_{pp}^{-1} \Sigma_{pm}
Using the formulas above we can fill in the missing values. The next step is to
re-estimate the covariance matrix and the mean vector. Re-estimating the mean vector is
straightforward: we just use the maximum likelihood estimator shown previously. For the
covariance matrix, though, we have to be more careful. If we simply recompute it with
the maximum likelihood estimator, we underestimate it. This is similar to the situation
we encountered when we discussed the regression based gap filling methods: the
regression function is virtually noise free, which can lead to bad estimates of the
real values and hence to underestimation of the covariance matrix.
For this reason, we need an estimate of the residual covariance matrix C:

C = \Sigma_{mm} - \Sigma_{mp} \Sigma_{pp}^{-1} \Sigma_{pm}

where \Sigma_{mm}, \Sigma_{mp}, \Sigma_{pm} and \Sigma_{pp} are the partitions of the
estimated covariance matrix consisting of the rows and columns corresponding to the
missing and present variables of the datapoint. This term has to be included in the new
estimate of the covariance matrix.
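With assumed parameter estimates, the conditional-mean fill and the residual covariance can be checked numerically (all numbers here are made up for illustration; np.ix_ selects the partitions of Σ):

```python
import numpy as np

# Assumed current parameter estimates for a 3-variable Gaussian (illustrative values).
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.8, 0.4],
                  [0.8, 1.0, 0.2],
                  [0.4, 0.2, 1.5]])

# One datapoint with variable 2 missing; variables 0 and 1 are present.
p = np.array([0, 1])       # indices of present variables
m = np.array([2])          # indices of missing variables
x_p = np.array([1.0, 2.0])

# Regression coefficients B = Sigma_pp^{-1} Sigma_pm.
B = np.linalg.solve(Sigma[np.ix_(p, p)], Sigma[np.ix_(p, m)])

# Conditional expectation of the missing values: x_m = mu_m + (x_p - mu_p) B.
x_m = mu[m] + (x_p - mu[p]) @ B

# Residual covariance C = Sigma_mm - Sigma_mp Sigma_pp^{-1} Sigma_pm.
C = Sigma[np.ix_(m, m)] - Sigma[np.ix_(m, p)] @ B
```

Note that C is smaller than the marginal variance of the missing variable, which is exactly the underestimation the correction term compensates for.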
When we get the new estimates of the covariance matrix and the mean vector, we repeat
the E-Step and then the M-Step again, until between two subsequent iterations neither
the parameters nor the filled-in values change much.
In this project, a variation of the simple EM algorithm for missing data was used. That
is the regularized EM algorithm. The only difference between this and the simple al-
gorithm is that here the regression coefficients are computed in a slightly different way.
More specifically, the regression coefficients are now:

B = \left(\Sigma_{pp} + h^2 \,\mathrm{Diag}(\Sigma_{pp})\right)^{-1} \Sigma_{pm}

That is, instead of using \Sigma_{pp}^{-1} we use a regularized version of the same matrix. This
method is known as ridge regression and can result in models with better generalization
performance. The regularization parameter h2 controls how smooth the function will be;
given that an appropriate value is found, we get better predictions and better
generalization. The regularization parameter is determined using generalized
cross-validation. More details about generalized cross validation and the regularized
EM algorithm can be found in (Schneider, 2001). Further information about the EM
algorithm and its use in gap filling problems can be found in (Little and Rubin, 1987).
For running the regularized EM algorithm, some Matlab functions written by the au-
thor of (Schneider, 2001) were used.
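The only change the regularization introduces is in how B is computed; a sketch of ours, where h2 is assumed fixed, whereas in the actual algorithm it is chosen by generalized cross-validation:

```python
import numpy as np

def regression_coefficients(Sigma_pp, Sigma_pm, h2=0.0):
    """Regression coefficients of the missing on the present variables.

    h2 = 0 gives the plain EM update B = Sigma_pp^{-1} Sigma_pm;
    h2 > 0 gives the ridge-regularized version of the regularized EM.
    """
    ridge = Sigma_pp + h2 * np.diag(np.diag(Sigma_pp))
    return np.linalg.solve(ridge, Sigma_pm)
```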
4.4 Suitability of the EM algorithm to our problem
As stated earlier, it is not easy to be sure about the results of a method for handling the
missing values. We have examined a few methods, we discussed some arguments for
and against each of them and we finally decided to use the regularized EM algorithm
that was described in the previous section. The EM algorithm as we have described it
works very well when not too large a part of the data is missing, the data is MAR
and it is reasonable to assume that the data follow a simple Gaussian distribution. In
our case the fraction of missing values is indeed not too big and the data is MAR, but
we are not really sure that the data can be adequately described by a Gaussian
distribution. Even so, if the data cannot be described very well by a Gaussian, we can
expect that the estimated values will not be totally irrelevant and that they will at
least maintain some part of the information about the relationships between the variables.
We can gain some confidence that EM is adequate for our problem if the basic statistics
of the filled-in variables are not drastically altered compared to those prior to gap
filling, no totally out of range values are introduced, and the behaviour of each
variable through time is not largely affected. For both of the data sets used in
Figure 4.1: Filled parts for the Fc and the mUstr variables of the Griffin data set.
The variables maintain a behaviour in the filled part similar to the one that is
observed in the present part.
this study, the filled data sets seemed to be quite consistent with the ones prior to
gap filling. The basic statistics were not largely affected and there are no out of
range values. The variables also seem to maintain a regular behaviour through time, as
can be seen in figure 4.1, where the continuous line shows the observed values of the
variable and the dotted line the filled values.
Similar things can be seen in Figure 4.2 for two variables of the Harwood data set. It
is worth noting, though, that in a few cases similar plots showed fillings that were
too smooth, or fillings where the variable had somewhat smaller variance than indicated
by the rest of the plot. In general, though, the variables have maintained some
regularity and no out of range values seem to appear, so we can consider that both data
sets have been filled in a quite consistent way.
Figure 4.2: Filled parts for the tFc and the tMeanU variables of the Harwood data set.
The variables maintain a behaviour in the filled part similar to the one that is observed
in the present part
4.5 Noise / smoothing
Having finished with the missing value problem, we will now discuss the problem of
noise. We argued earlier that the data contain noise, because they come from physical
measurements that are prone to it. Since we want to model the clean data generating
mechanism and not the noise, it is desirable to reduce the presence of noise.
A quick and easy method that seems to work well in most cases is moving averages.
Given a sequence of N points, the n-step moving average of this sequence is another
sequence of N − n + 1 points, each computed as the mean of n consecutive points in the
original sequence. If a_j denotes the jth point of the original series and s_i denotes
the ith point of the new series, then:

s_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j
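The n-step moving average is a one-liner with NumPy's convolution (a sketch; 'valid' mode returns exactly the N − n + 1 points of the definition above):

```python
import numpy as np

def moving_average(a, n):
    """n-step moving average: each output point is the mean of n consecutive inputs."""
    return np.convolve(a, np.ones(n) / n, mode='valid')
```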
This is obviously a very simple method, yet it can be very effective at filtering out
noise. The choice of n is a little arbitrary though: we want smoothed variables, but we
also want to maintain their general trend. If n is too small, noise is probably not
filtered out very well; if n is too big, the general trend of the variables is lost.
The only way to determine n is to experiment.
Let’s check this in practice for the case of the Harwood data set. In Figure 4.3 we can
see the variable tmvpd with no smoothing, 3 points, 5 points and 10 points averaging.
We could certainly say that the unsmoothed variable looks a little rough, and this
roughness could easily be interpreted as noise on the general shape of the variable,
which is what we would ideally like to maintain. The smoothed versions look much
better. Three-point averaging still looks a bit rough, five-point moving averages look
quite good, and ten-point moving averages look a little too smooth: some details of the
general shape of the variable seem to be lost. Similar results were obtained for the
other variables examined, in both data sets.
Unfortunately, quantifying the level of noise is very difficult, so determining n, or
choosing a specific method for filtering noise, is very hard. We cannot be sure about
possible loss of information; that would only be possible if we had a sample of the
clean data, so that we could separate the real trend from the noise.
Figure 4.3: Not smoothed, 3 points, 5 points and 10 points moving average for the first
300 points of the variable tmvpd of the Harwood data set
In any case, considering that we ultimately want to unveil the underlying process and
not to model noise, we will use smoothed data, mainly the five-point averaged data,
since this seems to preserve most of the essential information. On the other hand, it
is possible that the smoothing process not only filtered out noise but also caused real
loss of information, so we will also keep using the unsmoothed data for comparison.
Chapter 5
Dimensionality reduction
The main issue with the Griffin data set was its size, which made it very expensive to
process. It was not only the roughly 60000 datapoints (initially 100000) it contained,
but also its high dimensionality, the large number of variables it had. As described
earlier, there were originally 310 variables, of which it was finally decided to use
281: 280 input variables and 1 target variable. This made difficult not only the
missing value problem, as described in the previous chapter, but also further use of
the data set for modelling. The data set is very big for a sophisticated machine
learning algorithm: trying, for instance, to train an MLP with 60000 examples and 280
input variables would be very expensive, and running the model many times to optimize
its parameters, for example the number of hidden units, would not be feasible.
In addition, the complete set of 280 input variables would probably be a poor set of
predictors: if there are irrelevant features in it, it is quite possible that they will
add noise to the final prediction. The learning algorithm might find false
regularities, leading to poor models; this is an overfitting effect and is not
desirable. We also want models that are as simple as possible, so that they are easily
interpretable. For these reasons, some dimensionality reduction techniques had to be
used. A few methods were tried, under a few assumptions that will be described later,
and different subsets of features were obtained. For the Harwood data set the
motivation for dimensionality reduction was different: it was not mainly the cost of
the machine learning algorithm, but the need to find the best set of predictors.
In this chapter we will first briefly discuss the dimensionality reduction and feature
selection problem. Next we will examine the methods that were used in this study and,
finally, we will present the results obtained with them.
5.1 The dimensionality reduction problem
When dealing with high-dimensional data sets, some dimensionality reduction technique
usually has to be used. The reason is that a big data set can be very expensive to use
with a sophisticated modelling technique; in addition, it is quite probable that at
least some variables of the initial set are not relevant to the problem and should be
ignored.
Many methods have been used for reducing the number of features of high dimensional
data sets. These can be divided in two main categories: feature extraction and feature
selection methods. Feature extraction methods create new features from the existing
ones. Principal Components Analysis (PCA) and Locally Linear Embedding (LLE)
are examples of feature extraction methods. These methods make transformations and
recombinations of the original features to produce new ones. For example, PCA takes a
linear combination of the original variables. The number of the new features is smaller
than the original but hopefully they preserve the necessary information for subsequent
processing. The disadvantage of those methods is that the meaning of the original
features is usually lost, and models built from automatically extracted features are
hard to interpret. It is also worth noting that feature extraction methods are not
always able to cope with the problem of the relevance of features. In this study,
losing the meaning of the features is not desirable, because our original aim is to
gain insight into the carbon uptake process and how various factors affect it; with
extracted features it would be difficult to make a statement such as 'Vapour Pressure
Deficit has an important effect on the process'.
The other main category of dimensionality reduction methods, feature selection methods,
choose a subset of the available variables. The meaning of the variables is not lost,
so the resulting models can be interpreted more easily. We want to find the subset of
features with the best predictive accuracy for the model that will be used; by
predictive accuracy we mean the generalization accuracy, the performance of the model
on unseen data. As discussed earlier, the best set of features for predicting the
target variable is probably not the full set: the full set probably contains variables
irrelevant to the task, which add noise to the final prediction, and we need only the
variables that really affect the target variable. Choosing a subset of the variables is
not an easy task though. In a data set with n dimensions there are 2^n − 1 possible
subsets of features. Supposing that we have defined a measure of the predictive
accuracy of a subset of features, it is obvious that with a big number of features it
will be very difficult to give a score to every possible subset and then pick the best.
Therefore,
feature selection can be seen as a big search problem. We want to find the subset that
has the best generalization score but we can’t consider all the combinations. Thus, for
high dimensional data sets, where exhaustive search is not possible, various heuristic
search techniques are used.
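As an illustration of one such heuristic, greedy forward selection can be sketched as follows (a sketch of ours; the score function, some estimate of the predictive accuracy of a feature subset, is assumed to be given):

```python
def forward_selection(n_features, score):
    """Greedily add, at each step, the feature that most improves the subset score."""
    selected = []
    best_score = float('-inf')
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        new_score = score(selected + [f_best])
        if new_score <= best_score:
            break  # no remaining candidate improves the score: stop
        selected.append(f_best)
        best_score = new_score
    return selected
```

Instead of scoring all 2^n − 1 subsets, this examines at most n subsets per step, at the price of possibly missing the globally best subset.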
Independently of the search algorithm used, feature selection methods can be divided
in two groups. One is the wrapper approach and the other is the filter approach. In the
wrapper approach we evaluate each subset based on a direct estimate of its predictive
accuracy using the model that we ultimately intend to use. The estimate is usually
obtained with k-fold cross validation: we split the data set into k parts and then, for
each of the k parts in turn, train the model on the other k − 1 parts and test it on
the remaining part. This gives a good estimate of the real error of the tried model,
and it can help in avoiding overfitting and ultimately in detecting irrelevant
features. Therefore, a wrapper
algorithm takes into account this information and guided from a search procedure it
looks for the subset that has the best generalization error estimate. Apparently, this
can be very expensive, depending on the size of the data set and the complexity of
the model that is to be optimized. Wrapper methods are usually suitable for not too
big problems or when a relatively large computational cost is not very important. For
larger data sets, usually a filter algorithm is prefered. Filter algorithms for feature se-
lection choose subsets of features based on general statistics of the data and therefore
don’t take into account the model that will be actually used. Apparently, we can expect
better results from a wrapper algorithm, whereas a filter algorithm is cheaper but will
probably not give an optimal solution.
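The k-fold estimate just described is easy to sketch. The thesis work was done in Matlab; the following Python fragment is only an illustrative re-implementation, where `fit` and `predict` are placeholders for whatever model is being wrapped:

```python
import numpy as np

def kfold_mse(X, y, fit, predict, k=10, seed=0):
    """Estimate the generalization MSE of a model by k-fold cross validation.

    fit(Xtr, ytr) returns a fitted model; predict(model, Xte) returns
    predictions. Both are placeholders for the model being evaluated.
    """
    idx = np.random.RandomState(seed).permutation(len(y))
    folds = np.array_split(idx, k)          # split the data set into k parts
    errors = []
    for i in range(k):
        test = folds[i]                     # one part for testing ...
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])     # ... the other k-1 for training
        pred = predict(model, X[test])
        errors.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(errors))
```

A wrapper method would call this estimate once per candidate subset, which is what makes it expensive.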
More details about the subject of irrelevant features and the feature selection problem
Chapter 5. Dimensionality reduction 32
can be found in (John et al., 1994) and (Langley, 1994). In this study, some variants
of a semi-wrapper approach and a filter algorithm were tried.
5.2 The semi-wrapper methods
As we said, the wrapper approach can be quite expensive, depending on the size of the
data set and the complexity of the learning algorithm that will be used. Considering
that we initially had to deal with the Griffin data set and that we wanted to use
ANNs and SVR for modelling, the direct wrapper approach looked rather infeasible.
Therefore we decided to use a slightly different method: we use a small and cheap
model to estimate the predictive capacity of a subset of features, in the hope that the
cross-validation error of a simple model will indicate the suitability of a set of features
for modelling with a more sophisticated and expensive model. The cheaper model was
Multiple Linear Regression (MLR), multivariate least squares fitting, which can be
performed very fast (section 6.1.1). This could be considered a semi-wrapper approach,
since it uses actual model fitting but not the model that will ultimately be applied to
the data. One could argue that this is not a suitable solution to the problem, and in
some sense this is right: it is very likely that the best model is non-linear, and we
cannot know the nature of this non-linearity without exploring the model space to a
great extent. The semi-wrapper approach is not the optimal solution; that is clearly the
pure wrapper approach, but when we face computational challenges as big as the ones
posed by the Griffin data set, something feasible and reasonable for the problem has to
be found. This solution is in fact suggested in (Bishop, 1995) and turns out to be a
good idea. We can hope that the target variable depends on the best set of predictors
in an approximately linear way, so that a linear model gives a sufficiently good
approximation. Furthermore, by later applying the non-linear and more complex model
to the selected subset of features, we would expect an even better fit.
The difference between this approach and the pure wrapper approach lies only in the
way that we evaluate the subsets; the rest is the same, and we still need a search
procedure. Next we describe the search procedures that we used. These are simply
heuristic ways to avoid exhaustive search. The first three are greedy search algorithms,
whereas the last two are stochastic. The greedy methods may find a sub-optimal
solution, since they can stop at a local optimum of the search space; the stochastic
methods can avoid local optima, but they also cannot guarantee a sufficiently good
solution. All of them were implemented in Matlab, and in all of them 10-fold cross
validation with MLR was used to assess the generalization error. In what follows, it
will be handy to use binary strings to represent possible subsets, where a bit in the
kth position denotes the presence or absence of the kth feature in the subset that the
string represents.
5.2.1 Forward selection
Forward selection is one of the simplest and most commonly used search methods for
feature selection with the wrapper approach (and it is equally suitable for our semi-
wrapper approach). We start with an empty set of features and iteratively add the one
that most decreases the estimated generalization error. We can stop either when a
predetermined maximum number of features has been selected, or when no variable can
be added to the subset that improves it; alternatively, we could stop when the error
cannot decrease enough by adding any variable. In our experiments, having no prior
knowledge about how many features are really important predictors of the target
variable, we simply stopped when no feature could be added to the set of selected
features that decreased the estimate of the generalization error. Using the binary
string representation, the forward selection algorithm is described in Table 5.1.
The main disadvantage of forward selection is that it is unable to detect strong de-
pendencies between sets of variables. For example, it is possible that two variables are
highly predictive together even though neither is a good predictor alone; forward
selection will have difficulty detecting such a situation.
1. Initialize the current solution string to contain only zeros and set the current
solution error to a large number.
2. For every zero bit in the current solution string, generate a new string, turn-
ing that bit to one.
3. Calculate the cross-validation error for the subsets that the generated strings
represent.
4. If the termination criterion is satisfied then stop else update the current so-
lution string with the best of the generated strings, set the current solution
error to the estimate of the error of this string and go back to 2.
Table 5.1: The forward selection algorithm
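The loop of Table 5.1 might be sketched in Python as follows. This is only an illustrative re-implementation (the thesis code was in Matlab); the helper `cv_mse` stands in for the 10-fold cross-validation score, here computed with an ordinary least-squares linear model as in our semi-wrapper setting:

```python
import numpy as np

def cv_mse(X, y, k=10):
    """10-fold cross-validation MSE of an ordinary least-squares linear model."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        A = np.c_[np.ones(len(tr)), X[tr]]                  # bias column
        w = np.linalg.lstsq(A, y[tr], rcond=None)[0]
        errs.append(np.mean((np.c_[np.ones(len(te)), X[te]] @ w - y[te]) ** 2))
    return float(np.mean(errs))

def forward_selection(X, y):
    """Iteratively add the feature that most decreases the CV error."""
    selected, best_err = [], np.inf
    while True:
        candidates = [f for f in range(X.shape[1]) if f not in selected]
        if not candidates:
            break
        scores = {f: cv_mse(X[:, selected + [f]], y) for f in candidates}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best_err:      # no addition improves the error
            break
        selected.append(f_best)
        best_err = scores[f_best]
    return selected, best_err
```

The termination test corresponds to the stopping rule we used: stop as soon as no single addition lowers the estimated generalization error.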
5.2.2 Backward elimination
This is very similar to forward selection but it works the opposite way. It starts with
the full set of features and successively removes the feature whose absence results in
the best performance. Again we stop when no improvement is possible or when we
reach a specific number of variables. The algorithm is described in Table 5.2.
1. Initialize the current solution string to contain only ones and set the current
solution error to a large number.
2. For every one bit in the current solution string, generate a new string from
the previous one, turning that bit to zero.
3. Calculate the cross-validation error for the subsets that the generated strings
represent.
4. If the termination criterion is satisfied then stop else update the current so-
lution string with the best of the generated strings, set the current solution
error to the estimate of the error of this string and go back to 2.
Table 5.2: The backward elimination algorithm
1. Initialize the current solution string (with zeros, ones or randomly), evaluate
it and set the current solution error to this.
2. For every bit in the current solution string, generate a new string, flipping
that bit.
3. Calculate the cross-validation error for the subsets that the generated strings
represent.
4. If the termination criterion is satisfied then stop else update the current so-
lution string with the best of the generated strings, set the current solution
error to the estimate of the error of this string and go back to 2.
Table 5.3: The best ascent hill climbing algorithm
Backward elimination does not suffer from forward selection's problem to the same
extent. It is, however, more expensive, because it starts from the full set of features
and therefore begins by examining models with a large number of input variables,
which are harder to train. Although this is very serious in the wrapper approach, in
the semi-wrapper approach it is not a big problem if the data set is not too large.
5.2.3 Best ascent hill climbing
This is a combination of the previous two. At every step it adds or removes the attribute
whose addition or removal results in the smallest error. It is more flexible and searches
the space of solutions a little better than the previous two, which are very strict. The
initial point of the search can be a random subset of the variables, the empty set (so
that we can compare it with forward selection) or the full set (so that we can compare
it with backward elimination). The termination criteria can be the same as for forward
selection or backward elimination. The algorithm is presented in Table 5.3.
This is a little more expensive than forward selection but not necessarily as expensive
as backward elimination. It is also worth noting that in highly complex error spaces,
when we start from different random initial strings, we may get different solutions.
This doesn’t happen with forward selection or backward elimination because the initial
point is always the same. This shows how difficult it really is to do optimal feature
selection in some cases.
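The best ascent procedure of Table 5.3 can be sketched compactly. Again this is illustrative Python rather than the Matlab code actually used; `evaluate` is any function returning the cross-validation error of the subset encoded by a boolean mask:

```python
import numpy as np

def hill_climb(evaluate, d, start=None, seed=0):
    """Best ascent hill climbing over binary feature strings.

    evaluate(mask) returns the error of the subset encoded by the boolean
    mask. At each step the single bit flip that lowers the error most is
    taken; the search stops at a local optimum.
    """
    rng = np.random.RandomState(seed)
    mask = (rng.rand(d) < 0.5) if start is None else np.asarray(start, bool)
    best = evaluate(mask)
    while True:
        flip, improved = None, False
        for bit in range(d):                 # try flipping every bit
            trial = mask.copy()
            trial[bit] = ~trial[bit]
            err = evaluate(trial)
            if err < best:
                best, flip, improved = err, bit, True
        if not improved:                     # local optimum reached
            return mask, best
        mask[flip] = ~mask[flip]             # take the best single flip
```

Starting from the all-zeros or all-ones mask reproduces the forward-selection-like and backward-elimination-like variants mentioned above.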
5.2.4 Genetic search
Genetic algorithms are a population-based technique with the ability to search ef-
ficiently in very large spaces. Considering that feature selection is basically a big
search problem, genetic algorithms looked like an interesting approach. They maintain
a population of candidate solutions that evolve through crossover and mutation,
imitating natural evolution. There are many varieties of genetic algorithms and we will
not go into explicit detail on how they work; we will just discuss their application to
our specific problem, from which the basic idea should become clear. The outline of
the genetic algorithm used in this study can be seen in Table 5.4.
The basic structure should be quite clear. The population evolves using mutation and
crossover, while the selection process reinforces good chromosomes. Hopefully in the
end the population contains a quite good solution.
We should explain some of our choices for the genetic algorithm. First of all, we
limited the length of the segment exchanged between the binary chromosomes. This
was done to avoid the potentially large disruptive effect of crossover: good combi-
nations of features exist that can be recombined into even better sets, and without a
limit on the length of the exchanged segment, crossover would probably destroy such
combinations. With the limit, we can hope that disruption will be less frequent while
the exchange of useful chromosome bits remains possible.
Another important issue is the choice of fitness function and the selection mechanism.
Here as the fitness function we used just the Mean Squared Error (MSE) obtained
from k-fold cross validation. The selection mechanism was tournament selection. This
results in maintaining the variety in the population. If some other selection mechanism
was used, it would be possible to have much less variety in the chromosomes and this
would reduce the explorative effect of crossover. On the other hand, we believe that
1. Generate a random population of binary strings of size n.
2. Evaluate the population, calculate the generalization error of the chromo-
somes (fitness function).
3. Pass the best k of them in the next population immediately.
4. Generate an intermediate population of size n − k by selecting strings based
on their generalization error, using tournament selection.
5. Apply two point crossover to the strings of the intermediate population: ex-
change parts from pairs of chromosomes. The length of the exchanged part
is limited.
6. Apply mutation to the strings of the intermediate population: randomly flip
some of the bits of the intermediate population.
7. Pass the strings of the intermediate population to the next population.
8. If the maximum number of iterations has been reached then terminate, else
go to step 2
Table 5.4: A genetic algorithm for feature selection
there are a few very good combinations of features, and we should therefore actively
reinforce solutions close to them. For this reason we use elitism, i.e. we always keep
the best solutions from generation to generation, which helps in that direction. Given
that we want a reasonable balance between exploitation of good solutions already
found and exploration of new ones, this looks like a quite reasonable decision. Of
course, we could opt for more exploitation (possibly using a more aggressive selection
mechanism), on the grounds that there are not many good combinations of features
and we should mainly explore solutions close to them. The ideal appears to be a
balance between searching close to good solutions already found and searching for
new ones; this is probably worth experimenting with.
For a better description of basic concepts mentioned here see (Mitchell, 1996).
The genetic algorithm approach is clearly more expensive than the previous methods
but it can avoid local minima and therefore can give solutions that are very hard to find
with a greedy method.
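The outline of Table 5.4 could be sketched as follows. This is an illustrative Python re-implementation, not the Matlab code used for the experiments; the default parameter values (population size, tournament size, segment length, mutation rate) are examples rather than the exact settings used:

```python
import numpy as np

def genetic_search(evaluate, d, n=40, elite=2, gens=50, seg_max=4,
                   p_mut=0.02, tour=3, seed=0):
    """Genetic algorithm over binary feature strings (minimises evaluate)."""
    rng = np.random.RandomState(seed)
    pop = rng.rand(n, d) < 0.5                       # random initial population
    for _ in range(gens):
        errs = np.array([evaluate(m) for m in pop])
        order = np.argsort(errs)
        nxt = [pop[i].copy() for i in order[:elite]] # elitism: keep the best
        while len(nxt) < n:
            parents = []
            for _ in range(2):                       # tournament selection
                rivals = rng.choice(n, tour, replace=False)
                parents.append(pop[rivals[np.argmin(errs[rivals])]].copy())
            a, b = parents
            start = rng.randint(d)                   # limited-length two-point
            end = min(start + rng.randint(1, seg_max + 1), d)
            a[start:end], b[start:end] = b[start:end].copy(), a[start:end].copy()
            for child in (a, b):
                flips = rng.rand(d) < p_mut          # bit-flip mutation
                child[flips] = ~child[flips]
                if len(nxt) < n:
                    nxt.append(child)
        pop = np.array(nxt)
    errs = np.array([evaluate(m) for m in pop])
    i = int(np.argmin(errs))
    return pop[i], float(errs[i])
```

The capped `seg_max` implements the limited-length two-point crossover discussed above, and the `elite` copies implement elitism.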
There are also a couple of different approaches to using genetic algorithms for feature
selection that were not tried here. In one of them, the problem is treated as a dual-
criteria optimization problem: one criterion is the fit of the model to the data,
measured by the explained variance, and the other is the size of the model (number of
variables). It uses a (µ, λ) population management method, which will also be described
in the next section: when choosing the µ-sized population it uses one criterion, and
when it chooses the λ-sized population it uses the other. More details can be found in
(Wallet et al., 1996).
Another approach can be found in (Yang and Honavar, 1999). This is quite similar to
our method, but it has a different fitness function, uses a different selection method
and does not use limited-length two-point crossover. Its fitness function involves an
estimate of the error, obtained using a fast neural network.
5.2.5 Evolution strategies
Evolution strategies are closely related to genetic algorithms; both are evolutionary,
population-based techniques. The difference is that evolution strategies do not use
crossover. The basic algorithm
1. Generate a random population of binary strings of size λ.
2. Evaluate population chromosomes, calculate their generalization error.
3. Shrink the population, keeping only the µ fittest chromosomes.
4. Expand the population, creating random mutations of chromosomes chosen from
the µ-sized population based on their fitness, to obtain a new λ-sized
population.
5. If the maximum number of iterations has been reached then terminate else go to
step 2.
Table 5.5: Evolutionary strategy for feature selection
is presented in Table 5.5.
Something important to note here is that we select the chromosomes of the µ popu-
lation from the whole λ population, not only from the children of the previous µ
population. This is called (µ + λ) selection and has an effect similar to that of
elitism in genetic algorithms. Considering what we said earlier, when discussing
genetic algorithms, about exploiting good solutions more, this looks like a reasonable
choice. The alternative would be (µ, λ) selection, in which the members of the µ
population are selected only from the children of the λ population.
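A minimal sketch of the (µ + λ) strategy of Table 5.5, again as illustrative Python rather than the Matlab implementation, with example parameter values; for simplicity this sketch draws parents uniformly from the µ fittest:

```python
import numpy as np

def evolution_strategy(evaluate, d, mu=10, lam=40, gens=50, p_mut=0.05, seed=0):
    """A (mu + lambda) evolution strategy on binary strings (no crossover)."""
    rng = np.random.RandomState(seed)
    pop = rng.rand(lam, d) < 0.5
    for _ in range(gens):
        errs = np.array([evaluate(m) for m in pop])
        parents = pop[np.argsort(errs)[:mu]]     # shrink: keep the mu fittest
        children = []
        while len(children) < lam - mu:
            child = parents[rng.randint(mu)].copy()
            flips = rng.rand(d) < p_mut          # expand by random bit flips
            child[flips] = ~child[flips]
            children.append(child)
        # (mu + lambda): parents survive alongside their children
        pop = np.vstack([parents, np.array(children)])
    errs = np.array([evaluate(m) for m in pop])
    i = int(np.argmin(errs))
    return pop[i], float(errs[i])
```

Because the µ parents are carried over into the next population, the best solution found so far can never be lost, which is the elitism-like effect noted above.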
Evolution strategies could be a suitable method for our problem. Given the fact that we
expect some good combinations of features, it would be reasonable to try random ex-
ploration around such combinations to find the real optimum. The cost of this method
is more or less the same as the cost of the genetic algorithm.
5.3 The filter method
The difference between the wrapper methods (or the semi-wrapper methods described in
the previous section) and filter methods is that wrapper methods evaluate a subset of
features using actual model fitting, whereas filters use only statistical information
about the features. Many different filter methods exist, based on information-theoretic
criteria, variance criteria, correlation criteria and much more. We tried only one
filter method, Correlation-based feature selection, which is described next.
5.3.1 Correlation-based feature selection
Correlation-based feature selection (CFS) is described in (Hall, 1999). The CFS
algorithm looks for a subset of features that are highly correlated with the target but
not highly correlated with each other. In other words, it looks for a set of features
that carry a lot of information about the value of the target but little redundant
information. For this, a basic heuristic measure to be maximized has been defined:

Merit_s = (k * r_tf) / sqrt(k + k(k − 1) * r_ff)

where Merit_s is the desirability of the subset of features s, k is the number of
features in s, r_tf is the average correlation between the subset's features and the
target variable, and r_ff is the average correlation between the subset's features
themselves. This measure increases for subsets whose features have high average
correlation with the target and low correlation with each other. For continuous
attributes, like the ones we have here, the correlation between features is computed
using Pearson's correlation:

r_ij = C(i, j) / sqrt(C(i, i) * C(j, j))

where r_ij is the correlation between features i and j, and C(i, j), C(i, i) and
C(j, j) are elements of the covariance matrix of the data: C(i, j) is the covariance of
features i and j, while C(i, i) and C(j, j) are the variances of features i and j
respectively.
In a high-dimensional data set, an evaluation of all possible subsets is still not
possible, although it now involves much simpler and cheaper calculations than the
wrapper or semi-wrapper approaches. Therefore, here too we can use any search
mechanism we think is suitable. In (Hall, 1999) best first search is proposed, but
other search strategies would be possible.
It is also worth noting that the Merit_s heuristic is very aggressive: it tends to
select only a few features. For this reason, r_ff is usually multiplied by a number
between 0 and 1. This discount factor makes the algorithm less aggressive and forces
it to select more variables.
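The merit computation, including the discount factor, can be sketched as follows (illustrative Python; the search over subsets that surrounds it is omitted):

```python
import numpy as np

def cfs_merit(X, y, subset, discount=1.0):
    """CFS merit of a feature subset: reward high feature-target correlation,
    penalise high feature-feature correlation."""
    k = len(subset)
    r_tf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for a, i in enumerate(subset) for j in subset[a + 1:]])
    else:
        r_ff = 0.0
    r_ff *= discount          # the discount factor softens the redundancy term
    return k * r_tf / np.sqrt(k + k * (k - 1) * r_ff)
```

A pair of independent informative features scores higher than a pair containing a redundant duplicate, which is exactly the behaviour the heuristic is designed to produce.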
This is clearly not the best possible solution: we argued before that filter methods
are not as good as wrapper (or semi-wrapper) methods, but they are suitable for high-
dimensional data sets. Consequently, this method was quite useful for the Griffin data
set.
5.4 Results of the feature selection methods
Here we discuss the results and behavior of the previously described feature selection
methods, for both the Griffin data set and the Harwood data set.
5.4.1 Feature selection results for the Griffin data set
Selecting the best subset of features in the Griffin data set was initially considered
a very big challenge. Most of the previously described algorithms were motivated by
the dimensionality of the Griffin data set and its huge number of possible subsets of
features. Unfortunately, it was at that point that the spurious variables in the data
set were discovered. Therefore, we will only briefly describe the general behavior of
the feature selection algorithms on the Griffin data set. We remind the reader that
the initial number of input variables was 280 and that the error is measured using
10-fold cross validation and a linear model.
Forward selection chose a subset of 95 variables, with a final MSE of only 0.053. The
error dropped fast with the addition of the first few variables and then decreased
very little (Figure 5.1). With the addition of the first variable alone, the error was
already only 0.1590. That variable was m PPFDa P13 L1, one of the spurious variables;
when plotted together with the target variable, the two were found to be almost
identical. Forward selection, even though we were using the semi-wrapper approach,
was quite expensive. We got the first few features quite fast, but the algorithm
terminated after approximately one day. The first few variables were added after trying very
Figure 5.1: MSE as forward selection adds variables for the Griffin data
simple and inexpensive models, but as more variables were added, the complexity
increased. If we had fixed the number of features to select to a small number in
advance, the cost would not have been that big.
Backward elimination was not possible for the Griffin data set. It was left running
for about twenty days and still did not finish. This is because it starts searching from
very big models (279 variables) and 10-fold cross validation is very expensive for such
models.
Best ascent hill climbing gave results quite similar to forward selection, just a
little better, and it also chose slightly smaller subsets. It was tried starting either
from a random subset of features or from the empty subset. In both cases, the same
spurious variable was chosen first (when, in the case of a random initial subset, it
was not already included). Hill climbing was also quite expensive, running for a
couple of days.
With the genetic algorithm things were a little different. It usually also found one
of the spurious variables quite fast, showing that it could obtain a sufficiently good
solution quickly. Here it became apparent that not just one but at least three
variables were almost duplicates of the target variable. It was observed that when any
of those three variables happened to be present in one of the chromosomes, further
significant improvement was very difficult and the error curve decreased very slowly
after that. Depending on the number of generations (50 in most experiments), the
number of chromosomes (usually 40) and the initial population, it could take from
about 6 hours to a couple of days.
Evolution strategies behaved quite similarly to the genetic algorithm. Again, depend-
ing on the number of generations and size of populations, it could take quite a while.
Finally, CFS selected from 1 to 7 variables, depending on the discount factor.
Contrary to the previous methods, it was not very time consuming, taking about 5
minutes to run. Without the discount factor it selected only the same m PPFDa P13 L1
variable; when we decreased the discount factor, that variable was always among those
selected.
5.4.2 Feature selection results for the Harwood data set
The feature selection problem in the Harwood data set was quite different: the search
space was much smaller, since here we had only 17 candidate input variables, and the
algorithms therefore behaved differently. We also have to note that we tried the
feature selection algorithms on data with different smoothings, so we will discuss
both cases.
Something very remarkable happened for the Harwood data set: all the algorithms
selected the same subset of features for the smoothed data, and likewise all selected
the same subset for the unsmoothed data (the two subsets themselves differ). This
makes clear that the search space is not that big or complex, so that many different
algorithms find the same solution. Let us check some details of the different
algorithms.
First of all, forward selection stopped after selecting 14 variables for the
unsmoothed data and 15 variables for the 5-point averaged data. The error curves for
these cases can be seen in Figure 5.2.
It is clear that the smoothed data can be modelled much better than the unsmoothed
data: the unsmoothed data reaches an MSE of 10.7976 and the smoothed data an MSE of
6.7951. This holds in all of our subsequent experiments as well, which is quite
reasonable, since the smoothed variable is much easier to model. Forward selection was
much faster here than on the Griffin data set, taking only a few minutes. It could be
even faster if we stopped after the first few variables, having decided that we want
only a specific number of variables.
Figure 5.2: MSE as forward selection adds variables for the not smoothed and 5 point
smoothed Harwood data
Backward elimination was possible here; the cost was not that big, and it took less
than 10 minutes. As we said, the results for both the smoothed and the unsmoothed
data were the same as for forward selection.
Hill climbing was tried a few times from random initial subsets of features, and in
all cases it again found the same solution. This suggests that the error space
probably has only one local minimum, which is the one that all the algorithms found.
The genetic algorithm was, of course, also much faster than on the Griffin data. It
was left running for about 50 generations, but after about 10-15 it had already found
the same solution as the other algorithms.
Evolution strategies worked like the genetic algorithm as well. It was left running for
50 generations but it had found the same solution after about 20 generations.
Finally, CFS behaved very differently from the previous methods. On both smoothed and
unsmoothed data (smoothing does not make a big difference to the correlation matrix
anyway), and with the discount factor ranging from 0.01 to 1, it always selected only
one variable: trh (relative humidity).
Apparently, in this data set there was only one local minimum of the cross-validation
error over the different subsets of features, and that was the global minimum. Since
the search space behaves so well, the use of so many different algorithms does not
make much sense; of course, we could not have known that from the beginning. In a
more complex space with many stationary points, however, as we expected for the
Griffin data set, the use of many different algorithms does make sense. It is possible
that there is a clearly optimal point that cannot easily be reached by some of the
algorithms. In our problem, we could expect a subset of features with significantly
better performance than the others; this subset will not necessarily be reachable for
some of the algorithms, but it is worth finding if it is sufficiently better than the
ones the other algorithms find.
5.5 Discussion about the results of feature selection
There is no point in discussing the results for the Griffin data set further. On the
other hand, since we will use the Harwood data set for further modelling, we have to
decide which variables we will finally use. Clearly, we would prefer a smaller number
of features: the selection procedure discarded only 2 and 3 features for the smoothed
and unsmoothed data respectively. Looking at Figure 5.2, we see that the improvement
from the first features to the last is not that big, so we could discard many of them.
It was decided to use the first 4 features; considering that a lot of time would be
spent adjusting the parameters of the models, this looked like a reasonable decision.
The four variables for the unsmoothed data were:
1. tPARin (incoming radiation)
2. trh (relative humidity)
3. tLEc (latent heat flux)
4. tTs5 (soil temperature at 5 cm depth)
For the smoothed data the variables were:
1. tPARin (incoming radiation)
2. tLEc (latent heat flux)
3. tMeanQ (absolute humidity)
4. twdir (wind direction)
Chapter 6
Modelling
Having finished preprocessing, we will now continue with model fitting. Here we will
only use the Harwood data set, since modelling could not be carried out with the
spurious variables: we would be trying to predict a value using almost the same value
as input.
Two popular machine learning techniques have been used: Support Vector Regression
(SVR) and Multilayer Perceptrons (MLPs) that are a kind of Artificial Neural Net-
works (ANNs). Both are supervised learning techniques. That is, they are given some
examples of inputs and the corresponding outputs and are expected to learn the rela-
tionship between them, predicting the correct output for previously unseen input. Both
methods’ flexibility to model non-linear data has made them quite popular and they
have been applied to a huge variety of applications.
A library of machine learning routines written in C++, Torch 3 (Collobert et al.,
2002), available at http://www.torch.ch/, was used for both MLP and SVR modelling.
This was very handy, since much of the functionality needed was already implemented
and we just had to put the necessary routines together.
In this chapter we will first introduce MLPs and SVR, along with MLR, which was used
for feature selection and also appears in the results for comparison. Next we will
talk about modelling and experimental design, and finally we will present the results
obtained.
Chapter 6. Modelling 47
Figure 6.1: The least squares solution given a set of observations (xi, yi).
6.1 Modelling techniques
6.1.1 Multiple Linear Regression
Multiple Linear Regression (MLR) is the well known least squares fitting. The solution
of MLR is a simple line that best interpolates the data; this can be seen for one-
dimensional input in Figure 6.1. If we have a d-dimensional input x, the output of MLR
is:

y(x) = Σ_{k=1..d} w_k x_k + b    (6.1)

But how is best interpolation defined, and how do we compute the coefficients w_k and
b? The best line is the one that minimizes the Sum Squared Error (SSE):

SSE = Σ_{i=1..n} (y_i − y(x_i))^2    (6.2)

The solution for the w_k and b can be found by solving a (d + 1) × (d + 1) linear
system. It is convenient to express everything in matrix notation. We define the
vector w = (b, w_1, ..., w_d)^T and the augmented input vector x = (1, x_1, ..., x_d)^T.
Then (6.1) becomes:

y(x) = w^T x    (6.3)

Now, defining X as the matrix that has the augmented input vectors x^T as its rows and
Y as the vector of the corresponding target values, the SSE (6.2) is computed as:

SSE = (Y − Xw)^T (Y − Xw)    (6.4)

Since we want to minimise this, we differentiate with respect to w and equate to zero.
Differentiation gives:

X^T Y = X^T X w    (6.5)

The solution is:

w = (X^T X)^{-1} X^T Y    (6.6)

Instead of computing the matrix inverse in (6.6), we usually solve the linear system
that (6.5) represents. Once we have w, we have found the best interpolating line.
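The normal equations can be solved directly; an illustrative Python sketch (not the Matlab/Torch code used in the thesis):

```python
import numpy as np

def mlr_fit(X, y):
    """Fit MLR by solving the normal equations X^T X w = X^T y,
    where w = (b, w1, ..., wd)."""
    A = np.c_[np.ones(len(y)), X]              # augment each input with a 1
    return np.linalg.solve(A.T @ A, A.T @ y)   # solve, rather than invert

def mlr_predict(w, X):
    return np.c_[np.ones(len(X)), X] @ w
```

Solving the linear system is numerically preferable to forming the explicit inverse, which is exactly the point made above.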
6.1.2 Multilayer Perceptrons
Multilayer Perceptrons (MLPs) belong to the family of Artificial Neural Networks
(ANNs). ANNs are inspired from natural neural systems where information processing
occurs in stages: the result of one stage is introduced to the next and so on.
The basic computational unit of an ANN is the neuron, and the simplest model of a
neuron is the perceptron. A perceptron computes a weighted sum of its inputs and
applies a non-linear function to it:

y = g( Σ_{j=1..n} w_j x_j + b )

where the w_j are the weights of the inputs to the neuron, the x_j are the inputs and
b is a bias parameter. The non-linear function g is called the transfer function; some
usual choices for it are the sigmoid, the hyperbolic tangent and the Gaussian.
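A single perceptron is essentially a one-liner; an illustrative sketch with a sigmoid transfer function:

```python
import numpy as np

def perceptron(x, w, b):
    """Output of a single perceptron with a sigmoid transfer function:
    y = g(w . x + b) with g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```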
A simple perceptron is not very powerful. Its limitations are well known in the lit-
erature (Minsky and Papert, 1969): for classification it can model only linearly
separable problems, and similar inflexibilities appear for regression. The separability
problem is illustrated in Figure 6.2. The decision boundary of a perceptron used for
classification is just a straight line for two-dimensional input, or an (n − 1)-
dimensional hyperplane for n-dimensional input. The problem on the left is linearly
separable, but in the problem on the right there is no line that can separate the two
classes. Similarly, in Figure 6.3 we can see that the output of a neuron varies only
in one direction, showing that a simple perceptron cannot do much for complicated
regression
Figure 6.2: A linearly separable problem (left) and a non-linearly separable problem
(right)
Figure 6.3: The output of a neuron with a sigmoid transfer function.
problems either. This neuron uses a sigmoid transfer function, but similar things happen
for the other transfer functions as well. The problem lies in the linear combination of
the inputs.
The solution is to use the idea of sequential processing described earlier. If we add
hidden layers of neurons to the network we can represent any continuous function. A
MLP is exactly that, a sequence of layers of neurons. It is a fully connected feedfor-
ward ANN, i.e. all neurons in one layer are fully connected with all neurons in the
next layer and connectivity flows only in one direction. Various other connectivities
and architectures of ANNs exist but MLPs are probably the most popular. A simple
MLP with one input layer, one hidden layer and one output layer is represented graphically
in Figure 6.4.

Figure 6.4: A MLP with one input layer, one hidden layer and one output layer

The input layer has k units, the hidden layer has l and the output layer
has m. Each box is a neuron with the functionality described.
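The forward pass of such a k–l–m network can be sketched as follows (NumPy; the layer sizes, random weights and sigmoid hidden layer are illustrative assumptions, not the networks actually trained in this study):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, b1, W2, b2):
    """One hidden layer: each layer computes g(Wx + b) and feeds the next."""
    h = sigmoid(W1 @ x + b1)   # hidden layer, l units
    return W2 @ h + b2         # linear output layer, m units (regression)

k, l, m = 4, 3, 1
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(l, k)), np.zeros(l)
W2, b2 = rng.normal(size=(m, l)), np.zeros(m)
y = mlp_forward(rng.normal(size=k), W1, b1, W2, b2)  # shape (m,)
```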
But how can we learn the weights and the biases of the network? Given that we want
to minimize the error, we can learn the parameters using a non-linear optimization al-
gorithm like gradient descent. This is usually implemented with a method called Back
Propagation (BP). We will not cover it thoroughly here; in short, BP propagates the
training error from the output back to the previous layers and adjusts the weights
accordingly. Details can be found in (Bishop, 1995).
An important issue with MLPs is that we have to decide about their architecture: the
number of hidden layers and the number of units. Any function that can be modelled
with a network with any number of hidden layers can be modelled from a network with
one hidden layer and enough units in it. Therefore, we decided to use only one hidden
layer. The important question, though, is the number of units in the hidden layer. If we
don’t have enough hidden units, the network will not have the capacity to model the
data well. On the other hand, if we have too many hidden units, the network will overfit
the data and we will not have good generalization performance.
Another issue related to overfitting is when we stop training. If we leave BP
or gradient descent running even when the training error is very low, then the MLP will
overfit the data. If we stop training early, then it will not capture the real relationship
between the input and target variables. More details about overfitting in
neural networks and the issues just described can be found in (Lawrence et al.,
1996). Consequently, although MLPs are a very powerful class of models, their very
high capacity for non-linear modelling is counterbalanced by the fact that
they are difficult to train.
6.1.3 Support Vector Regression
Support Vector Machines (SVMs) are a relatively new technique. SVMs emerged from
the work of Vapnik on computational learning theory (Vapnik, 1995), (Vapnik, 1998).
Below we briefly introduce the basic ideas of SVMs for classification, and after that
we talk about Support Vector Regression (SVR). A more detailed description of the
ideas presented here can be found in (Burges, 1998), (Christianini and Shawe-Taylor,
2000), (Cortes and Vapnik, 1995) and (Smola and Schölkopf, 1998).
The first main concept of SVMs is maximum separability. When we talked earlier
about perceptrons, we introduced the idea of linear separability (Figure 6.2). The per-
ceptron chooses for the decision boundary any separating hyperplane. On the other
hand, the SVM solution is unique: it is the hyperplane that results in the maximum
separability between classes (Figure 6.5(a)). Intuitively this looks like the safest choice
for the decision boundary of classification.
Let t_i denote the target value, which will be +1 for the one class and −1 for the other.
The separating hyperplane is w · x + b = 0, chosen so that w · x + b > 0 for data points
belonging to class +1 and w · x + b < 0 for data points belonging to class −1. Let d₊
be the distance between the separating hyperplane and the closest point belonging to
class +1, and d₋ the distance between the separating hyperplane and the closest point
belonging to class −1. The margin of the solution is the minimum of d₊ and d₋. This
minimum is largest when d₊ = d₋, in which case margin = d₊ = d₋.
Now, let’s define the hyperplanes H₊ and H₋ that are parallel to the separating hy-
perplane and lie at a distance equal to the margin from it. At least one data point from
class +1 lies on H₊ and one data point from class −1 lies on H₋. The distance between
H₊ and H₋ is twice the margin. If x₊ is a data point lying on H₊ and x₋ is a data point
lying on H₋, then the distance between H₊ and the origin of the axes is w · x₊ / ‖w‖
and the distance between H₋ and the origin of the axes is
Figure 6.5: (a) Some separating hyperplanes and the maximum margin one (b) The
maximum margin hyperplanes H, H₊ and H₋

w · x₋ / ‖w‖. Therefore, the distance between H₊ and H₋ is (w · x₊ − w · x₋) / ‖w‖,
and what we get is:

2 · margin = (w · x₊ − w · x₋) / ‖w‖ (6.7)
Furthermore, we can see that we can scale w and b and still get essentially the same
separating hyperplane, since cw · x + cb = 0 defines the same plane. We can therefore
choose w and b so that for all data points:

w · x_i + b ≥ +1 for t_i = +1 (6.8)
w · x_i + b ≤ −1 for t_i = −1 (6.9)

It is clear that for x₊ (6.8) holds with equality and for x₋ (6.9) holds with equality,
that is:

w · x₊ + b = +1 (6.10)
w · x₋ + b = −1 (6.11)
If we subtract the second from the first, we get:

w · x₊ − w · x₋ = 2 (6.12)

If we substitute the left hand side of this into (6.7) we get:

margin = 1 / ‖w‖ (6.13)

We now have an expression for the margin. Given that we want to maximize the
margin, the problem now is to minimize ‖w‖ subject to constraints (6.8) and (6.9). The
solution to this problem is given by a quadratic programming problem using Lagrange
multipliers and has the form:

w = ∑_{i=1}^{n} a_i t_i x_i (6.14)

In this formula, the a_i are coefficients that determine the contribution of each vector to
the solution. For most of the data points, a_i is zero. Only the data points that lie on H₊
and H₋ have non-zero a_i. Those data points are the support vectors. Having found the
hyperplane with the maximum margin we can classify a new example x with:
t_x = +1 if ∑_{i=1}^{n} a_i t_i (x_i · x) + b > 0 (6.15)

t_x = −1 if ∑_{i=1}^{n} a_i t_i (x_i · x) + b < 0 (6.16)
But what happens when the data is not linearly separable? Then a positive slack vari-
able ξ_i is introduced for every data point and instead of (6.8) and (6.9) we now have
the following constraints:

w · x_i + b ≥ +1 − ξ_i for t_i = +1 (6.17)
w · x_i + b ≤ −1 + ξ_i for t_i = −1 (6.18)

Doing this, we allow violation of the hard margin that the previous constraints implied.
We now have a soft margin that is allowed to break, but with a penalty equal to ξ_i. We
want to minimize both ‖w‖² and the ξ_i. The new function to be optimized is:

J = ‖w‖² + C ∑_{i=1}^{n} ξ_i (6.19)
Figure 6.6: Transformation from the input space to the feature space where the problem
is linearly separable
C is a parameter that determines how important the minimization of ‖w‖² is versus the
minimization of the slack variables. If we have a large C we get a higher penalty for
errors and we enforce more accurate solutions. The form of the solution is again the
same (equation 6.14).
The second ingredient of SVMs is the kernel trick. It is well known that in a very
high-dimensional space, data points are more likely to be linearly separable. It would
therefore be handy to be able to map the input data to a higher dimensional space
where it is linearly separable. This higher dimensional space is called the feature
space and the transformation from the input space to the feature space where the data
is linearly separable can be seen in Figure 6.6. Let’s suppose that we achieve this with
a function Φ(x). Now, having mapped the input space to the feature space, we can find
the maximum separating hyperplane in the feature space. However, the feature space
is used only in terms of inner products between vectors, like Φ(x_i) · Φ(x_j). Fortunately,
we can avoid the very expensive computation of Φ(x_i), Φ(x_j) and Φ(x_i) · Φ(x_j) in the
higher (even infinite) dimensional space. This is achieved with a kernel function k
such that k(x_i, x_j) = Φ(x_i) · Φ(x_j). We can use the kernel function instead of computing
the inner products to train the SVM and find the best separating hyperplane in the
higher dimensional space. Nothing else changes in the solution, we just substitute the
inner product x_i · x_j with Φ(x_i) · Φ(x_j). A widely used kernel is the gaussian:

k(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ) (6.20)

This actually corresponds to an implicit computation in an infinite dimensional feature
space. This is the kernel used in our experiments also.
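A sketch of this kernel (NumPy; σ is the free bandwidth parameter):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma):
    """k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2)), equation (6.20)."""
    d2 = np.sum((np.asarray(xi) - np.asarray(xj)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
k_same = gaussian_kernel(a, a, sigma=1.0)   # identical points give 1.0
k_diff = gaussian_kernel(a, b, sigma=1.0)   # ||a - b||^2 = 2, so exp(-1)
```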
Having examined the basic concepts of SVMs, it is time to talk about SVR. First, let’s
remember that in classification we tried to keep the data outside the band between the
hyperplanes H₊ and H₋. For the regression task something like that is not applicable;
instead, we want to have the data as close as possible to the predicted value, but not
necessarily exactly on it: we allow some margin. For this reason we use the
ε-insensitive error function:

E_ε(z) = |z| − ε if |z| > ε, and E_ε(z) = 0 otherwise
The ε-insensitive error function is plotted in Figure 6.7, where we can see the ε-tube
that surrounds the prediction line. The ε-insensitive error function tolerates errors
smaller than ε. In addition, there is again a soft margin: the target value is allowed to
differ from the predicted value by more than the quantity ε, but this again incurs a
penalty. The penalty is again expressed using slack variables, but this time we have
two, one for overpredicting and one for underpredicting, ξ_i and ξ̂_i.
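The ε-insensitive error function itself is a one-liner (a sketch; ε is the tolerated error):

```python
def eps_insensitive(z, eps):
    """E_eps(z) = |z| - eps if |z| > eps, else 0: errors inside the tube cost nothing."""
    return max(abs(z) - eps, 0.0)

inside = eps_insensitive(0.02, eps=0.03)   # within the tube: no penalty
outside = eps_insensitive(0.10, eps=0.03)  # only the excess over eps is penalised
```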
The function that we want to minimize to get the solution is now:

J = ‖w‖² + C ∑_{i=1}^{n} (ξ_i + ξ̂_i) (6.21)

under the constraints:

y(x_i) − t_i ≤ ε + ξ_i (6.22)
t_i − y(x_i) ≤ ε + ξ̂_i (6.23)

and the form of the solution is:

w = ∑_{i=1}^{n} β_i x_i (6.24)

Again, many of the β_i are zero. The data points that are at a distance smaller than ε
from the prediction, i.e. inside the tube, have β_i zero. The rest are the support vectors.
Then we compute the prediction
Figure 6.7: The epsilon function
for a new input x with:

y(x) = ∑_{i=1}^{n} β_i (x_i · x) + b (6.25)
Again this can be kernelized so that the output is a non-linear function of the inputs.
As in the classification problem, nothing else changes; we just substitute the inner
products in the input space with inner products in the feature space.
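Given a set of support vectors, coefficients β_i and a bias b, the kernelized form of (6.25) can be sketched as follows (NumPy; the β values shown are arbitrary illustrations, not a trained model):

```python
import numpy as np

def rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def svr_predict(x, support_vectors, betas, b, kernel=rbf):
    """y(x) = sum_i beta_i k(x_i, x) + b: equation (6.25) with kernelized inner products."""
    return sum(beta * kernel(sv, x) for sv, beta in zip(support_vectors, betas)) + b

svs = [np.array([0.0]), np.array([1.0])]
betas = [0.5, -0.25]   # illustrative coefficients, not fitted
# k(sv1, x) = 1, k(sv2, x) = exp(-0.5), so y = 0.5*1 - 0.25*exp(-0.5) + 0.1
y = svr_predict(np.array([0.0]), svs, betas, b=0.1)
```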
6.2 Modelling design
6.2.1 Parameter optimization
We have described two classes of machine learning models. Indeed, we did not de-
scribe specific models but huge families of models. To do modelling we have to
determine their parameters; that is, we have to choose a specific model from the huge
space of models. There is clearly at least one setting of parameters for a given class
of models such that performance is maximized. In practice, though, finding the
specific parameter values that maximize performance is a very difficult, and in many
cases practically impossible, task.
For the MLP we have to determine the number of hidden neurons and the training stop
criterion, i.e. the required accuracy of the model on the training set. For SVR we have
to determine the size of the ε-tube, the tradeoff parameter C and the width σ of the
gaussian kernel.
Unfortunately, we cannot guarantee that optimizing one of the parameters while
keeping the others constant, and then doing the same with the others, will give the best
results. Since optimizing the parameters is a huge search problem, it is obvious that we
need to compromise between the cost of finding a good solution and the accuracy of
the final model.
Since we cannot consider all possible settings for the parameters (they are infinite), we
simply tried to explore and find the best setting we could, using values around the usual
ones for similar problems. In practice, we set up a small program that tries many
different parameter settings within a range of reasonable values. We used 3-fold cross-
validation to estimate the suitability of an SVR model and 10-fold cross-validation for
MLPs; even 3-fold cross-validation was very expensive for SVR. To have directly
comparable results we would have preferred 10-fold cross-validation for SVR as well,
but this would have been too expensive.
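The cross-validation loop can be sketched as follows (NumPy; `fit` and `predict` are placeholders for whichever model is being evaluated, and the mean predictor below is purely illustrative):

```python
import numpy as np

def kfold_mse(X, y, k, fit, predict):
    """Average held-out MSE over k folds; each fold is used once as the test set."""
    idx = np.arange(len(y))
    errors = []
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        errors.append(np.mean((pred - y[test]) ** 2))
    return np.mean(errors)

# illustrative use with a trivial mean predictor
X = np.arange(30.0).reshape(-1, 1)
y = np.arange(30.0)
err = kfold_mse(X, y, k=3,
                fit=lambda Xt, yt: yt.mean(),
                predict=lambda m, Xs: np.full(len(Xs), m))
```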
6.2.2 Time series modelling
We must not forget that the Harwood data set consists of half-hourly measurements.
Therefore, it could be modelled using a time-series framework. Indeed we can expect
that this would give better results. The reason for that is that plants react almost im-
mediately to changes in their environment but not instantly. Usually, some time passes
until the results are apparent. Of course, to check this properly we would also have to
optimize the number of time steps used for modelling, or include variables from pre-
vious time points in the feature selection process, which we did not do. We will, how-
ever, try 3-step and 5-step time-series modelling. Therefore, considering that we will
also try smoothed and unsmoothed data, we now have six combinations for each model:
1. Not smoothed, no time series modelling
2. 5 point smoothing, no time series modelling
3. Not smoothed, 3-step time series modelling
4. 5 point smoothing, 3-step time series modelling
5. Not smoothed, 5-step time series modelling
6. 5 point smoothing, 5-step time series modelling
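Building the lagged inputs for the p-step variants amounts to stacking shifted copies of the design matrix. A sketch (NumPy; the function name is ours):

```python
import numpy as np

def lagged_design(X, steps):
    """Concatenate X at times t, t-1, ..., t-steps+1 so each row sees `steps` time points."""
    n = len(X) - steps + 1
    return np.hstack([X[i:i + n] for i in range(steps - 1, -1, -1)])

X = np.arange(12.0).reshape(6, 2)   # 6 half-hourly rows, 2 variables
X3 = lagged_design(X, steps=3)      # 4 rows, 6 columns: current values plus 2 lags
```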
6.2.3 Performance evaluation
The Mean Squared Error that was used for choosing between models during feature
selection can also be used here for picking the best model. It does not, though, give a
good idea of how good the fit is. Therefore, we will also use R², the explained variance,
for which a high value is desired. In addition, the bias-to-variance ratio is presented;
this is equal to 1 − R², and a low value is desired.
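Computed on a held-out fold, the two measures can be sketched as (NumPy; the example targets and predictions are illustrative):

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; the bias-to-variance ratio is then 1 - R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
r2 = explained_variance(y_true, y_pred)   # 0.98 for this example
bvratio = 1.0 - r2                        # low values are better
```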
It is worth noting that we compute this always on the test part of the cross validation
procedure so that we get a real estimate of the performance of the model. After com-
puting the measure of the performance for each fold we average them to get an overall
estimate of the performance. This way we can be quite sure that the performance of
the models on new data is reflected in that result.
6.3 Results
For comparison purposes, we first give the results of modelling using MLR. This will
help us see whether the more complicated, non-linear machine learning models really
improve the prediction, and will give an indication of whether the idea of doing feature
selection on the linear model and then using a non-linear model for more precision was
right. Next we present the results for MLPs and SVR; for those we only present the
performance measures for the best models found.
6.3.1 Multiple Linear Regression
MLR modelling achieves rather good performance. The best model used the smoothed
variables and 3-step time series, achieving 78% explained variance. This improves a
little on non-time-series modelling with smoothed data, which had almost 76%
explained variance. In every case, modelling with smoothed data is more accurate than
with unsmoothed data. Also, adding 2 more steps in time-series modelling does not
increase the performance. The results can be seen in Table 6.1.
No smoothing                            Best R² = 0.6614   MSE = 11.5091   bvratio = 0.3386
5-point smoothing                       Best R² = 0.7595   MSE = 7.1977    bvratio = 0.2405
No smoothing, 3-step time series        Best R² = 0.7110   MSE = 9.8158    bvratio = 0.2890
5-point smoothing, 3-step time series   Best R² = 0.7818   MSE = 9.8158    bvratio = 0.2182
No smoothing, 5-step time series        Best R² = 0.7147   MSE = 9.6888    bvratio = 0.2853
5-point smoothing, 5-step time series   Best R² = 0.7817   MSE = 6.5287    bvratio = 0.2183

Table 6.1: The results for MLR
6.3.2 Multilayer Perceptron
MLPs improved a little on the performance of MLR, though probably not as much as
we expected. Again the smoothed data give more accurate models, but here time-series
modelling has only a small impact on performance. The results for MLPs are
presented in Table 6.2.
No smoothing                            Best R² = 0.7456   NHU = 12   Stop MSE = 0.01    MSE = 8.6599   bvratio = 0.2544
5-point smoothing                       Best R² = 0.8257   NHU = 29   Stop MSE = 0.05    MSE = 5.2196   bvratio = 0.1743
No smoothing, 3-step time series        Best R² = 0.7620   NHU = 9    Stop MSE = 0.001   MSE = 8.0949   bvratio = 0.2380
5-point smoothing, 3-step time series   Best R² = 0.8260   NHU = 17   Stop MSE = 0.05    MSE = 5.2144   bvratio = 0.1740
No smoothing, 5-step time series        Best R² = 0.7641   NHU = 12   Stop MSE = 0.001   MSE = 8.0168   bvratio = 0.2359
5-point smoothing, 5-step time series   Best R² = 0.8276   NHU = 16   Stop MSE = 0.001   MSE = 5.1575   bvratio = 0.1724

Table 6.2: The results for MLPs
6.3.3 Support Vector Regression
SVR is also distinctly better than MLR and in most cases at least almost as good
as MLPs. SVR also gave the most accurate model: 83% explained variance with
smoothed data and no time-series modelling. As with MLPs, time-series modelling
had little effect on performance and smoothed data gave more accurate models. It is
worth noting that finding appropriate values for the parameters σ, ε and C was very
difficult: much more difficult than finding a good number of hidden neurons or the
stop accuracy for the MLP. SVR was also much slower than both previous methods: a
3-fold run took a few hours and a 10-fold run took even more time. This could explain
why five-step time-series modelling did not do very well. It is also quite probable that
for the other experiments we did not find the optimal parameter settings. However, we
expected better models from SVR. The results for SVR can be seen in Table 6.3.
No smoothing                            Best R² = 0.7541   σ = 110   ε = 0.03   C = 10   MSE = 8.3649   bvratio = 0.2459
5-point smoothing                       Best R² = 0.8299   σ = 140   ε = 0.03   C = 10   MSE = 5.0893   bvratio = 0.1701
No smoothing, 3-step time series        Best R² = 0.7465   σ = 110   ε = 0.03   C = 10   MSE = 8.6076   bvratio = 0.2535
5-point smoothing, 3-step time series   Best R² = 0.8147   σ = 100   ε = 0.03   C = 5    MSE = 5.5435   bvratio = 0.1853
No smoothing, 5-step time series        Best R² = 0.7067   σ = 120   ε = 0.03   C = 10   MSE = 9.9766   bvratio = 0.2933
5-point smoothing, 5-step time series   Best R² = 0.8212   σ = 120   ε = 0.03   C = 10   MSE = 5.3523   bvratio = 0.1788

Table 6.3: The results for SVR
6.4 Summary
Three modelling methods were tried: MLPs, SVR and MLR. MLR was used for com-
parison with the other two and to see whether the semi-wrapper approach was reason-
able. Indeed, SVR and MLPs improved the performance compared to MLR, but not
as much as we expected. MLPs were slightly better than SVR, but not in every case.
The best model achieved almost 83% explained variance; this was an SVR model
and used 5-point smoothed data and no time-series modelling. It was expected that
SVR would clearly outperform MLPs, but it did not. The reason for this was that SVR
was very slow and very difficult to configure: we had to choose three different
continuous-valued parameters.
Chapter 7
Conclusions and further work
Having presented the final modelling and the results, it is necessary to discuss whether
the goal of this project was fulfilled, reexamine our decisions throughout this study, see
how we could possibly improve things, and propose ideas for further work and research
on this problem.

In this chapter we first discuss our final conclusions about what has been done and
what has been achieved, then we talk about future work, and finally we discuss the
automated model extraction approach that we have used.
7.1 Conclusions
First we have to ask whether the goal of the project was fulfilled. Considering that our
initial goal was to investigate the use of automated machine learning techniques on the
carbon flux prediction problem, our goal was indeed achieved. We have found models
that represent the problem, if not perfectly (we cannot be sure of something like that
anyway), at least adequately well. MLP and SVR models have been found with per-
formance almost as good as other work (Stubbs, 2002; van Wijk and Bouten, 1999).
Of course the results are not really comparable, since Stubbs and van Wijk modelled
the process over a shorter period of time and used hand-picked variables. If we also
take into account that we used no prior knowledge at all, for example in choosing the
variables or in using some known relationship between input variables to fill the miss-
ing values, and that we built our models on a period of about 18 months of observa-
tions, this study has been successful.
It will be useful, though, to go through all the processing steps again to see how they
affected the results and whether our choices were ultimately justified. This discussion
will be limited to the Harwood data set.
First of all, let’s talk about gap filling. The method used was quite powerful: under
the assumptions discussed in chapter 4 it appears to have done quite well. Of course
we cannot be sure that the data was not biased by the gap filling process, but as ex-
plained earlier this is something we cannot do much about. We think that the EM
algorithm is adequate for this study.
As far as noise removal is concerned, our choice was clearly not the best possible.
In particular, we should have smoothed each variable independently rather than de-
riving the number of averaging points for all variables from just a few of them, since
some variables are less noisy than others. Given the short amount of time available
this is acceptable, but it should be avoided in future studies.
For dimensionality reduction, the idea of using the linear model to evaluate the rel-
evance of features is justified as far as computational cost is concerned, but as we
said it is clearly not the optimal solution. Given the size of the real problem, though,
it was a reasonable decision. It could well be that the direct wrapper approach gives
significantly better results than the semi-wrapper approach, and it is therefore worth
trying despite the cost. Nevertheless, since our approach is used by other people with
good results, we accept that sacrificing a probably better solution for decreased com-
putational cost is worthwhile. Let us remember, anyway, that our goal was to explore
for good models that represent the process sufficiently well, not necessarily the best
one, which is also very hard to find.
In addition, the decision to ultimately use the first four variables selected by forward
selection is a little arbitrary, although it was based on inspection of the error curve
(Figure 5.2). It could be that the first five variables are indeed much better for the
non-linear problem, i.e. the seemingly small improvement from adding more variables
could mean that much more information essential to the modelling task is contained
in the set of input variables. In that case, the non-linear method could probably find a
much better model. Thus, the important question is whether all the information nec-
essary for modelling the problem is contained in the selected variables. The answer is
probably yes, since we found adequately good models based on this set of variables.
Considering also that it should be possible to find even better models with better pa-
rameters (especially for SVR), it looks like the necessary information is probably there.
Nevertheless, we cannot be sure of this either. An investigation of the residuals of our
best models could shed some light on it.
For the modelling part, as we just said, it is indeed very hard to find the optimal model
given a set of input and output variables. We searched the space of models as thor-
oughly as we could given the time constraints, trying various parameter settings in a
plausible range of values. Nevertheless, it is almost certain that we have not found the
best possible models given the input and target variables.

All this can be summarised by saying that we feel confident that the process was
modelled adequately well, but there is still much room for improvement. We propose
some ways to achieve this in the next section.
7.2 Future work
The most important direction worth pursuing is to apply the methods that have been
described to a bigger data set, like Griffin, but of course one without spurious vari-
ables; this would give much more useful conclusions. Below, however, we discuss
possible improvements to the various parts of the analysis.
7.2.1 Data cleaning
For both data cleaning tasks that were performed, we could try other approaches that
might give better results.
First of all, for gap filling using the EM algorithm there are two options that should
enhance the quality of the filled values. One is taking temporal covariability into ac-
count, meaning that the (much larger) covariance matrix and mean vector would be
computed using relationships between variables at different time points as well. More
information would thus be incorporated into the estimation, so we could expect more
accurate estimates; this is described in (Schneider, 2001). The other is to relax the
rather strong assumption that the data follows a simple gaussian distribution. It would
be more realistic to assume that the data follows a mixture of distributions, i.e. to sup-
pose a mixture model for the data. This is quite reasonable, since the variables prob-
ably behave quite differently across seasons, i.e. seasons form clusters, each with a
different distribution. The way to do this is described in (Ghahramani and Jordan,
1994b) and (Ghahramani and Jordan, 1994a).
For noise removal, too, there are many methods that could give more accurate results,
and they have already been mentioned. The most important point, as already said, is
to consider each variable separately.
7.2.2 Dimensionality reduction
It would be worth considering a feature extraction method like PCA or LLE. It is
possible that these methods would retain the variability and information important for
modelling. Although models using extracted features would be even harder to interpret
than the ones we already have, this is certainly an approach we could consider.

Also, since we ultimately wanted to try time-series modelling, the feature selection
algorithms should take this into account, i.e. we could let the algorithms also select
from variables corresponding to previous time steps.

It would also be reasonable to try a larger number (or the whole set) of the features
selected by the algorithms, to see whether any information missing from our models
exists there.
7.2.3 Modelling
As far as modelling is concerned, apart from the obvious choice of searching for better
parameters, a useful improvement would be to optimise the number of steps in time-
series modelling. It would also be useful to try models that include a ‘time of day’
variable.

Furthermore, we could try different methods that look appropriate for our problem,
such as Radial Basis Function networks (Orr, 1996) or Hidden Markov Models (Ra-
biner and Juang, 1986), and we could also explore different design options for the
methods that we used: in MLPs we could try different transfer functions and in SVR
different kernels.
Finally, it would be interesting to combine our methods with some existing parametric
model by modelling the residual of the parametric model with one of our methods, as
in (Dekker et al., 2001). Of course, we could also model the residuals of our own
methods, to see whether they contain any information that we have not taken into
account when building the initial models.
7.3 General discussion
We have seen that totally automated extraction of models from features is feasible.
Indeed, the models were quite good although we used no prior knowledge about the
problem. The question is whether a totally automated approach like this is always
applicable, and when it is better to use prior knowledge. Usually prior knowledge
about the problem gives better results. Nevertheless, this is not always true: our
knowledge can be uncertain or incomplete (it may well be so in our problem), and
then we should probably explore the model space without using it. It is important to
note, though, that in this case modelling is more difficult, since there are more choices
to consider. The final conclusion is that in every case a very good understanding of
the problem is required, and the modeller has to think very seriously about it.
We have to point out, however, that modern machine learning techniques make mod-
elling without the use of prior knowledge more effective. Their flexibility and wide
applicability make them an ideal tool for modelling without prior knowledge. They
can help to discover new knowledge from raw data, on the condition that the data
contains no inconsistencies (such as those in the Griffin data set).
Bibliography
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University
Press.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition.
Data Mining and Knowledge Discovery, 2(2):121–167.
Chen, D., Hargreaves, N., Ware, D., and Liu, Y. (2000). A fuzzy logic model with ge-
netic algorithms for analyzing fish stock-recruitment relationship. Canadian Journal
of Fisheries and Aquatic Sciences.
Christianini, N. and Shawe-Taylor, J. (2000). An introduction to support vector ma-
chines. Cambridge University Press.
Collobert, R., Bengio, S., and Mariethoz, J. (2002). Torch: a modular machine learning
software library. Technical report, IDIAP.
Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning,
20(3):273–297.
Dekker, S., Bouten, W., and Schaap, M. (2001). Analysing forest transpiration model
errors with artificial neural networks. Journal of Hydrology, (246):197–208.
Fujikawa, Y. (2001). Efficient algorithms for dealing with missing values in knowledge
discovery. Master’s thesis, School of Knowledge Science, Japan Advanced Institute
of Science and Technology.
Ghahramani, Z. and Jordan, M. I. (1994a). Learning from incomplete data. Technical
Report AIM-1509.
Ghahramani, Z. and Jordan, M. I. (1994b). Supervised learning from incomplete data
via an EM approach. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Ad-
vances in Neural Information Processing Systems, volume 6, pages 120–127. Mor-
gan Kaufmann Publishers, Inc.
Hall, M. (1999). Feature selection for discrete and numeric class machine learning.
Technical report, Department of computer science, University of Waikato.
John, G., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset se-
lection problem. In Machine Learning: Proceedings of the Eleventh International
Conference.
Langley, P. (1994). Selection of relevant features in machine learning. In Proceedings
of the AAAI Fall symposium on relevance.
Lawrence, S., Giles, C., and Tsoi, A. (1996). What size neural network gives opti-
mal generalization? Convergence properties of backpropagation. Technical report,
Department of Electrical and Computer Engineering, University of Queensland.
Lek, S. and Guegan, J. (1999). Artificial neural networks as a tool in ecological modelling, an introduction. Ecological Modelling, 120:65–73.
Little, R. J. A. and Rubin, D. B. (1987). Statistical analysis with missing data. Wiley.
Minsky, M. and Papert, S. (1969). Perceptrons: An Introduction to Computational
Geometry. MIT Press, Cambridge.
Mitchell, M. (1996). An introduction to genetic algorithms. MIT Press.
Orr, M. J. (1996). Introduction to radial basis function networks. Technical report, Centre for Cognitive Science, University of Edinburgh.
Parrot, L. and Kok, R. (2000). Incorporating complexity in ecosystem modelling. Complexity International.
Pyle, D. (1999). Data preparation for data mining. Morgan Kaufmann.
Rabiner, L. R. and Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, pages 4–15.
Recknagel, F. (2001). Applications of machine learning to ecological modelling. Ecological Modelling, 146:303–310.
Schneider, T. (2001). Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate.
Smola, A. J. and Schölkopf, B. (1998). A tutorial on support vector regression.
Stubbs, A. (2002). Modelling ecological data using machine learning. Master’s thesis,
School of Informatics, University of Edinburgh.
van Wijk, M. and Bouten, W. (1999). Water and carbon fluxes above European coniferous forests modelled with artificial neural networks. Ecological Modelling, 120:181–197.
Vapnik, V. (1995). The nature of statistical learning theory. Springer-Verlag.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley.
Vrugt, J., Bouten, W., Dekker, S., and Musters, P. (2002). Transpiration dynamics of an Austrian pine stand and its forest floor: identifying controlling conditions using artificial neural networks. Advances in Water Resources, 25:293–303.
Wallet, B. C. et al. (1996). A genetic algorithm for best subset selection in linear regression.
Whigham, P. and Recknagel, F. (2001). An inductive approach to ecological time series modelling by evolutionary computation. Ecological Modelling, 146:275–287.
Yang, J. and Honavar, V. (1999). Feature subset selection using a genetic algorithm.
Artificial Intelligence Group, Iowa State University.