Ghent University
Master Thesis
Exploration of deep autoencoders on cooking recipes
Author:
Lander Bodyn
Tutor:
Ir. Michiel Stock
Promoter:
Prof. Dr. Christophe Ley
Co-promoter:
Prof. Dr. Willem Waegeman
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Science in Computational Statistics
Department of Applied Mathematics, Computer Science and Statistics
January 2017
GHENT UNIVERSITY
Abstract
Master of Science in Computational Statistics
Exploration of deep autoencoders on cooking recipes
by Lander Bodyn
Deep autoencoders are a form of deep neural networks that can be used to reduce the
dimensionality of datasets. These deep autoencoder networks can sometimes be very
hard to train [1]. The gradient descent algorithm has been explored to train deep au-
toencoders on a dataset of cooking recipes. Minibatches, momentum and pretraining
were added as extensions to improve the gradient descent algorithm. The performance of the deep autoencoders for data reduction to two dimensions was compared to singular value decomposition. The best deep autoencoder model obtained a cross entropy loss of 0.048, much lower than the cross entropy loss of 0.066 for singular value decomposition. From the two reduced dimensions, the regions of the recipes were predicted using
the KNN and QDA algorithms. For the deep autoencoder models, the best predic-
tion accuracy was 65.4%, outperforming the best prediction accuracy of singular value
decomposition, 55.4%. The best prediction accuracy of the raw dataset was 69.8%,
suggesting that the deep autoencoders maintain the structure of the regions very well
in two dimensions. Using a deep autoencoder with data reduction to 100 dimensions,
the prediction accuracy was 72.0%, suggesting deep autoencoders might have some
usefulness for representation learning on this dataset. Dimensionality reduction tech-
niques can also be used as recommender systems, using collaborative filtering. Deep
autoencoder models were optimized to have the best retrieval rank of an ingredient
that was either removed from or added to an existing recipe. De Clercq et al. [2] have
built two similar recommender models on the same dataset: a non-negative matrix
factorization and a two-step kernel ridge regression model. The deep autoencoder
(mean rank = 25.2) outperforms the non-negative matrix factorization (mean rank =
33.0) and comes close in performance to the two-step kernel ridge regression (mean
rank = 23.6).
Acknowledgements
I would first like to thank my promoter Prof. Dr. Christophe Ley from the Faculty of
Sciences at Ghent University. He instantly accepted my plan to do a thesis in deep
learning and helped me by proofreading many parts of the thesis, suggesting which
parts I should explain more clearly. While my promoter is not accustomed to the field of deep learning, he helped me find a co-promoter who could guide me in the practical
parts of the thesis.
This brought me to my co-promoter Prof. Dr. Willem Waegeman and supervisor Ir.
Michiel Stock from the Faculty of Bioengineering at Ghent University. I would like to
thank both for coming up with a very interesting thesis subject, proof-reading several
parts of the thesis and continually guiding me during the thesis.
I also want to thank my friend Giancarlo Kerg, who inspired me to start my master
in Computational Statistics, as a foundation to move towards the field of machine
learning, and deep learning in particular.
I also want to thank the company Yazzoom, where I could do an internship in deep
learning during my thesis. Some of the skills I learned at Yazzoom helped me to make
progress in my thesis.
Finally, I want to thank my family, who have supported me throughout the whole
process. Special thanks to my grandma, who made my lunch and dinner every day,
and my parents, who endured all the fluctuations in my mood during the writing of
my thesis.
Contents

Abstract
Acknowledgements
Contents

1 Introduction
  1.1 Theoretical background
  1.2 Overview of the thesis
  1.3 The cooking recipes dataset

2 Methods
  2.1 From artificial intelligence to deep learning
    2.1.1 Machine learning
    2.1.2 Representation learning
    2.1.3 Deep learning
  2.2 Deep autoencoders
    2.2.1 Network architecture
    2.2.2 Singular value decomposition for dimension reduction
    2.2.3 Deep autoencoders for dimension reduction
    2.2.4 Deep autoencoders for representation learning
    2.2.5 Deep autoencoders for collaborative filtering
  2.3 Training the network with gradient descent
    2.3.1 Local minima
    2.3.2 The vanishing gradient problem
    2.3.3 Initialisation of the network parameters
    2.3.4 Minibatch gradient descent
    2.3.5 Momentum
  2.4 Optimisation of the hyperparameters
    2.4.1 The gradient descent hyperparameters
      2.4.1.1 Learning rate δ
      2.4.1.2 Batchsize
      2.4.1.3 Inertia α
      2.4.1.4 Initialisation range
    2.4.2 The network architecture
  2.5 Python and the Theano package
    2.5.1 Backward propagation of the gradient
    2.5.2 Other packages

3 Results
  3.1 Training the autoencoders
    3.1.1 Adding extensions to the gradient descent algorithm
    3.1.2 Plateaus
  3.2 Comparing data reduction methods
    3.2.1 Singular value decomposition
    3.2.2 Autoencoders
  3.3 Prediction of the regions
  3.4 Collaborative filtering for recipe creation
    3.4.1 Reconstruction of the removed ingredient
    3.4.2 Elimination of the added ingredient

4 Conclusion and discussion
  4.1 Conclusion
  4.2 Discussion

A Admission for circulating the work

Bibliography
Chapter 1
Introduction
1.1 Theoretical background
An autoencoder is a type of artificial neural network. When a neural network has
several hidden layers, the network is called a deep network. The gradient descent
algorithm is currently the dominant way of training neural networks. It can however
sometimes be difficult to train neural networks using the gradient descent algorithm;
this is especially true for deep autoencoders [1].
Autoencoders are designed to reduce the dimensionality of the dataset while minimizing
a reconstruction error. They can be seen as a non-linear extension of the linear data
reduction method singular value decomposition. Data reduction methods are useful to
obtain visualisations of the data in two or three dimensions. Dimensionality reduction
also has other applications. For example, the reduced features can be more suitable
for a machine learning task than the original features.
With the increased availability of datasets of cooking recipes online, machine learning
is starting to play a prominent role in tasks such as food preference modelling. Having
an algorithm that could combine leftover ingredients to create a good recipe would
be a useful application. De Clercq et al. [2] built two such recommender systems on
a dataset containing the ingredients of recipes. For the recommender systems, the
authors used a non-negative matrix factorization model and a two-step kernel ridge
regression model. Deep autoencoders can also be used as a recommender system: in
order to reduce the ingredients of the recipes, meaningful features of the recipes will
have to be learned. A selection of ingredients can be reconstructed by the autoencoder,
after which the selection will resemble the recipes from which the autoencoder has
learned its parameters.
1.2 Overview of the thesis
This thesis explores how deep autoencoders can be optimally trained with the
gradient descent algorithm on a dataset of cooking recipes. To speed up the gradient
descent algorithm and improve its performance, two extensions were added to the
algorithm: minibatches and momentum. Aside from improving the gradient descent
algorithm itself, pretraining of the network parameters was implemented as another
tool to facilitate convergence to a good solution.
The deep autoencoder models were compared to singular value decomposition (SVD)
for the purpose of data reduction. The performance of the models was measured using
a reconstruction error. It was also examined how well both methods maintained the
structure of the data, by visually checking if recipes with similar regions of origin lay
close together on the biplots of the reduced features. The regions of the recipes were
then predicted, using the KNN and QDA algorithms on the reduced features. These predictions might even be better than those of models using the original dataset:
data reduction algorithms can possibly make the data more suitable for the prediction
task. The thesis also explored the use of deep autoencoder models as recommender
systems. The same dataset and performance measures of De Clercq et al. were used, to
enable comparison with their recommender models [2].
1.3 The cooking recipes dataset
The data for the thesis was obtained from Ahn et al. [3]. Recipes with fewer than three
ingredients were removed as was done for the recommender systems of De Clercq et
al. [2], in order to enable comparison with the autoencoder recommender system. The
cleaned dataset contains 55001 different recipes using 381 ingredients. Each recipe
was represented by a binary vector: ones denote the presence of ingredients, zeros
denote the absence of ingredients. As each recipe contains only a small selection of
the ingredients, the data-matrix is a sparse matrix with a filling degree of 2.16%.
The region of origin of the recipes is included in the dataset. In total there are eleven
regions. Recipes of North American origin are the largest category, taking up 73.4% of the recipes. Due to the short culinary history of North America, most recipes
of North American origin are imported versions of recipes from all over the world.
Because of this, the North American recipes will be removed for the prediction of the
region of origin but not for the training of the autoencoder.
Of the 55001 recipes, 2500 will be set aside for validation and 2500 for testing. The
remaining recipes are used for training. The validation set will be used to determine the
convergence of the gradient descent algorithm as well as to select the optimal model
for the collaborative filtering. The test set will be used to test the performance of the
collaborative filtering. For the prediction of the origin, the dimensions of the whole
dataset (without the North American recipes) will be reduced with the autoencoder.
This dataset will then be split into 70% training data and 30% test data for the supervised
machine learning algorithms. There is no problem in using the training and validation
data of the autoencoder for the supervised machine learning problem: the autoencoder
is an unsupervised algorithm that does not require the values of the regions for training.
Chapter 2
Methods
2.1 From artificial intelligence to deep learning
Artificial intelligence (AI) is the field of making computers intelligent. Since pro-
grammable computers were first conceived, people have been wondering whether such
machines might become intelligent. In the early days of artificial intelligence, rapid
progress was made in logical tasks that can be easily defined with mathematical rules.
Since humans are typically not very good at these tasks, it did not take very long
before the computer started to outperform humans at those tasks. One such task is the logical board game chess, in which the IBM supercomputer Deep Blue defeated the world champion Garry Kasparov in 1997.
Ironically, many tasks that seem trivial for humans, like processing visual and auditory
information, are very hard for a computer to solve. The real challenge in artificial
intelligence turned out to be solving these kinds of intuitive problems. It is only
recently that some of these problems have been solved: for example, in March 2016 the AlphaGo program of DeepMind managed to defeat Lee Sedol, the world champion of Go [4]. Go is a board game similar to chess, but with far more board positions. Being a good player at Go requires a lot of spatial and intuitive thinking, a skill that
is very hard to write down in logical rules.
There have been many approaches to solving the challenges in artificial intelligence.
One of them is the knowledge-based approach. The knowledge-based approach tries
to make computers intelligent by hard coding different knowledge rules by hand. An
example of this is the Cyc project, which aims to enable AI applications of
human-like reasoning [5]. These efforts, however, have not been very fruitful.
It turned out to be very hard to compose logical rules that capture all of the complexity
of human reasoning. In Figure 2.1 the relation between AI and several of its subfields is shown. These subfields will be discussed in the following subsections.
Figure 2.1: A Venn diagram explaining the relation between artificial intelligence, machine learning, representation learning and deep learning. For each subfield an exclusive example is given [6].
2.1.1 Machine learning
Machine learning was invented as a different approach to artificial intelligence. Instead
of trying to hard-code everything, the programmer will define algorithms by which
the computer can extract its own logical rules from given data. Nowadays, machine
learning is used everywhere and has many applications. For example, logistic regression
can determine whether to recommend cesarean delivery [7]. Within machine learning,
there are two main categories of algorithms: supervised and unsupervised algorithms.
With supervised algorithms, the goal is to make a prediction about a desired outcome
variable, given some input variables (features). The algorithm will do this by modelling
the factors in the data that are responsible for variation in the outcome variable. In
order to learn the optimal parameters of the model, the algorithm will need some
training data to learn from. The parameters will be learned to give the best predictions
of the outcome variables. Since the training dataset is only a sample of the true data
generating process, it will contain some random fluctuations. If there is not enough
data or if the model is too complex (has a lot of parameters), it is possible that the
algorithm will model some of these random fluctuations. This is called overfitting and
is not desired: modelling the random fluctuations in the training set will not generalise
to new data. In order to obtain a real measure of the performance of the algorithm, the
model has to be tested on a separate dataset, called the test data. The difference in
performance between the train and test dataset can help to decide on the complexity
of the model to prevent overfitting. Supervised machine learning algorithms are often
split into two types, depending on the outcome variable. If the desired outcome value
of a supervised algorithm is discrete, one will speak of a classification problem. For a
continuous outcome variable, the problem is called a regression problem.
Unsupervised algorithms do not have a desired outcome variable. Instead, these
types of algorithms try to find structure in the data. Unsupervised algorithms will for
example try to find clusters in the data or try to reduce the number of dimensions in
the data set.
In all machine learning algorithms, the performance of the algorithms relies heavily on
the construction of relevant features. Raw data, like the pixels of a picture, might not
contain much correlation with the desired outcome. It is only by designing intelligent
features that such algorithms gain a lot of power. However, the creation of these
features can be very complex and time consuming for the programmer.
2.1.2 Representation learning
In machine learning, the field of representation learning will not only use algorithms to
learn a desired outcome from some hand-crafted features, but will also use algorithms
to learn preferable features for the given task. For this purpose unsupervised learning
algorithms like singular value decomposition and shallow autoencoders can be used.
These algorithms will reduce high dimensional data to meaningful features, after which
these features can be used for a supervised machine learning task.
The thesis will explore the use of deep autoencoders. While deep autoencoders are a
type of unsupervised algorithm that can be used for representation learning, they are
also a type of deep learning algorithm. Shallow autoencoders on the other hand can
be used for representation learning but are not a part of the deep learning algorithms,
as shown in the Venn diagram of Figure 2.1. The difference between shallow and deep
algorithms will be explained in the next subsection.
2.1.3 Deep learning
In representation learning, the computer will learn relevant features from the data and
use these features to predict the outcome. In deep learning, several layers of features
will be stacked on top of each other, in order to create much more complex features,
which can then be used to predict the outcome. The term ‘deep’ refers to the depth
of the layers of features that are built upon each other. In certain complex artificial
intelligence tasks, this approach can be very powerful.
In Figure 2.2 an example is shown of deep learning applied to object recognition in
images. The pixels of the image are given as the input layer. On top of the input layer,
several hidden layers are built from which eventually the type of object is predicted.
These layers are called hidden because they are not given as input or used as output,
instead they will be constructed by the algorithm itself. In the figure one can really see
what the computer is trying to learn: in the first hidden layer it will try to recognise
relevant low level objects like edges and color gradients. In the second hidden layer it
will use those objects to construct shapes like corners and contours. In the next layer
those shapes will be used to construct whole object parts. Finally, in the output layer
the object identity will be predicted from the object parts.
Another name often given to deep learning is artificial neural networks. This name
originates from some of the first implementations of deep learning algorithms in the
1940s. Back then, researchers were using these types of algorithms in neuroscience
as computational models to learn how our own brain works. The researchers were
in fact trying to simulate on the computer the algorithm that the human brain uses to learn. From an artificial intelligence perspective, it also makes sense to study
these models, since we know they can produce intelligence in humans. Nowadays
neuroscience has a diminishing influence on progress in deep learning research.
A lot of the terminology of neuroscience models still exists today however, like the
word ‘neuron’ for the features in the different layers.
Figure 2.2: Visualisation of a convolutional neural network, each layer building on the features of the previous layer [8].
Although the field of deep learning has been around for a long time, it has only recently
become very popular. Progress in computing power, together with big amounts of data,
has made it possible for deep learning algorithms to outperform other simpler machine
learning algorithms on several AI tasks:
• On the MNIST digit image classification problem, deep learning managed to
break the supremacy of support vector machines [9] [10].
• Microsoft’s 2012 version of their audio and video indexing speech system (MAVIS)
based on deep learning managed to reduce the word error rate by about 30%
compared to state-of-the-art models based on Gaussian mixtures [11].
• In natural language processing, the SENNA software which has applications in
tasks such as language modelling, semantic role labelling and syntactic parsing,
approaches or surpasses the state-of-the-art on these tasks and is simpler and
much faster than traditional predictors [12].
2.2 Deep autoencoders
In the thesis, deep autoencoders are explored on a dataset of cooking recipes. Autoencoders are a
type of unsupervised machine learning algorithm: instead of trying to predict a certain
outcome, autoencoders will try to reconstruct their own inputs. If hidden layers have
more neurons than the input layer, the autoencoder can potentially learn the identity
function. Such a reconstruction is not very useful. However, if the network has at
least one hidden layer with a number of neurons lower than the input layer, the use of
an autoencoder becomes more interesting. In this case, the network will have to learn
a compact description of the data in such a way as to retain as much information as
possible, despite the reduced number of dimensions. Figure 2.3 shows a visualisation of
the structure of a deep autoencoder network. The hidden layer with the least number
of neurons is called the bottleneck hidden layer. The structure of an autoencoder will
often be symmetric, with the bottleneck hidden layer in the middle. Although this is
not a strict rule, the reason behind this is very intuitive: if it requires a certain amount
of complexity (number of layers) to encode the inputs to the bottleneck layer, the
decoding back to the output layer would likely require a similar amount of complexity.
Figure 2.3: A visualisation of a deep autoencoder with a central bottleneck hidden layer. The output layer of an autoencoder tries to reconstruct its input layer.
2.2.1 Network architecture
In this subsection the elements of the architecture of an autoencoder network will be
discussed. As mentioned before, each layer is constructed from the previous layer. The
exact definition of this construction can be found in the equation:
y_k = f( ∑_i w_{ki} x_i + b_k ).    (2.1)
Each neuron y_k of the next layer can be defined as a function f over the weighted sum of the neurons of the previous layer. The weighted sum can also contain a bias term b_k. This is equivalent to saying that the previous layer has an extra neuron x_0 with a value fixed to one. Note that for each neuron in the next layer, a different set of weights w_{ki} is used, in order for each neuron to learn a different feature. If we have m
neurons in a layer and n neurons in the next layer, the number of network parameters
between the two layers will thus be given by mn or (m + 1)n when a bias term is
added. The function f is called the activation function. Some typical examples of
activation functions are:
• The linear function: f(x) = x;
• The sigmoid function: f(x) = 1 / (1 + e^{−x});
• The rectifier function: f(x) = max(x, 0).
The activation function of a neural network is often very simple. The complexity is
not generated by using very complex functions, but rather by combining several simple
functions to build layers of features on top of each other, each layer increasing in
complexity. There are however important differences between the activation functions.
The linear activation function is a special case among the activation functions of neural
networks. One of the properties of a linear function is that the combination of two
linear functions is again a linear function. In a neural network, this means that the
addition of an extra linear layer to another linear layer will be equivalent to just one
linear layer. A network with only linear activation functions will thus not be able to
learn complex (non-linear) features. The whole concept of deep learning would not
work here, so other activation functions will have to be added to the network.
Figure 2.4: A plot of the different activation functions (linear, sigmoid, rectifier) and their derivatives.
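As a minimal sketch (assuming NumPy; the thesis itself used the Theano package, see Section 2.5), the three activation functions and the layer construction of Equation 2.1 could be written as:

import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    return np.maximum(x, 0.0)

def layer(x, W, b, f):
    # Equation 2.1: y_k = f(sum_i w_ki x_i + b_k), vectorised over all k
    return f(W @ x + b)

Here W is an n × m weight matrix and b a length-n bias vector, so layer(x, W, b, sigmoid) maps the m neurons of one layer to the n neurons of the next.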
The sigmoid function is another function often used in neural networks. It is a non-
linear function and can thus be used to build a deep network. It also has the nice
property of being a monotonically increasing function that maps (−∞, ∞) to (0, 1)
as can be seen in Figure 2.4. In the data set of the thesis, the inclusion of different
ingredients in the recipes is coded as 0/1. We could use this knowledge about the data
to build an appropriate architecture for the autoencoder. By using a sigmoid activation
function for the output layer, the values of the output will be restricted to the interval
(0, 1). This way, the autoencoder will have a much easier time reconstructing its input
values.
The rectifier function is another non-linear activation function and can thus be used to
build complex features, just like the sigmoid function. But unlike the sigmoid function,
the rectifier function has a property that is very useful to train neural networks: the
derivative of the function is non-vanishing for a large region of the parameter space
(all values x > 0). This is not the case for the derivative of the sigmoid function,
which is only significantly greater than zero for parameter values close to zero. This
will be very useful to prevent the vanishing gradient problem, which will be discussed
in the next section.
The output of the network as function of the input layer is then defined as the chaining
of the activation functions of the different layers. On the output layer a loss function
will be defined. This loss function is used to optimize the network for the task at hand.
The loss function J(θ) of a general neural network is given by:
J(θ) = (1/n) ∑_{i=1}^{n} L( g(x^{(i)}; θ), y^{(i)} ).    (2.2)
In an autoencoder, the network will try to reconstruct its own inputs: x^{(i)} = y^{(i)}. In
this equation, g is the chaining of the activation functions over the different layers.
The function L is called the cost function. The cost function determines how the
deviations from the target values are penalized. The two cost functions explored in the thesis
are:
• The least squares function: L(ŷ, y) = ∑_{j=1}^{p} (ŷ_j − y_j)²
• The cross entropy function: L(ŷ, y) = −∑_{j=1}^{p} [ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ]
The least squares function is one of the most used cost functions in machine learning; many algorithms optimize this cost function. However, when the outcome is restricted to the interval [0, 1] by the use of a sigmoid activation function, it makes much more sense to use the cross entropy. With a least squares cost function, there will not be much difference in cost between an output of 0.01 and 0.0001, while this is a big difference for the arguments of the sigmoid function. The cross entropy will penalize these differences much more heavily, giving much more information to learn from when optimizing the network. This will be important for the vanishing gradient problem, which is discussed in the next section.
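As a small numeric illustration (a sketch, not code from the thesis): for a target value of one, lowering the output from 0.01 to 0.0001 barely changes the least squares cost, but roughly doubles the cross entropy cost.

import math

def least_squares(y_hat, y):
    return (y_hat - y) ** 2

def cross_entropy(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

for y_hat in (0.01, 0.0001):
    print(least_squares(y_hat, 1.0), cross_entropy(y_hat, 1.0))
# least squares: 0.98 versus 1.00; cross entropy: about 4.6 versus about 9.2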
2.2.2 Singular value decomposition for dimension reduction
As stated above, when a bottleneck layer with a low number of neurons is introduced
in the autoencoder, the reconstruction will in general not be perfect anymore. The
autoencoder can then be used to reduce the dimensions of the data while maintaining
as much information as possible. Within machine learning, there is another set of
techniques famous for being able to reduce data: principal component analysis (PCA),
which is a version of singular value decomposition (SVD). The SVD on the data matrix
M with n rows representing the observations and p columns representing the features
is given by the equation:
M = U D V^T,    (2.3)
with D a diagonal matrix with the non-negative singular values on the diagonal, ordered
from large to small. The singular values represent the amount of variation of the data
in their corresponding direction. The matrices U and V are called the left-singular and
right-singular matrices of M respectively. The columns of the matrix V span the space
of the decomposed features. If we denote the matrix containing the first k columns of V by V_k, we can reduce our dataset by projecting the features onto the subspace spanned by V_k using the equation:
Z_k = M V_k.    (2.4)
Because the first k columns of V correspond to the largest singular values of the decomposition, the reduced features Z_k will be the features that capture the most variation in the dataset. This variation is the variation observed from the zero point. The data is reconstructed by projecting the reduced features Z_k back onto the original p-dimensional feature space using the equation:

M_k = Z_k V_k^T.    (2.5)
This reconstruction will be incomplete: the data points will now lie in a k-dimensional subspace of the p-dimensional feature space. PCA is defined as the eigenvalue decomposition of the data covariance matrix. PCA will transform the features to (linearly) uncorrelated features. The decomposed features will be ranked in decreasing order of variation. This variation is the variation observed within the data. When the data is centered, all the data will vary around zero. In this case, SVD and PCA will lead to the same principal component directions. Aside from centring, the SVD/PCA algorithms are also very sensitive to the normalisation of the data.
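As a sketch of Equations 2.3 to 2.5 with NumPy (the random binary matrix below merely stands in for the real recipe data of Section 1.3):

import numpy as np

# Stand-in for the recipe matrix: n recipes as rows, p ingredients as columns
M = np.random.binomial(1, 0.02, size=(1000, 381)).astype(float)

U, d, Vt = np.linalg.svd(M, full_matrices=False)  # Equation 2.3: M = U D V^T
k = 2
Vk = Vt[:k].T     # the first k right-singular vectors, a p x k matrix
Zk = M @ Vk       # Equation 2.4: the reduced features
Mk = Zk @ Vk.T    # Equation 2.5: the rank-k reconstruction

print(np.mean((M - Mk) ** 2))  # mean squared reconstruction error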
2.2.3 Deep autoencoders for dimension reduction
If we choose to use a linear activation function in Equation 2.1, the next layer in
our network is defined as a linear transformation of the previous layer. Equations 2.4
and 2.5 of singular value decomposition are also both linear transformations. Furthermore, the projection parameters V_k are optimized to retain the most variation possible. This is equivalent to minimizing the Frobenius norm of M − M_k for a fixed k, with
the Frobenius norm given by:
||M||_F = √( ∑_{i=1}^{n} ∑_{j=1}^{p} M_{ij}² ).    (2.6)
Minimizing this norm is the same as minimizing the loss defined by Equation 2.2 with a least squares cost function, apart from a square root and a constant factor, neither of which changes the minimum. This shows that an autoencoder
with a least squares cost function, linear activation functions, no bias terms and a
bottleneck hidden layer with k neurons will do exactly the same thing as an SVD with
k dimensions!
Singular value decomposition can thus be seen as a special case of a shallow autoen-
coder. Using autoencoders to perform dimension reduction has the benefit that it can
be generalised to multiple non-linear layers and learn deep features. Where singular
value decomposition will project the data on linear manifolds, autoencoders will be
able to extend this to curved manifolds. This extension will be especially beneficial for
complex, non-linear data.
Figure 2.5: Dimension reduction on 20×20 images of digits using 30-dimensional deep autoencoders and principal component analysis [1].
In Figure 2.5 an example of reduction on the MNIST digit data set is shown. The MNIST dataset has 70 000 images of digits with dimensions 20 × 20, thus 400 pixels in total.
The first row of the figure shows an example of each digit in the dataset. The second
and third rows show the reconstruction of each digit using a deep autoencoder and
PCA respectively. For both, the data was reduced from 400 pixels to 30 dimensions
that will contain the most important features of the digits. From these 30 dimensions,
the original 400 pixels will be reconstructed. The PCA does a reasonable job of
reconstructing the dataset, but it is not very spectacular. The autoencoder does very
well in reconstructing the original dataset. In fact, it does arguably better than the
original digits! For example, the upper loop of the digit eight has been fixed, and both ends of the digit zero have been connected together more cleanly. In order to preserve
as much information as possible while reducing the dimensions of the dataset, the
autoencoder will have to learn complex features of the data. An unclosed loop in the
digit eight is a rarity in the data set and as a result, the autoencoder did not learn this
feature when only having access to 30 dimensions. Instead, it knows the digit looks a
lot like the other eights and will reconstruct a more general eight.
Dimensionality reduction has a lot of applications. One of them is data visualisation.
When the data is reduced to two dimensions, those dimensions can be visualised on
a biplot. Latent semantic analysis (LSA) is a domain that often makes use of such
biplots. LSA is a natural language processing technique that analyses the link between
words and the documents they originate from. The reasoning behind this is that
words that are similar in meaning will be used in similar contexts. A matrix can be
constructed using a collection of the documents with the frequencies of their most
important words as features. Figure 2.6 shows the biplot of the reduction of such a
matrix to two dimensions using SVD (B) and a deep autoencoder (C). After the data
reduction, the documents are coloured by the type of their content. Compared to the
SVD biplot, the deep autoencoder contains much more structure: documents that are
similar to each other are closer together. When the cosine of the angle between two
codes was used to measure similarity, the autoencoder clearly outperformed SVD (A).
The autoencoder for LSA can also be used in another way: instead of reducing to two
continuous features, one could build an autoencoder that has 32 binary bottleneck
neurons. Each document can then be compressed to a bit sequence of length 32. As
a result, documents that are very similar in content will be very similar in the bit
sequence. Each document then has a hash given by its bit sequence. Such a hash can
be used for fast retrieval of documents with similar content. Using neural networks
to hash and retrieve documents, called semantic hashing, is much faster than other
hashing algorithms [13].
2.2.4 Deep autoencoders for representation learning
Reducing features of a dataset to more usable features for supervised machine learning
is a form of representation learning. Although data reduction will throw away some
Figure 2.6: Reduction of words using singular value decomposition (B) and deep autoencoders (C) and the document retrieval accuracy for these methods (A) [1].
information about the data, the reduced features might be more usable for the pre-
diction task. Features that are useless for the prediction task might also be reduced,
which might prevent overfitting. In the thesis, deep autoencoders will be explored to
reduce the ingredients of cooking recipes. With these reduced features, supervised
machine learning will be performed to predict the region of origin of the recipes. Two
commonly used supervised machine learning algorithms will be used for this purpose:
k-nearest neighbors and quadratic discriminant analysis.
k-nearest neighbors (KNN) is one of the simplest supervised machine learning algorithms. For each point in the parameter space, the k nearest neighbors in the training
dataset are determined. For classification, the prediction of the outcome variable for
a given observation will then be given by the majority vote of the outcome variables
of the nearest neighbors. To determine the observations nearest to a given point, a
distance measure must be defined. Very often this distance measure will just be the
Euclidean distance. The algorithm is very sensitive to how the features are scaled, since
features that are much larger than others will easily dominate the distance measure.
The features produced by dimension reduction with the SVD or autoencoder all have
the same dimensions, scaling the features before use will thus not be important. The
Figure 2.7: An example of the predictions of three supervised machine learning algorithms: LDA (left), QDA (right) and KNN (below).
value k of the KNN algorithm will be determined by a 5-fold cross-validation on the
validation data.
Linear discriminant analysis (LDA) is another common supervised machine learning
algorithm. LDA is closely related to PCA: where PCA is an unsupervised machine
learning algorithm that tries to find the directions in the data with the most variation,
LDA is a supervised algorithm that will try to find the directions in the data with
the most variation in the outcome variable. In these directions of greatest variation,
the middle points will be determined. The outcome variable will then be predicted
depending on which side of the middle points the data lies. Quadratic discriminant
analysis (QDA) is an extension of this method that does allow for the classes to have
different covariances, as will be the case for the regions of the recipes. For this reason,
QDA will be used instead of LDA.
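As a hypothetical sketch of this prediction step (the thesis does not name the implementation it used; scikit-learn is assumed here, and the random arrays merely stand in for the reduced features and region labels):

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

Z = np.random.randn(500, 2)        # stand-in for the reduced features
r = np.random.randint(0, 10, 500)  # stand-in for the region labels

Z_train, Z_test, r_train, r_test = train_test_split(Z, r, test_size=0.3)

# KNN, with the value of k chosen by 5-fold cross-validation
knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15, 50]}, cv=5)
knn.fit(Z_train, r_train)
print("KNN accuracy:", knn.score(Z_test, r_test))

qda = QuadraticDiscriminantAnalysis()
qda.fit(Z_train, r_train)
print("QDA accuracy:", qda.score(Z_test, r_test))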
2.2.5 Deep autoencoders for collaborative filtering
Data reduction methods can also be used for the purpose of collaborative filtering
to build recommender systems. An example of recommender systems can be found
in the services Netflix and Amazon, where products will be recommended that might
be interesting to the specific user. Collaborative filtering algorithms try to solve this
problem from the following perspective: using the preferences of a customer, what
other products can be recommended to that customer based on other customers with
similar preferences? Data reduction methods can here be used to learn what the preferences of the customers look like: in order to have a good reconstruction, meaningful
features will have to be extracted from the data, while unimportant features will be
thrown away. The observations will be reconstructed to match better with the other
observations. An example of this was already seen in the reconstruction of the digits
zero and eight in Figure 2.5.
Collaborative filtering can be used on the recipe dataset to create a good recipe from
a selection of ingredients that do not necessarily form a good recipe to start with.
A deep autoencoder must first be trained on a training dataset containing a lot of
recipes. The autoencoder will learn from this dataset how ingredients are combined
in the recipes. The trained network can then be used to recommend adaptations to
the selection of ingredients. These adaptations can simply be done by putting the
selection of ingredients through the network and checking the output. The output
of the autoencoder was forced to have values between zero and one by using the
sigmoid function. Therefore, if an ingredient does not match well with the rest of
the ingredients, this ingredient will have an output close to zero. On the other hand,
ingredients that were not present in the selection but would match well will
have an output more towards one.
The performance of the autoencoder can be tested on the test dataset that has been
set aside for this purpose. Each recipe in the test data will be modified by randomly
adding or removing one ingredient. Adding an ingredient is done by changing the
value in the recipe from zero to one, while removing an ingredient is done by changing
the value from one to zero. The deep autoencoder can then be used to determine
which ingredient has been modified by examining the reconstruction of the ingredi-
ents. If an ingredient was removed, the ingredients that were not included in the
modified recipe were ranked in decreasing order on their reconstruction values. If an
ingredient was added, the ingredients that were included in the modified recipe were
ranked in increasing order. In both cases, the rank of the changed ingredient will be
determined. A low rank implies that the recommender system did well in finding back
which ingredient was changed.
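A sketch of how the rank of the changed ingredient can be computed from the autoencoder output (a hypothetical helper, not code from the thesis):

import numpy as np

def rank_of_change(modified, reconstruction, changed, removed=True):
    # modified: binary vector of the modified recipe;
    # reconstruction: autoencoder output in [0, 1];
    # changed: index of the ingredient that was removed or added.
    if removed:
        # Candidates: ingredients absent from the modified recipe,
        # ranked in decreasing order of their reconstruction value.
        candidates = np.where(modified == 0)[0]
        order = candidates[np.argsort(-reconstruction[candidates])]
    else:
        # Candidates: ingredients present in the modified recipe,
        # ranked in increasing order of their reconstruction value.
        candidates = np.where(modified == 1)[0]
        order = candidates[np.argsort(reconstruction[candidates])]
    return int(np.where(order == changed)[0][0]) + 1  # ranks start at 1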
From the ranks of the changed ingredients, several performance measures can be
extracted. For the ranks of the removed ingredients, the mean rank, median rank and
the percentage of recipes with a reconstruction ranking in the top 10 have been used
as performance measures. These performance measures are the same measures that
have been used in De Clercq et al. [2], to enable comparison with their recommender
systems. For the ranks of the added ingredients, the mean rank, median rank and the
percentage of ingredients on the first rank are used.
2.3 Training the network with gradient descent
In the previous section, the different elements of the network architecture have been
explained. In this section, the optimization of the network parameters will be discussed.
Unlike some other machine learning algorithms, it is generally not possible to find an
analytical solution for the parameters of a neural network. However, the network can be
optimized using the gradient descent algorithm. The gradient descent algorithm starts
from a certain initialisation of the parameters. From this start point, the direction of
steepest descent of the loss function will be determined. This direction will be given
by the opposite of the gradient of the loss function. A small step will be taken in
the direction of the steepest descent. If this step is small enough, the loss should
be smaller for this new set of parameters. This procedure will be repeated until the
algorithm stops improving the loss, around the global minimum if all goes well. The
exact equation of how to update the network parameters is given by:
θ ← θ − δ ∇_θ J(θ).    (2.7)
The gradient symbol ∇_θ represents the vector of partial derivatives of the loss function with respect to the parameters θ. This equation also introduces a new parameter δ, which is called the
learning rate. The learning rate determines the size of the steps that will be taken
in the direction of steepest descent. The gradient, which determines the direction of
steepest descent, is also responsible for the size of the steps. This will also help to
converge to a global minimum: if the loss function behaves well enough (has a con-
tinuous derivative) around the convergence point, the gradient will diminish around
this point, slowing down the gradient descent algorithm. Figure 2.8 shows how the
gradient descent algorithm can converge from a certain initial position to the global
minimum.
Figure 2.8: A visualisation of the gradient descent algorithm on a loss function J with two parameters θ_1 and θ_2 [14].
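A minimal sketch of the update rule of Equation 2.7 (the quadratic loss below is only an illustration):

import numpy as np

def gradient_descent(grad, theta0, delta=0.04, n_steps=500):
    # Equation 2.7: theta <- theta - delta * grad(theta)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - delta * grad(theta)
    return theta

# Example: J(theta) = theta_1^2 + 10 * theta_2^2, so grad = (2 theta_1, 20 theta_2)
grad = lambda t: np.array([2 * t[0], 20 * t[1]])
print(gradient_descent(grad, [3.0, 2.0]))  # converges towards (0, 0)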
2.3.1 Local minima
When using the gradient descent algorithm to find the optimal solution of a neural
network, there are a number of potential problems that can arise. When the activation
functions of the network are non-linear, the loss function will in general be non-convex.
This means that there will be several local minima in the loss function. It is very possible
for the gradient descent algorithm to converge to one of the local minima instead of
the global minimum by starting from a different position, as depicted in Figure 2.9.
In practice, most local minima do not play a big role in the training and application
of neural networks. These local minima are only abundant in regions of the parameter
space which have a loss close to the loss of the global minimum. For practical appli-
cations, it does not matter if the solution is a local or global minimum, as long as the
loss is close enough to the global minimum.
Figure 2.9: A different initial position can lead to convergence to a local minimum instead of the global minimum [14].
2.3.2 The vanishing gradient problem
There are however some special local minima that can be detrimental to proper con-
vergence of the gradient descent algorithm. As mentioned before, the output of the
network is given by the chaining of the activation functions of the different layers. To
find the derivative of a chained function, the chain rule can be used as defined in the
equation:
(f_1(f_2(x)))′ = f_2′(x) · f_1′(f_2(x)).    (2.8)
As can be seen in the equation, the derivatives of the parameters of a certain layer
((f_1(f_2(x)))′) will depend on the values of the derivatives of the parameters of the next layer (f_1′(f_2(x))). If the derivatives of all the parameters of a certain layer have very
low values, the derivatives of all the preceding layers will also have very low values. As
a result, the network parameters will hardly update in those layers and the gradient
descent algorithm will not be able to converge to a good solution in a reasonable
amount of time.
This situation is especially likely to occur in deep autoencoders [1], for two reasons:
a high number of layers and a small number of neurons in the bottleneck layer. A lot
of layers can be problematic if the derivatives of the activation functions are smaller
than one, as is the case for the sigmoid function for example. Since the derivatives of
the parameters of a certain layer will be the product of derivatives of the activation
functions of all the next layers, the gradient for the parameters in the first layers can
become very small. This problem is known as the vanishing gradient problem [15].
The bottleneck layer of autoencoders makes the problem even worse. Due to the low
number of neurons in this layer, it is much more likely for all the derivatives of the
bottleneck neurons to become very small, preventing the gradient descent algorithm
from working properly in the layers before the bottleneck layer.
A similar situation occurs when all but one of the bottleneck neurons have gradients close
to zero. The gradient descent algorithm will still be able to do some learning through
that one neuron, but the learning will be limited. It is possible that the learning through
this one neuron is not enough to start activating the other neurons during the process.
There are several solutions for these problems. One of them is using pretraining to find
good initial values of the parameters, after which the gradient descent algorithm will
easily converge to a good value [16]. This pretraining is usually done layer by layer,
using restricted Boltzmann machines, an unsupervised deep learning algorithm. In the
thesis, pretraining of difficult network architectures was done using the neuron values
of trained networks with the same number of layers and neurons, but with easier-to-train activation functions. Although the author has no obvious explanation for why this worked, it was a solution that worked for those network architectures where the normal
initialisation procedure (discussed in the next subsection) failed.
Another solution to the problem of plateaus and the vanishing gradient problem is the
use of appropriate activation functions. The rectifier function derivatives are either zero
or one, while the sigmoid function only has derivatives smaller than one. The gradients
of deep neural networks using rectifier activation functions will as a result not diminish
towards zero unlike the sigmoid activation functions, preventing the vanishing gradient
problem that occurs with many layers of sigmoid functions. The rectifier does have
another problem however: having a zero derivative for all negative input values. For
autoencoders, this can make it likely to have a bottleneck layer where all neurons have
a gradient of zero. To solve this, some modifications of the rectifier function which
have a small but non-zero gradient for the negative input values have been invented [17].
This problem only occurred for the activation function to the bottleneck layer: using
a linear or sigmoid activation function for this layer prevented the problem. For the
architectures with a rectifier activation function to the bottleneck layer, choosing a
proper initialisation size of the parameters helped to make proper convergence much
more likely.
2.3.3 Initialisation of the network parameters
One could naively initialise all the parameters to the same value; however, this approach
would not work: if all the parameters have the same value, their gradients will also
have the same value. After every gradient descent step, they will still have the same
value. The gradient descent algorithm will therefore not be able to work properly.
To solve this problem, the parameters of the network can be initialised with random
values.
The size of the range of these random values is important. If the range of the random
values is too large or too small, the gradient descent algorithm will have to do a lot of
work to adjust the parameters to their appropriate sizes. This will take a lot of time to
compute and the algorithm is more likely to get stuck in one of the plateaus discussed
in the previous subsection.
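A minimal sketch of such a random initialisation (assuming NumPy; the initialisation range is one of the hyperparameters discussed in Section 2.4):

import numpy as np

def init_weights(m, n, scale=0.1):
    # Small random values in [-scale, scale] break the symmetry between
    # the n neurons of the next layer (m neurons in the previous layer).
    return np.random.uniform(-scale, scale, size=(n, m))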
2.3.4 Minibatch gradient descent
Although the gradient descent algorithm generally works well to converge to a good
solution, it can often take a very long time. One way to improve the algorithm
is to use minibatch gradient descent instead of deterministic gradient descent. In
deterministic gradient descent, the gradient is calculated over the whole dataset, before the
network parameters are updated with this gradient. This is often not very efficient:
in large datasets, the calculation of the gradient can take a very long time. However,
the accuracy of the gradient estimate only increases with a factor √n, with
n being the number of training samples in the dataset. There is also often a lot of
redundancy in the dataset: different data samples give very similar contributions to
the gradient.
Minibatch gradient descent will calculate the gradient only over a small batch of
training samples, typically 10 to 100 samples, before doing a gradient descent step.
For each iteration through the dataset, the order of the training samples will be randomized and the data split into batches. In order to compensate for the reduced accuracy of the gradient, the learning
rate will have to be reduced. After the gradient descent step, the gradient will be
calculated over the next minibatch, and so on. The gradients calculated over the
minibatches are good enough for the gradient descent algorithm to work, but will
be calculated much faster, allowing the gradient descent algorithm to converge in a
fraction of the time it would take with deterministic gradient descent. In minibatch
gradient descent with a very large training dataset, it is possible for the algorithm to
converge before the end of the dataset is even reached! In general, several epochs
through the dataset will be needed to converge to the best value.
Minibatch gradient descent also has another benefit. The randomness introduced
by the minibatches will have a regularising effect, lowering the degree of overfitting,
making the network generalise better [18]. This regularising effect is the strongest for
batches of size one, but training the network with batches of size one can also take a
very long time. When the batches have size one, the algorithm is called stochastic
gradient descent.
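A sketch of minibatch gradient descent (grad(theta, batch) is assumed to return the gradient of the loss over the given batch; not code from the thesis):

import numpy as np

def minibatch_gd(grad, X, theta0, delta=0.01, batch_size=50, n_epochs=10):
    theta = np.asarray(theta0, dtype=float)
    n = X.shape[0]
    for _ in range(n_epochs):
        order = np.random.permutation(n)  # randomise the order every epoch
        for start in range(0, n, batch_size):
            batch = X[order[start:start + batch_size]]
            theta = theta - delta * grad(theta, batch)  # one update per batch
    return theta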
2.3.5 Momentum
Another improvement to the gradient descent algorithm can be made by adding a
momentum to the direction of descent. This method of momentum [19] can make
the algorithm converge faster and helps to prevent getting stuck in local minima.
In Equation 2.7 the changes in the network parameters θ are directly related to the
gradient of the loss function. With momentum, the gradient will be used to update a
momentum term for each network parameter as follows:
v ← α v − δ ∇_θ J(θ).    (2.9)
This momentum term can be seen as an exponentially decaying moving average of
the past gradients. The momentum term will then be used to update the parameters
using:
θ ← θ + v. (2.10)
The momentum method introduces a new parameter α, the inertia parameter. This
parameter can have values in the range [0, 1). It is used to determine the fraction of
the previous momentum step that remains in the next step. If α is equal to zero, we
would have no momentum. If α were one, the contributions of the past gradients would just keep adding up in the momentum term, possibly leading to
a diverging momentum. By using a value for α smaller than one, a decay is added to
the momentum. The momentum term will be the largest when all the past gradients
are oriented in the same direction in the parameter space, in which case they
will amplify each other. The inertia will determine the upper bound for the size of
the momentum in relation to the size of the gradients. This factor can be found by
substituting Equation 2.9 repeatedly in Equation 2.10, yielding
δ + δα + δα² + δα³ + … = δ / (1 − α).    (2.11)
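A sketch of the momentum update of Equations 2.9 and 2.10 (same conventions as the earlier gradient descent sketch):

import numpy as np

def gd_momentum(grad, theta0, delta=0.04, alpha=0.5, n_steps=500):
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = alpha * v - delta * grad(theta)  # Equation 2.9
        theta = theta + v                    # Equation 2.10
    return theta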
One can imagine the regions of equal loss in the parameter space of a neural network
to look similar to concentric hyperellipses. Some axes of those hyperellipses will be
very short and other axes will be very long. The direction of the gradient in such a
configuration can then be almost perpendicular to the direction of the centre of the
ellipses. This will make the gradient descent algorithm very inefficient. In Figure 2.10
an example is shown of the gradient descent algorithm without momentum in an
elliptical loss function with a long and short axis.
Figure 2.10: Gradient descent without momentum.
Momentum solves this problem: along the axes where the gradient changes direction
often (short axes), the momentum will be diminished, while along the axes where the
gradient does not change often in direction (long axes), the momentum will grow in
size. Figure 2.11 shows the gradient descent algorithm with momentum with α = 0.5.
Figure 2.11: Gradient descent with momentum with an inertia α = 0.5.
2.4 Optimisation of the hyperparameters
The previous section discussed how the gradient descent algorithm can be used to
train neural networks. While the gradient descent algorithm works generally very well
for this purpose, a lot depends on properly chosen parameters for the gradient descent
algorithm. These parameters, different from the actual network parameters, are called
hyperparameters. The choice of the network architecture can also be viewed as being
part of the hyperparameters: the number of layers and neurons in each layer, the
choice of activation functions and loss function and the choice to add a bias term
or not. As mentioned in the subsection about machine learning, in order to test
the true performance of an algorithm, a separate test set needs to be used. The
hyperparameters will be optimized against a measure of the performance. However,
the test set cannot be used for this purpose: optimizing the hyperparameters is also
a form of training of the algorithm. Tuning the hyperparameters against the test
performance could result in overfitting towards this dataset. In such a case, the test
set would not give a realistic measure of the performance anymore. To solve this
problem, a second dataset will be separated from the training data, on which the
hyperparameters will be optimized. This dataset is called the validation dataset.
2.4.1 The gradient descent hyperparameters
The optimal values for the gradient descent algorithm were first tuned manually
until a reasonable solution was found. After this, an algorithm was used to find the
optimal solution in the region of the solution found by hand.
Originally, grid search was used for this purpose. In grid search, for each parameter a
grid of several values is defined. After this, each parameter combination is tried out to
find the best combination. While this approach worked well initially, it was not feasible
in the long term: the training of the more complex autoencoders could take up to
an hour. If for example we picked five values for each of the four gradient descent
parameters, it would take around 26 days to try them all out! Also, the possible values
of the parameters in a grid search are fixed to only five values, while the optimal value
might lie somewhere in between.
In this situation, a random search for hyperparameter optimisation works much better.
A random search will pick the values of parameters randomly in a predefined range of
possible values and will try to find the best set of parameters within a certain number
of attempts. Random search has the obvious benefit of allowing a continuous range
of parameters to try out. But random search has an even bigger benefit: it does
much better in finding a good solution in a multidimensional hyperparameter space
compared to grid search, within the same amount of time. This is because some
hyperparameters will have less effect than others. Grid search will try out several
combinations of these less important parameters while keeping the other parameters
constant, whereas random search will change all hyperparameters on every draw.
In this thesis, the approach has been adapted a step further: after each draw, the
hyperparameter ranges are recentred around the best solution found so far. The sizes
of the ranges were also shrunk manually as the search closed in on the best solution
(i.e. when the random search slowed down in finding better solutions near the current
best). A minimal sketch of this adaptive random search is given below.
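The function names and the fixed shrink factor in the sketch are illustrative; in the thesis the ranges were shrunk by hand:

```python
import random

def adaptive_random_search(evaluate, ranges, n_draws=50, shrink=0.95):
    """Random search that recentres the ranges around the best draw so far.

    evaluate : function mapping a dict of hyperparameters to a loss
    ranges   : {name: (low, high)} initial search ranges
    """
    centers = {k: (lo + hi) / 2 for k, (lo, hi) in ranges.items()}
    widths = {k: hi - lo for k, (lo, hi) in ranges.items()}
    best_params, best_loss = None, float('inf')
    for _ in range(n_draws):
        params = {k: random.uniform(centers[k] - widths[k] / 2,
                                    centers[k] + widths[k] / 2)
                  for k in ranges}
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
            centers = dict(params)                  # recentre on the best draw
        widths = {k: w * shrink for k, w in widths.items()}  # narrow the search
    return best_params, best_loss
```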
This adapted random search made it possible to find the optimal parameters for each
network architecture within a day. In the next subsections, the effect of the parameters
on the convergence of the gradient descent algorithm will be explained.
2.4.1.1 Learning rate δ
The learning rate is one of the most important parameters of the gradient descent
algorithm. If the learning rate is very high, the gradient descent algorithm can make
the loss diverge.
Figure 2.12: Divergence of the gradient descent algorithm with a too large learning rate on a parabolic loss function [20].
Figure 2.12 shows an example of such a divergence on a parabolic loss function.
Starting from a point on the parabola, taking a too large step in the direction of
steepest descent makes the parameter end up on the other side of the parabola. At
this new point, the gradient is even larger than at the starting point. Since the size
of the gradient descent steps also depends on the size of the gradient, the next step
will be even larger. The step size can continue to grow this way and lead to divergence
of the loss, as the small sketch below demonstrates.
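The mechanism is easy to reproduce on the parabolic loss J(w) = w², whose gradient step is w ← (1 − 2δ)w: any learning rate δ above 1 makes each step overshoot to a point with a larger gradient. A minimal sketch:

```python
def parabola_steps(delta, w=1.0, n=5):
    """Plain gradient descent on J(w) = w^2; the gradient of w^2 is 2w."""
    trace = [w]
    for _ in range(n):
        w = w - delta * 2 * w     # w is multiplied by (1 - 2*delta) each step
        trace.append(w)
    return trace

print(parabola_steps(0.1))   # factor 0.8: converges towards 0
print(parabola_steps(1.1))   # factor -1.2: |w| grows every step, divergence
```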
Figure 2.13: The effect of different learning rates on the convergence of the loss function with the gradient descent algorithm [21].
But even if the loss does not diverge under the gradient descent algorithm, this does
not mean that the learning rate is well chosen. In Figure 2.13 the loss curves during
gradient descent are shown for different learning rates. A learning rate that is too
large, but still small enough to converge, will initially make rapid progress towards
the global minimum, but the loss may flatten out too early. Similar to the diverging
case, such a learning rate constantly overshoots the global minimum: each step is
taken in the right direction but is too large, ending up on the other side of the
minimum. The loss will then never fully converge the way it would with a well-chosen
learning rate. A learning rate that is too small poses the opposite problem: convergence
takes a very long time, and the gradient descent algorithm is more likely to get stuck
in local minima, saddle points or other flat regions of the loss function.
2.4.1.2 Batch size
Another important parameter is the size of the minibatches for the minibatch gradient
descent. As discussed, using batches smaller than the whole dataset can speed up the
gradient descent algorithm a lot. But using a too small batch size can also slow down
the algorithm: the smaller the batch size, the more random the gradient will be. To
compensate for this added randomness, the learning rate will have to be decreased.
While smaller batches reduce the cost of calculating the gradient in each step, more
steps will be needed to converge. The optimal batch size is found at the trade-off
between these two effects; a sketch of the minibatch loop is given below.
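A minimal sketch of one epoch of the minibatch loop (grad_fn is a hypothetical function returning the gradient of the loss over a batch):

```python
import numpy as np

def minibatch_epoch(X, grad_fn, w, delta=0.01, batch_size=250):
    """One epoch of minibatch gradient descent.

    Smaller batches give cheaper but noisier gradient estimates,
    so the learning rate delta may have to be lowered to compensate.
    """
    idx = np.random.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = X[idx[start:start + batch_size]]
        w = w - delta * grad_fn(batch, w)          # one step per minibatch
    return w
```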
2.4.1.3 Inertia α
Adding momentum to the gradient descent algorithm is another method by which the
algorithm can converge faster and avoid local minima. The size of the momentum
is determined by the inertia α, from which the upper bound $\frac{1}{1-\alpha}$ of the stepsize in
relation to the gradient was derived in Equation 2.11. The momentum will increase
the stepsize by this factor in the directions that maintain the orientation of their
gradients. However, too much momentum will lead to oscillations and instabilities in
the gradient descent algorithm. Reducing the learning rate will help to prevent these
instabilities. The optimal value for the inertia will be found by the trade off between
boosting the directions that maintain orientation of their gradients and keeping a high
enough learning rate in the other directions.
2.4.1.4 Initialisation range
The network parameters need to be initialised at random. One important aspect of
this initialisation is the size of the range of these random values. This size needs to
match the size of the optimal network parameters in order for the gradient descent
algorithm to work well. If the initialisation range is too small or too large, the
gradient descent algorithm will have to work too hard, making it likely to converge
to one of the trivial local minimum solutions discussed in the previous section; a
minimal sketch of such an initialisation is given below.
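The default range of 0.1 in this sketch is purely illustrative; in the thesis the range was treated as a hyperparameter:

```python
import numpy as np

def init_weights(n_in, n_out, scale=0.1):
    """Initialise a weight matrix uniformly in [-scale, scale].

    The range 'scale' is a hyperparameter: values that are too small or
    too large make gradient descent likely to end in a trivial local minimum.
    """
    return np.random.uniform(-scale, scale, size=(n_in, n_out))
```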
2.4.2 The network architecture
For each (deep) network architecture, it takes a long time to search for the best
gradient descent parameters and train the model with these optimal parameters. The
different network architectures have been optimized manually for the different purposes
of the thesis: using an algorithmic search would not have been feasible given the time
constraints.
2.5 Python and the Theano package
The thesis has been programmed in Python. Python is open source software with a
large online community that extends the language with many user-written packages.
This makes it convenient for individual users to perform computing tasks such as
machine learning. One of the Python packages that is especially useful for deep
learning is the Theano package [22].
Theano gives the user the option to define variables symbolically, leaving the
calculations with those variables to the software. It also compiles the code to run
faster and offers the option to run the code on a GPU instead of a CPU. GPUs can
run code in parallel, and calculations such as finding the gradients for each
observation of a batch are easy to parallelize, making the algorithm run much faster
on a GPU. This option has not been explored in the thesis because of a lack of
compatible hardware. One of the most important features of the Theano package is
the efficient calculation of the gradient, the most important computational task when
training a neural network. A minimal example of the Theano workflow is given below.
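In this example a symbolic loss is defined, Theano derives its gradient, and a compiled function performs one gradient descent update (the toy loss and the learning rate of 0.1 are illustrative):

```python
import theano
import theano.tensor as T

x = T.dvector('x')                    # symbolic input vector
w = theano.shared(0.5, name='w')      # shared (trainable) parameter
loss = T.mean((w * x - 1) ** 2)       # symbolic loss expression
grad = T.grad(loss, w)                # Theano derives the gradient symbolically

# Compile a training step that updates w by one gradient descent step
train = theano.function([x], loss, updates=[(w, w - 0.1 * grad)])
```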
2.5.1 Backward propagation of the gradient
One of the limiting factors in the early years of neural networks was the very slow
calculation of the gradients. Because they were so slow to train, there was not a lot
of interest in researching them, halting a lot of the progress in this domain for several
years. To give an example of what it requires to calculate the gradients, imagine a
network with layers X, Y and Z as the last three layers. On the output layer Z a loss
function J(θ) is defined, which has to be optimized with gradient descent. As was
shown in Equation 2.8, the gradients of the network can be found with the chain rule.
The partial derivative of the loss with respect to the weight $w^X_{ij}$ of the activation function going into neuron $X_j$ can then be written as
$$\frac{\partial J(\theta)}{\partial w^X_{ij}} = \sum_k \sum_l \frac{\partial X_j}{\partial w^X_{ij}} \frac{\partial Y_k}{\partial X_j} \frac{\partial Z_l}{\partial Y_k} \frac{\partial J(\theta)}{\partial Z_l}. \qquad (2.12)$$
For deep neural networks, this equation involves calculating and summing over many
variables. A naive implementation would perform this calculation for each neuron
separately, starting from the first layers. However, the derivatives can be calculated
much faster. Equation 2.12 contains many of the same factors for weights in the same
layer, and many factors are also shared across the different layers. For example, all
the weights in the network need the terms $\frac{\partial J(\theta)}{\partial Z_l}$ to calculate their derivatives.
Instead of repeating these calculations for every weight, one can start from the last
layer and calculate its derivatives first. Subsequently, the derivatives of the
second-to-last layer can be calculated from the previously obtained values, and so on,
going back to the first layers. This approach of propagating the error backward
through the network is called the backward propagation method [23, 24]. Not using
backward propagation could easily slow down the gradient descent algorithm by
multiple orders of magnitude. The Theano package makes the implementation very
easy: the user only has to define the mathematical relation between the variables,
after which the package calculates the gradient with respect to all parameters using
backward propagation.
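To illustrate the reuse of factors, the following NumPy sketch (a hypothetical two-layer sigmoid network with a least squares loss, not the thesis architecture) computes the error term of each layer once and reuses it for all weight gradients of that layer:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, t, W1, W2):
    """Weight gradients of a least squares loss for a two-layer sigmoid network.

    The delta of the output layer is computed once and propagated backward,
    instead of being re-derived for every weight as in a naive reading of
    Equation 2.12.
    """
    # Forward pass
    y = sigmoid(W1 @ x)                        # hidden layer
    z = sigmoid(W2 @ y)                        # output layer
    # Backward pass: each layer's delta is computed once, then reused
    delta_z = (z - t) * z * (1 - z)            # error at the output layer
    delta_y = (W2.T @ delta_z) * y * (1 - y)   # error propagated backward
    grad_W2 = np.outer(delta_z, y)             # reuses delta_z for all weights
    grad_W1 = np.outer(delta_y, x)
    return grad_W1, grad_W2
```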
2.5.2 Other packages
There are many other packages for deep learning like Caffe and TensorFlow. Also many
other packages, like Lasagne and Nolearn, have been built on Theano for the purpose
of deep learning. These packages take care of the implementation of the network and
its gradient descent algorithm. While this is useful for commercial purposes, it does
not give the user much incentive to learn how and why these algorithms precisely work.
One of the purposes of the thesis was to get familiar with the concepts of the gradient
descent algorithm, implemented with two of its most important extensions: minibatches
and momentum. Because of this, it has been decided to code the thesis fully using
only the Theano package for the gradient descent.
Chapter 3
Results
3.1 Training the autoencoders
In this thesis, the autoencoder networks are trained using the gradient descent algo-
rithm. The gradient descent algorithm has no inherent endpoint: the convergence of
the algorithm has to be decided by hand or using certain convergence criteria like a
maximum run time. Keeping track of the loss of the network during the gradient de-
scent algorithm can be a useful tool to help with this task. The loss can be checked
on a validation set every fixed number of steps, and the resulting sequence of loss
values can then be plotted against the number of steps. Visual inspection of such a
loss function plot gives a very good idea of the convergence of the gradient descent
algorithm, often much better than predefined convergence criteria; a sketch of this
bookkeeping is given below. The loss function plots will be used to compare the effect
of different adaptations of the gradient descent algorithm in the following subsections.
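In this sketch, train_step and val_loss are hypothetical callables wrapping the training step and the validation loss:

```python
def track_losses(train_step, val_loss, n_steps=10000, check_every=100):
    """Run gradient descent while recording the validation loss periodically.

    The returned sequence of (step, loss) pairs can be plotted against the
    step number to inspect the convergence visually.
    """
    history = []
    for step in range(n_steps):
        train_step()
        if step % check_every == 0:
            history.append((step, val_loss()))
    return history
```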
3.1.1 Adding extensions to the gradient descent algorithm
In the first plot of Figure 3.1 the loss function is shown for a simple autoencoder as a
toy model. The autoencoder has a linear activation function and one hidden layer
containing two neurons. In other words, the autoencoder will learn to perform singular
value decomposition with two dimensions. After the initialisation of the network parameters,
the values of the bottleneck neurons will all be close to zero. Initially the gradient
descent algorithm will not be able to learn much, due to lack of structure in the
network.

Figure 3.1: The loss function of the gradient descent algorithm and some extensions: (a) normal gradient descent (544 sec), (b) double learning rate (270 sec), (c) minibatches of 250 (68 sec), (d) momentum with 0.5 inertia (296 sec). The run time is shown in brackets.

This initial plateau is visible on the loss function at the start, equal to the
loss of predicting no ingredients for each recipe. Once the gradient descent algorithm
has learned some structure of the data, the first principal component will be learned at
a fast pace, after which the algorithm slows down again on a second plateau. At this
point, the second principal component has yet to be learned: one of the neuron values
is still close to zero, or both neurons have almost the same value for all observations.
In either case, the network is not using its full learning capacity. After escaping this
plateau, the gradient descent algorithm will learn the second principal component,
converging to the same loss as the singular value decomposition.
The time it took to train this simple network was rather long: 544 seconds. In each
of the other plots of the figure the gradient descent algorithm has been improved.
The second plot has a double learning rate and converged in 270 seconds, about half
of the original time. Increasing the learning rate will generally speed up the
algorithm, but there is an upper bound above which the algorithm might
not converge fully or even start to diverge. Minibatches have been used in the third
plot, yielding a convergence time of 68 seconds. Adding minibatches to the gradient
descent algorithm will speed up the calculation of the gradient in each step. In the
fourth plot momentum was added, yielding a convergence time of 296 seconds. Adding
momentum with 0.5 inertia was almost as useful as doubling the learning rate. This is
to be expected: doubling the learning rate will increase the stepsize in all directions,
while adding momentum with 0.5 inertia will double the stepsize in the directions
where the gradients do not change orientation, but diminish for the other directions.
Combining all these optimisations with the best parameter values enables the autoencoder
to be fully trained in a few seconds. This is even faster than the analytical computation
of the singular value decomposition, which takes about 15 seconds.
Figure 3.2: Too much momentum can make the gradient descent algorithm unstable (left). A diverging loss function as a result of a too large learning rate (right).
While the tools discussed above can speed up the algorithm, they can also cause
instability when overly extreme parameter values are chosen. Figure 3.2 shows two
examples of this. The first plot shows the convergence of the gradient descent algorithm
with too much momentum (inertia = 0.95). The algorithm fluctuates up and down a lot
before settling at the convergence value. This could have ended even worse, with a
diverging loss. An example of this is shown in the second plot, where a too large
learning rate was chosen. After some fluctuation, the loss shoots up. The next values
of the loss are not shown on the plot because they reached the upper bound of the
NumPy floats within the three following steps.
In Figure 3.3 the loss function for a deep autoencoder is shown.

Figure 3.3: The loss function of a complex network for the training data and validation data.

The gradient descent algorithm starts off very fast but slows down rapidly. The loss
of the deep autoencoder almost instantly drops below 0.066, which is the loss of a
singular value decomposition with the same dimension reduction and cost function. If
the gradient descent algorithm trains for too long, the training loss might start to
dip under the validation loss for models with a lot of parameters. This did not happen
for the models in this thesis: there was a lot of data and the models did not have
enough parameters for overfitting to occur.
3.1.2 Plateaus
In a linear network with a least squares cost function, the loss is always convex,
guaranteeing that the gradient descent algorithm can always escape the learning
plateaus with enough small steps. This is not the case for a general loss function. For
a non-linear network it is possible for the gradient descent algorithm to get stuck on
certain plateaus in the loss function. In Figure 3.4 the bottleneck neurons for several
types of plateaus are shown.

Figure 3.4: The bottleneck neurons of a gradient descent algorithm getting stuck on (a) a zero-dimensional, (b) a one-dimensional or (c) a two-dimensional subspace of the bottleneck neurons.

The gradient descent algorithm can get stuck when all the bottleneck neurons have a
value of zero or close to zero, as is the case in the first plot of the figure. In the
second plot the gradient descent algorithm has learned meaningful values for the first
neuron but not for the second and is unable to escape the plateau. In the third plot,
the values of the three bottleneck neurons are stuck on a two-dimensional subspace.
The gradient descent algorithm can get stuck in these plateaus if the hyperparameters
are not properly chosen.
3.2 Comparing data reduction methods
3.2.1 Singular value decomposition
In Figure 3.5 biplots of the first two components of singular value decomposition and
its variants are shown. While SVD captures the directions of greatest variation around
zero, PCA captures the directions of greatest variation within the dataset, which is
more useful. When the data is centered around zero, SVD learns the same principal
component directions as PCA. Both SVD and PCA are also very sensitive to
normalisation. Without normalisation the algorithms will try to model the
frequencies of the ingredients and the number of ingredients per recipe. By normalising
the data, these factors are eliminated and more meaningful features of the data will
be learned. To check how well the reduced features maintain the structure of the
data, the recipes are colored by their regions. As expected from the discussion above,
PCA with normalisation seems to preserve the structure best: recipes from similar
regions lie closer together.
Figure 3.5: Biplots of different variations of singular value decomposition: (a) SVD, (b) SVD with normalisation, (c) PCA, (d) PCA with normalisation.
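The relation between SVD and PCA discussed above can be checked directly with a small NumPy sketch: projecting the (optionally centred and scaled) data onto its first two right-singular vectors gives the reduced features, and with centering the singular directions coincide with the PCA directions.

```python
import numpy as np

def reduce_2d(X, center=True, normalise=True):
    """Project X onto its first two right-singular vectors.

    Without centering this is plain SVD; with centering (and scaling)
    the singular vectors coincide with the principal component directions.
    """
    if center:
        X = X - X.mean(axis=0)
    if normalise:
        X = X / (X.std(axis=0) + 1e-8)       # guard against constant columns
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                      # two reduced features per recipe
```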
3.2.2 Autoencoders
An autoencoder can learn to perform singular value decomposition when linear acti-
vation functions and a least squares cost function are chosen. When such a model is
fully trained, the bottleneck neurons of the autoencoder should have the same values
as the reduced dimensions of SVD. In Figure 3.6 the two features of SVD are shown
as well as the two bottleneck neurons of an autoencoder that has learned to perform
SVD. The autoencoder manages to reproduce the results of SVD almost perfectly,
aside from some scaling factors.
Figure 3.6: Two-dimensional features of SVD (left) versus the two bottleneck neurons of a linear autoencoder with a least squares cost function (right).
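One way to verify this agreement numerically (a sketch, assuming arrays of bottleneck values and SVD features are available) is to fit a linear map from the SVD features to the bottleneck values; an R² close to one means both methods found the same two-dimensional subspace, up to scaling and rotation.

```python
import numpy as np

def subspace_agreement(bottleneck, svd_features):
    """R^2 of a linear map from the SVD features to the bottleneck values."""
    A, _, _, _ = np.linalg.lstsq(svd_features, bottleneck, rcond=None)
    pred = svd_features @ A
    ss_res = np.sum((bottleneck - pred) ** 2)
    ss_tot = np.sum((bottleneck - bottleneck.mean(axis=0)) ** 2)
    return 1 - ss_res / ss_tot
```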
Unlike SVD, autoencoders do not need centering and normalisation of the data to
work well. This is because the autoencoder can simply add a bias term to the layers
of its network. This bias term will model the frequencies of the ingredients and the
number of ingredients per recipe. The bottleneck neurons are then free to represent
more meaningful features of the data. Centering and normalisation of the data are in
fact undesirable for the autoencoder: the values of the input variables would no longer
be fixed at zero or one, so the sigmoid activation function with cross entropy cost
function could not be used for the output neurons. This would make the network
harder to train and would destroy the interpretation of the input and output values
as an ingredient being present or absent in the recipe.
Figure 3.7: Biplots of the bottleneck neurons of non-linear autoencoders with one hidden layer (left) and three hidden layers (right).
In a complex dataset with a lot of observations, autoencoders can potentially perform
better than SVD by modelling more structure of the data. This is done by adding
more layers and using non-linear activation functions. In Figure 3.7 the bottleneck
neurons of non-linear autoencoders with one and three hidden layers are shown.
The autoencoder with one hidden layer has a cross entropy loss of 0.062, better than
the 0.066 of SVD. This model preserves the structure better than the first three
plots in Figure 3.5, as a result of the added bias term. PCA with normalisation still
seems to outperform this autoencoder, however. Adding an extra hidden layer on both
sides of the bottleneck layer gives the autoencoder shown on the right side of the
figure. This autoencoder has a cross entropy loss of 0.055, much lower than both SVD
and the autoencoder with one hidden layer. It also seems to maintain the structure
of the data better than all the other models.
Figure 3.8: Biplots of the bottleneck neurons of autoencoders with five hidden layers. For the left plot a linear activation function was used to the bottleneck layer, while for the right plot a sigmoid activation function was used.
Adding two more hidden layers improved the performance even further. In Figure 3.8
the bottleneck neurons of autoencoders with five hidden layers are shown. For the left
plot a linear activation function was used to the bottleneck layer, which resulted in
a cross entropy loss of 0.051. For the right plot a sigmoid function was used, which
resulted in a cross entropy loss of 0.048. Both models improved the loss considerably
compared to the models with fewer layers. Models with rectifier activation functions
to the bottleneck layer were also tried out, but they were visually not as satisfying.
Adding more layers was also explored, but this did not improve the results by much
compared to five hidden layers.
3.3 Prediction of the regions
In Figure 3.9 the features of the PCA model with normalisation are shown for the
training and test data, as well as the KNN and QDA predictions.
Figure 3.9: Biplots of the features of PCA with normalisation: (a) training data, (b) test data, (c) KNN predictions, (d) QDA predictions.
The North American region takes up 73.4% of the dataset and contains recipes similar
to those from all over the world. These recipes would dominate the region prediction
of all the recipes, which would not be very useful. For this reason, the North
American recipes are removed for the prediction part, but not for the training of the
models. On the reduced features of PCA with normalisation, the KNN classifier has
a prediction accuracy of 55.4%, while the QDA classifier has a prediction accuracy of
55.2%. A sketch of this prediction setup is given below.
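In this sketch with scikit-learn, the number of neighbours is illustrative; the reduced features and region labels, with the North American recipes removed, are assumed to be available:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def region_accuracies(X_train, y_train, X_test, y_test, k=15):
    """Prediction accuracy of KNN and QDA on the reduced recipe features."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
    return knn.score(X_test, y_test), qda.score(X_test, y_test)
```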
Figure 3.10: The datasets for region prediction on the two bottleneck neurons of the deep autoencoder model with a linear activation function to the bottleneck layer: (a) training data, (b) test data, (c) KNN predictions, (d) QDA predictions.
In Figure 3.10 the features of a deep autoencoder with a linear activation function to
the bottleneck layer are shown for the training and test data, as well as the KNN and
QDA predictions. The KNN classifier has a prediction accuracy of 65.0%, while the
QDA classifier has a prediction accuracy of 57.8%.
In Figure 3.11 the features of a deep autoencoder with a sigmoid activation function
to the bottleneck layer are shown for the training and test data, as well as the KNN
and QDA predictions.

Figure 3.11: The datasets for region prediction on the two bottleneck neurons of the deep autoencoder model with a sigmoid activation function to the bottleneck layer: (a) training data, (b) test data, (c) KNN predictions, (d) QDA predictions.

The KNN classifier has a prediction accuracy of 65.4%, while the
QDA classifier has a prediction accuracy of 58.2%. Both deep autoencoder models
outperformed the prediction accuracies of the PCA model with normalisation.
The performance on the raw dataset was also measured. KNN gave the best perfor-
mance with a prediction accuracy of 69.8%. This is very close to the KNN prediction
values of the deep autoencoder models with two bottleneck neurons, while not so
close to the KNN prediction of the PCA with normalisation. This suggests that deep
autoencoders retain much more structure of the data when reducing the dimensions.
As a side experiment, some models were tested with a higher number of bottleneck
neurons, from which a model with 100 bottleneck neurons was selected as the model
with the highest prediction accuracy on the validation dataset. The prediction accu-
racy of this model on the test data was 72.0%. This suggests that deep autoencoder
models can also be useful for representation learning, although the benefit compared
to the raw dataset was minimal.
3.4 Collaborative filtering for recipe creation
Several autoencoder architectures have been explored for collaborative filtering. It was
found that deeper models performed better. Using sigmoid activation functions instead
of rectifier activation functions also improved the performance. The author also noted
something peculiar: models that were not fully trained appeared to perform better
than models for which the gradient descent algorithm had fully converged, even though
the fully trained models were not overfitted: they had the lowest cross validation loss,
and this loss was very close to the training loss. The author sees no obvious reason
why the not fully trained models performed better. From all the different models, the
one performing best on the validation dataset was used on the test dataset.
As explained in Chapter 2, the recipes are modified by randomly either removing or
adding an ingredient as a way to measure the performance. If the autoencoder works
well for collaborative filtering, it should be able to recover the changed ingredient.
For the recipes with a removed ingredient, the ingredients not present in the adapted
recipe are ranked on how well they would fit the adapted recipe. For the recipes
with an added ingredient, the ingredients of the adapted recipe are ranked on how
badly they fit the adapted recipe. In both cases, a low rank means the autoencoder
did well in recovering the changed ingredient; a sketch of the rank computation is
given below.
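In this sketch the array names are illustrative: for a removed ingredient the candidates outside the adapted recipe are ranked by descending reconstruction value, while for an added ingredient the recipe's own ingredients are ranked by ascending value.

```python
import numpy as np

def retrieval_rank(reconstruction, recipe_mask, changed_idx, removed=True):
    """1-based rank of the changed ingredient among the relevant candidates.

    reconstruction : reconstruction values for all ingredients (network output)
    recipe_mask    : boolean array, True for ingredients in the adapted recipe
    changed_idx    : index of the removed or added ingredient
    """
    if removed:
        candidates = np.where(~recipe_mask)[0]   # ingredients not in the recipe
        order = candidates[np.argsort(-reconstruction[candidates])]  # high = fits
    else:
        candidates = np.where(recipe_mask)[0]    # ingredients in the recipe
        order = candidates[np.argsort(reconstruction[candidates])]   # low = misfit
    return int(np.where(order == changed_idx)[0][0]) + 1
```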
3.4.1 Reconstruction of the removed ingredient
In Table 3.1 an example of recipe retrieval is shown for one of the test recipes. The
original recipe contains the following ingredients: cocoa, cream cheese, eggs, milk,
wheat and vanilla. This recipe has been modified by removing vanilla. The modified
recipe has been put through the network, resulting in reconstruction values for all
ingredients. From the ingredients not included in the modified recipe, the five with the
highest reconstruction values were selected. The missing ingredient vanilla ranks second
with a reconstruction value of 55%. For this recipe, the model did very well in retrieving
the missing ingredient. Expecting a perfect retrieval is not reasonable: other ingredients
might also combine well with the adapted recipe. Indeed, the other suggestions in the
table would be good combinations with the adapted recipe.
Ingredients:         cocoa, cream cheese, eggs, milk, wheat
Suggestions to add:  cream   vanilla   butter   yeast   vegetable oil
Reconstruction %:    58      55        35       25      20

Table 3.1: The first row contains the ingredients of a recipe from which vanilla was removed. The following rows contain the top five suggestions of ingredients to add to the adapted recipe, with their reconstruction values in the last row. The removed ingredient vanilla was ranked second among the suggestions.
In Table 3.2 the performance measures for ingredient retrieval of the autoencoder are
shown and compared to the two models of De Clercq et al. [2]. The deep autoencoder
has a mean rank of 25.2 and a median rank of 8 for the removed ingredient. This
performance is very good: randomly selecting ingredients would result in ranks uniformly
distributed between one and the number of ingredients not in the recipe (there are 381
ingredients in total). Compared to the models of De Clercq et al., the autoencoder
outperforms the non-negative matrix factorisation model and comes close in performance
to the two-step kernel ridge regression model. Deep autoencoders can thus be used as
an alternative method for collaborative filtering.
Performance measure    mean rank   median rank   % with rank ≤ 10
Deep autoencoder       25.2        8             54.5
NMF                    33.0        12            48.2
Two-step KRR           23.6        7             59.1

Table 3.2: Comparing the rank of ingredient reconstruction for the deep autoencoder, non-negative matrix factorization (NMF) and two-step kernel ridge regression (two-step KRR) models.
3.4.2 Elimination of the added ingredient
In Table 3.3 an example of elimination of an added ingredient is shown for one of the
test recipes. The original recipe contains the following ingredients: milk, coffee and
cocoa. This recipe has been modified by adding mustard. The modified recipe has
been put through the network, resulting in the reconstruction values of the ingredients
shown in the table. The added ingredient mustard ranks lowest, with a reconstruction
value of 3%. The model did very well in identifying the added ingredient.
Ingredients:        mustard   cocoa   coffee   milk
Reconstruction %:   3         84      94       96

Table 3.3: A recipe for cappuccino. Mustard has been randomly added as an extra ingredient and has a very low reconstruction value, unlike the other ingredients. The model predicts mustard as the first-ranked ingredient to eliminate from the recipe.

In Table 3.4 the performance measures for ingredient elimination of the autoencoder
are shown. The model has a mean rank of 1.5 and a median rank of 1 for the added
ingredient, and eliminated the correct ingredient 78.8% of the time. Eliminating an
added ingredient is much easier than retrieving a removed ingredient, since the model
only has to pick from the ingredients that are in the adapted recipe. Nevertheless,
this performance is very good.
Performance measure    mean rank   median rank   % first rank
Deep autoencoder       1.5         1             78.8

Table 3.4: The performance measures of the elimination of a randomly added ingredient.
Chapter 4
Conclusion and discussion
4.1 Conclusion
In the thesis, it was explored how to train deep autoencoder networks on cooking
recipes using the gradient descent algorithm. Since it can be very hard to train deep
autoencoders [1], several extensions were added to improve the gradient descent algo-
rithm. Adding minibatches and momentum to the gradient descent algorithm helped
to speed up the algorithm and improved the performance. Pretraining the network
with similar, easier to train networks prevented the algorithm from getting stuck in
plateaus with a high loss.
These deep autoencoders were then compared to singular value decomposition for the
purpose of data reduction of the ingredients of recipes to two dimensions. To measure
the performance, the cross entropy loss for the reconstruction of the ingredients was
used. Singular value decomposition had a loss of 0.066, while the best deep autoencoder
performed much better with a loss of 0.048.
On the two reduced dimensions of all models, supervised machine learning was used
to predict the regions of the recipes. The supervised learning was done using two
algorithms: KNN and QDA, with KNN outperforming QDA on all models. On the SVD
model, the KNN algorithm had a prediction accuracy of 55.4%. On the deep autoen-
coder models, the KNN algorithm had a much better prediction accuracy: 65.0% for
the model with a linear activation function to the bottleneck neurons and 65.4% for
the model with a sigmoid activation function to the bottleneck neurons. Performing
KNN on the raw dataset resulted in a prediction accuracy of 69.8%, suggesting that
the two bottleneck neurons of the deep autoencoders maintained the structure of the
regions very well. Performing KNN on the reduced features of a deep autoencoder
with 100 bottleneck neurons gave a prediction accuracy of 72.0%, suggesting deep
autoencoders might have some usefulness for representation learning on the dataset.
Separate deep autoencoder models were trained for collaborative filtering with the
purpose of building recommender systems. The performance of these recommender
systems was tested by the ranks of the recommendations for the ingredients that were
randomly either removed from or added to a recipe. De Clercq et al. [2] have built
two similar recommender models on the same dataset. As can be seen in Table 4.1,
the deep autoencoder outperforms the non-negative matrix factorization and comes
close to the two-step kernel ridge regression.
Performance measure            mean rank   median rank   % first rank   % top 10
Reconstruction DAE             25.2        8             /              54.5
Reconstruction NMF             33.0        12            /              48.2
Reconstruction two-step KRR    23.6        7             /              59.1
Elimination DAE                1.5         1             78.8           /

Table 4.1: The performance measures of the reconstruction of a randomly removed ingredient for the deep autoencoder (DAE), non-negative matrix factorization (NMF) and two-step kernel ridge regression (two-step KRR) models. The elimination performance for the added ingredients is also included.
4.2 Discussion
To improve the gradient descent algorithm even further, a variant of an adaptive
learning rate could be implemented, rather than using a fixed learning rate. Also, only
a certain number of deep autoencoder architectures were explored in the thesis. All
the deep autoencoder models showed very little overfitting, since the training and validation
loss were always very close. It is possible that deep autoencoder architectures with
more parameters might fit the data better, although these types of complex models
will likely be even harder to train. The different performance measures explored in the
thesis could be further improved with those models. This could be explored in further
research.
Appendix A
Admission for circulating the work
The author, promoter and co-promoter give permission to consult this master dis-
sertation and to copy it or parts of it for personal use. Any other use falls under
the restrictions of the copyright, in particular concerning the obligation to explicitly
mention the source when using results of this master dissertation.
Bibliography
[1] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of
data with neural networks. Science, 313(5786):504–507, 2006.
[2] Marlies De Clercq, Michiel Stock, Bernard De Baets, and Willem Waegeman.
Data-driven recipe completion using machine learning methods. Trends in Food
Science & Technology, 49:1–13, 2016.
[3] Yong-Yeol Ahn, Sebastian E Ahnert, James P Bagrow, and Albert-László
Barabási. Flavor network and the principles of food pairing. Scientific Reports, 1,
2011.
[4] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George
Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershel-
vam, Marc Lanctot, et al. Mastering the game of go with deep neural networks
and tree search. Nature, 529(7587):484–489, 2016.
[5] Douglas B Lenat and Ramanathan V Guha. Building large knowledge-based sys-
tems; representation and inference in the Cyc project. Addison-Wesley Longman
Publishing Co., Inc., 1989.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016. http://www.deeplearningbook.org.
[7] Shlomo Mor-Yosef, Arnon Samueloff, Baruch Modan, Daniel Navot, and Joseph G
Schenker. Ranking the risk factors for cesarean: logistic regression analysis of a
nationwide study. Obstetrics & Gynecology, 75(6):944–947, 1990.
[8] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional
networks. In European conference on computer vision, pages 818–833. Springer,
2014.
[9] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[10] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy
layer-wise training of deep networks. Advances in neural information processing
systems, 19:153, 2007.
[11] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using
context-dependent deep neural networks. In Interspeech, pages 437–440, 2011.
[12] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from
scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International
Journal of Approximate Reasoning, 50(7):969–978, 2009.
[14] Andrew Ng. Machine learning course on Coursera. https://www.coursera.
org/learn/machine-learning. Accessed: 2017-01-23.
[15] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural
nets and problem solutions. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems, 6(02):107–116, 1998.
[16] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pas-
cal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep
learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet classification. In
Proceedings of the IEEE international conference on computer vision, pages 1026–
1034, 2015.
[18] D Randall Wilson and Tony R Martinez. The general inefficiency of batch training
for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.
[19] Boris T Polyak. Some methods of speeding up the convergence of iteration
methods. USSR Computational Mathematics and Mathematical Physics, 4(5):
1–17, 1964.
[20] Divergence of the gradient descent algorithm with a too large learning rate on
a parabolic loss function. http://www.cs.cornell.edu/courses/cs4780/
2015fa/web/lecturenotes/lecturenote07.html. Accessed: 2017-01-23.
[21] The effect of the different learning rates on the convergence of
the loss function with the gradient descent algorithm. https:
//leonardoaraujosantos.gitbooks.io/artificial-inteligence/
content/more_images/learningrates.jpeg. Accessed: 2017-01-23.
[22] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pas-
canu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Ben-
gio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in
Science Conf, pages 1–7, 2010.
[23] Martin Fodslette Møller. A scaled conjugate gradient algorithm for fast supervised
learning. Neural networks, 6(4):525–533, 1993.
[24] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E
Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to
handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.