Ghent University
Master Thesis
Exploration of deep autoencoders on cooking recipes
Author:
Lander Bodyn
Tutor:
Ir. Michiel Stock
Promoter:
Prof. Dr. Christophe Ley
Co-promoter:
Prof. Dr. Willem Waegeman
A thesis submitted in partial fulfilment of the requirements
for the degree of Master of Science in Computational Statistics
Department of Applied Mathematics, Computer Science and Statistics
January 2017
GHENT UNIVERSITY
Abstract
Master of Science in Computational Statistics
Exploration of deep autoencoders on cooking recipes
by Lander Bodyn
Deep autoencoders are a form of deep neural networks that can be used to reduce the
dimensionality of datasets. These deep autoencoder networks can sometimes be very
hard to train [1]. The gradient descent algorithm has been explored to train deep au-
toencoders on a dataset of cooking recipes. Minibatches, momentum and pretraining
were added as extensions to improve the gradient descent algorithm. The performance of the deep autoencoders for data reduction to two dimensions was compared to singular value decomposition. The best deep autoencoder model obtained a cross entropy loss of 0.048, much lower than the cross entropy loss of 0.066 for singular value decomposition. From the two reduced dimensions, the regions of the recipes were predicted using
the KNN and QDA algorithms. For the deep autoencoder models, the best predic-
tion accuracy was 65.4%, outperforming the best prediction accuracy of singular value
decomposition, 55.4%. The best prediction accuracy of the raw dataset was 69.8%,
suggesting that the deep autoencoders maintain the structure of the regions very well
in two dimensions. Using a deep autoencoder with data reduction to 100 dimensions,
the prediction accuracy was 72.0%, suggesting deep autoencoders might have some
usefulness for representation learning on this dataset. Dimensionality reduction tech-
niques can also be used as recommender systems, using collaborative filtering. Deep
autoencoder models were optimized to have the best retrieval rank of an ingredient
that was either removed from or added to an existing recipe. De Clercq et al. [2] have
built two similar recommender models on the same dataset: a non-negative matrix
factorization and a two-step kernel ridge regression model. The deep autoencoder
(mean rank = 25.2) outperforms the non-negative matrix factorization (mean rank =
33.0) and comes close in performance to the two-step kernel ridge regression (mean
rank = 23.6).
Acknowledgements
I would first like to thank my promoter Prof. Dr. Christophe Ley from the Faculty of
Sciences at Ghent University. He instantly accepted my plan to do a thesis in deep
learning and helped me by proofreading many parts of the thesis, suggesting which
parts I should explain more clearly. While my promoter is not accustomed to the field of deep learning, he helped me find a co-promoter who could guide me in the practical
parts of the thesis.
This brought me to my co-promoter Prof. Dr. Willem Waegeman and supervisor Ir.
Michiel Stock from the Faculty of Bioengineering at Ghent University. I would like to
thank both for coming up with a very interesting thesis subject, proof-reading several
parts of the thesis and continually guiding me during the thesis.
I also want to thank my friend Giancarlo Kerg, who inspired me to start my master
in Computational Statistics, as a foundation to move towards the field of machine
learning, and deep learning in particular.
I also want to thank the company Yazzoom, where I could do an internship in deep
learning during my thesis. Some of the skills I learned at Yazzoom helped me to make
progress in my thesis.
Finally, I want to thank my family, who have supported me throughout the whole
process. Special thanks to my grandma, who made my lunch and dinner every day,
and my parents, who endured all the fluctuations in my mood during the writing of
my thesis.
Contents

Abstract
Acknowledgements
Contents

1 Introduction
  1.1 Theoretical background
  1.2 Overview of the thesis
  1.3 The cooking recipes dataset

2 Methods
  2.1 From artificial intelligence to deep learning
    2.1.1 Machine learning
    2.1.2 Representation learning
    2.1.3 Deep learning
  2.2 Deep autoencoders
    2.2.1 Network architecture
    2.2.2 Singular value decomposition for dimension reduction
    2.2.3 Deep autoencoders for dimension reduction
    2.2.4 Deep autoencoders for representation learning
    2.2.5 Deep autoencoders for collaborative filtering
  2.3 Training the network with gradient descent
    2.3.1 Local minima
    2.3.2 The vanishing gradient problem
    2.3.3 Initialisation of the network parameters
    2.3.4 Minibatch gradient descent
    2.3.5 Momentum
  2.4 Optimisation of the hyperparameters
    2.4.1 The gradient descent hyperparameters
      2.4.1.1 Learning rate δ
      2.4.1.2 Batchsize
      2.4.1.3 Inertia α
      2.4.1.4 Initialisation range
    2.4.2 The network architecture
  2.5 Python and the Theano package
    2.5.1 Backward propagation of the gradient
    2.5.2 Other packages

3 Results
  3.1 Training the autoencoders
    3.1.1 Adding extensions to the gradient descent algorithm
    3.1.2 Plateaus
  3.2 Comparing data reduction methods
    3.2.1 Singular value decomposition
    3.2.2 Autoencoders
  3.3 Prediction of the regions
  3.4 Collaborative filtering for recipe creation
    3.4.1 Reconstruction of the removed ingredient
    3.4.2 Elimination of the added ingredient

4 Conclusion and discussion
  4.1 Conclusion
  4.2 Discussion

A Admission for circulating the work

Bibliography
Chapter 1
Introduction
1.1 Theoretical background
An autoencoder is a type of artificial neural network. When a neural network has
several hidden layers, the network is called a deep network. The gradient descent
algorithm is currently the dominant way of training neural networks. It can however
sometimes be difficult to train neural networks using the gradient descent algorithm;
this is especially true for deep autoencoders [1].
Autoencoders are designed to reduce the dimensionality of the dataset while minimizing
a reconstruction error. They can be seen as a non-linear extension of the linear data
reduction method singular value decomposition. Data reduction methods are useful to
obtain visualisations of the data in two or three dimensions. Dimensionality reduction
also has other applications. For example, the reduced features can be more suitable
for a machine learning task than the original features.
With the increased availability of datasets of cooking recipes online, machine learning
is starting to play a prominent role in tasks such as food preference modelling. Having
an algorithm that could combine leftover ingredients to create a good recipe would
be a useful application. De Clercq et al. [2] built two such recommender systems on
a dataset containing the ingredients of recipes. For the recommender systems, the
authors used a non-negative matrix factorization model and a two-step kernel ridge
regression model. Deep autoencoders can also be used as a recommender system: in
order to reduce the ingredients of the recipes, meaningful features of the recipes will
have to be learned. A selection of ingredients can be reconstructed by the autoencoder,
after which the selection will resemble the recipes from which the autoencoder has
learned its parameters.
1.2 Overview of the thesis
This thesis explores how deep autoencoders can be optimally trained with the
gradient descent algorithm on a dataset of cooking recipes. To speed up the gradient
descent algorithm and improve its performance, two extensions were added to the
algorithm: minibatches and momentum. Aside from improving the gradient descent
algorithm itself, pretraining of the network parameters was implemented as another
tool to facilitate convergence to a good solution.
The deep autoencoder models were compared to singular value decomposition (SVD)
for the purpose of data reduction. The performance of the models was measured using
a reconstruction error. It was also examined how well both methods maintained the
structure of the data, by visually checking if recipes with similar regions of origin lay
close together on the biplots of the reduced features. The regions of the recipes were
then predicted, using the KNN and QDA algorithms on the reduced features. These predictions might even be better than those of models using the original dataset:
data reduction algorithms can possibly make the data more suitable for the prediction
task. The thesis also explored the use of deep autoencoder models as recommender
systems. The same dataset and performance measures of De Clercq et al. were used, to
enable comparison with their recommender models [2].
1.3 The cooking recipes dataset
The data for the thesis was obtained from Ahn et al. [3]. Recipes with fewer than three
ingredients were removed as was done for the recommender systems of De Clercq et
al. [2], in order to enable comparison with the autoencoder recommender system. The
cleaned dataset contains 55001 different recipes using 381 ingredients. Each recipe
was represented by a binary vector: ones denote the presence of ingredients, zeros
denote the absence of ingredients. As each recipe contains only a small selection of
the ingredients, the data-matrix is a sparse matrix with a filling degree of 2.16%.
The region of origin of the recipes is included in the dataset. In total there are eleven
regions. Recipes of North American origin are the largest category, taking up 73.4% of the recipes. Due to the short culinary history of North America, most recipes
of North American origin are imported versions of recipes from all over the world.
Because of this, the North American recipes will be removed for the prediction of the
region of origin but not for the training of the autoencoder.
Of the 55001 recipes, 2500 will be set aside for validation and 2500 for testing. The
remaining recipes are used for training. The validation set will be used to determine the
convergence of the gradient descent algorithm as well as to select the optimal model
for the collaborative filtering. The test set will be used to test the performance of the
collaborative filtering. For the prediction of the origin, the dimensions of the whole
dataset (without the North American recipes) will be reduced with the autoencoder.
This dataset will then be split into 70% training data and 30% test data for the supervised
machine learning algorithms. There is no problem in using the training and validation
data of the autoencoder for the supervised machine learning problem: the autoencoder
is an unsupervised algorithm that does not require the values of the regions for training.
Chapter 2
Methods
2.1 From artificial intelligence to deep learning
Artificial intelligence (AI) is the field of making computers intelligent. Since pro-
grammable computers were first conceived, people have been wondering whether such
machines might become intelligent. In the early days of artificial intelligence, rapid
progress was made in logical tasks that can be easily defined with mathematical rules.
Since humans are typically not very good at these tasks, it did not take very long
before the computer started to outperform humans at those tasks. One such task is the logical board game chess, in which the IBM supercomputer Deep Blue defeated the world champion Garry Kasparov in 1997.
Ironically, many tasks that seem trivial for humans, like processing visual and auditory
information, are very hard for a computer to solve. The real challenge in artificial
intelligence turned out to be solving these kinds of intuitive problems. It is only
recently that some of these problems have been solved: for example, in March 2016 the AlphaGo program of DeepMind managed to defeat Lee Sedol, the world champion of Go [4]. Go is a board game similar to chess, but with far more board positions. Being a good player at Go requires a lot of spatial and intuitive thinking, a skill that
is very hard to write down in logical rules.
There have been many approaches to solving the challenges in artificial intelligence.
One of them is the knowledge-based approach. The knowledge-based approach tries
to make computers intelligent by hard coding different knowledge rules by hand. An
example of this is the Cyc project, which aims to enable AI applications of
human-like reasoning [5]. These efforts, however, have not been very fruitful.
It turned out to be very hard to compose logical rules that capture all of the complexity
of human reasoning. In Figure 2.1 the relation between AI and several of its subfields is shown. These subfields will be discussed in the following subsections.
Figure 2.1: A Venn diagram explaining the relation between artificial intelligence, machine learning, representation learning and deep learning. For each subfield an exclusive example is given [6].
2.1.1 Machine learning
Machine learning was invented as a different approach to artificial intelligence. Instead
of trying to hard-code everything, the programmer will define algorithms by which
the computer can extract its own logical rules from given data. Nowadays, machine
learning is used everywhere and has many applications. For example, logistic regression
can determine whether to recommend cesarean delivery [7]. Within machine learning,
there are two main categories of algorithms: supervised and unsupervised algorithms.
With supervised algorithms, the goal is to make a prediction about a desired outcome
variable, given some input variables (features). The algorithm will do this by modelling
the factors in the data that are responsible for variation in the outcome variable. In
order to learn the optimal parameters of the model, the algorithm will need some
training data to learn from. The parameters will be learned to give the best predictions
of the outcome variables. Since the training dataset is only a sample of the true data
generating process, it will contain some random fluctuations. If there is not enough
data or if the model is too complex (has a lot of parameters), it is possible that the
algorithm will model some of these random fluctuations. This is called overfitting and
is not desired: modelling the random fluctuations in the training set will not generalise
to new data. In order to obtain a real measure of the performance of the algorithm, the
model has to be tested on a separate dataset, called the test data. The difference in
performance between the train and test dataset can help to decide on the complexity
of the model to prevent overfitting. Supervised machine learning algorithms are often
split into two types, depending on the outcome variable. If the desired outcome value
of a supervised algorithm is discrete, one will speak of a classification problem. For a
continuous outcome variable, the problem is called a regression problem.
Unsupervised algorithms do not have a desired outcome variable. Instead, these
types of algorithms try to find structure in the data. Unsupervised algorithms will for
example try to find clusters in the data or try to reduce the number of dimensions in
the data set.
In all machine learning algorithms, the performance of the algorithms relies heavily on
the construction of relevant features. Raw data, like the pixels of a picture, might not
contain much correlation with the desired outcome. It is only by designing intelligent
features that such algorithms gain a lot of power. However, the creation of these
features can be very complex and time consuming for the programmer.
2.1.2 Representation learning
In machine learning, the field of representation learning will not only use algorithms to
learn a desired outcome from some hand-crafted features, but will also use algorithms
to learn preferable features for the given task. For this purpose unsupervised learning
algorithms like singular value decomposition and shallow autoencoders can be used.
These algorithms will reduce high dimensional data to meaningful features, after which
these features can be used for a supervised machine learning task.
The thesis will explore the use of deep autoencoders. While deep autoencoders are a
type of unsupervised algorithm that can be used for representation learning, they are
also a type of deep learning algorithm. Shallow autoencoders on the other hand can
be used for representation learning but are not a part of the deep learning algorithms,
as shown in the Venn diagram of Figure 2.1. The difference between shallow and deep
algorithms will be explained in the next subsection.
2.1.3 Deep learning
In representation learning, the computer will learn relevant features from the data and
use these features to predict the outcome. In deep learning, several layers of features
will be stacked on top of each other, in order to create much more complex features,
which can then be used to predict the outcome. The term ‘deep’ refers to the depth
of the layers of features that are built upon each other. In certain complex artificial
intelligence tasks, this approach can be very powerful.
In Figure 2.2 an example is shown of deep learning applied to object recognition in
images. The pixels of the image are given as the input layer. On top of the input layer,
several hidden layers are built from which eventually the type of object is predicted.
These layers are called hidden because they are not given as input or used as output,
instead they will be constructed by the algorithm itself. In the figure one can really see
what the computer is trying to learn: in the first hidden layer it will try to recognise
relevant low level objects like edges and color gradients. In the second hidden layer it
will use those objects to construct shapes like corners and contours. In the next layer
those shapes will be used to construct whole object parts. Finally, in the output layer
the object identity will be predicted from the object parts.
Another name often given to deep learning is artificial neural networks. This name
originates from some of the first implementations of deep learning algorithms in the
1940s. Back then, researchers were using these types of algorithms in neuroscience
as computational models to learn how our own brain works. The researchers were
in fact trying to simulate on the computer the algorithm that the human brain uses to learn. From an artificial intelligence perspective, it also makes sense to study
these models, since we know they can produce intelligence in humans. Nowadays
neuroscience has a diminishing influence on progress in deep learning research.
A lot of the terminology of neuroscience models still exists today however, like the
word ‘neuron’ for the features in the different layers.
Figure 2.2: Visualisation of a convolutional neural network, each layer building on the features of the previous layer [8].
Although the field of deep learning has been around for a long time, it has only recently
become very popular. Progress in computing power, together with big amounts of data,
has made it possible for deep learning algorithms to outperform other simpler machine
learning algorithms on several AI tasks:
• On the MNIST digit image classification problem, deep learning managed to
break the supremacy of support vector machines [9] [10].
• Microsoft’s 2012 version of their audio and video indexing speech system (MAVIS)
based on deep learning managed to reduce the word error rate by about 30%
compared to state-of-the-art models based on Gaussian mixtures [11].
• In natural language processing, the SENNA software which has applications in
tasks such as language modelling, semantic role labelling and syntactic parsing,
approaches or surpasses the state-of-the-art on these tasks and is simpler and
much faster than traditional predictors [12].
2.2 Deep autoencoders
In the thesis, deep autoencoders are explored on a dataset of cooking recipes. Autoencoders are a
type of unsupervised machine learning algorithm: instead of trying to predict a certain
outcome, autoencoders will try to reconstruct their own inputs. If hidden layers have
more neurons than the input layer, the autoencoder can potentially learn the identity
function. Such a reconstruction is not very useful. However, if the network has at
least one hidden layer with a number of neurons lower than the input layer, the use of
an autoencoder becomes more interesting. In this case, the network will have to learn
a compact description of the data in such a way as to retain as much information as
possible, despite the reduced number of dimensions. Figure 2.3 shows a visualisation of
the structure of a deep autoencoder network. The hidden layer with the least number
of neurons is called the bottleneck hidden layer. The structure of an autoencoder will
often be symmetric, with the bottleneck hidden layer in the middle. Although this is
not a strict rule, the reason behind this is very intuitive: if it requires a certain amount
of complexity (number of layers) to encode the inputs to the bottleneck layer, the
decoding back to the output layer would likely require a similar amount of complexity.
Figure 2.3: A visualisation of a deep autoencoder with a central bottleneck hidden layer. The output layer of an autoencoder tries to reconstruct its input layer.
2.2.1 Network architecture
In this subsection the elements of the architecture of an autoencoder network will be
discussed. As mentioned before, each layer is constructed from the previous layer. The
exact definition of this construction can be found in the equation:
y_k = f( ∑_i w_{ki} x_i + b_k ).    (2.1)
Each neuron y_k of the next layer can be defined as a function f over the weighted sum of the neurons of the previous layer. The weighted sum can also contain a bias term b_k. This is equivalent to saying that the previous layer has an extra neuron x_0 with a value fixed to one. Note that for each neuron in the next layer, a different set of weights w_{ki} is used, in order for each neuron to learn a different feature. If we have m
neurons in a layer and n neurons in the next layer, the number of network parameters
between the two layers will thus be given by mn or (m + 1)n when a bias term is
added. The function f is called the activation function. Some typical examples of
activation functions are:
• The linear function: f(x) = x;
• The sigmoid function: f(x) = 1 / (1 + e^{−x});
• The rectifier function: f(x) = max(x, 0).
The activation function of a neural network is often very simple. The complexity is
not generated by using very complex functions, but rather by combining several simple
functions to build layers of features on top of each other, each layer increasing in
complexity. There are however important differences between the activation functions.
The linear activation function is a special case among the activation functions of neural
networks. One of the properties of a linear function is that the combination of two
linear functions is again a linear function. In a neural network, this means that the
addition of an extra linear layer to another linear layer will be equivalent to just one
linear layer. A network with only linear activation functions will thus not be able to
learn complex (non-linear) features. The whole concept of deep learning would not
work here, so other activation functions will have to be added to the network.
Figure 2.4: A plot of the different activation functions (linear, sigmoid, rectifier) and their derivatives.
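As a minimal sketch (assuming NumPy; the thesis itself used the Theano package, see Section 2.5), the three activation functions and the layer construction of Equation 2.1 could be written as:

import numpy as np

def linear(x):
    return x

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    return np.maximum(x, 0.0)

def layer(x, W, b, f):
    # Equation 2.1: y_k = f(sum_i w_ki x_i + b_k), vectorised over all k
    return f(W @ x + b)

Here W is an n × m weight matrix and b a length-n bias vector, so layer(x, W, b, sigmoid) maps the m neurons of one layer to the n neurons of the next.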
The sigmoid function is another function often used in neural networks. It is a non-
linear function and can thus be used to build a deep network. It also has the nice
property of being a monotonically increasing function that maps (−∞, ∞) to (0, 1)
as can be seen in Figure 2.4. In the data set of the thesis, the inclusion of different
ingredients in the recipes is coded as 0/1. We could use this knowledge about the data
to build an appropriate architecture for the autoencoder. By using a sigmoid activation
function for the output layer, the values of the output will be restricted to the interval
(0, 1). This way, the autoencoder will have a much easier time reconstructing its input
values.
The rectifier function is another non-linear activation function and can thus be used to
build complex features, just like the sigmoid function. But unlike the sigmoid function,
the rectifier function has a property that is very useful to train neural networks: the
derivative of the function is non-vanishing for a large region of the parameter space
(all values x > 0). This is not the case for the derivative of the sigmoid function,
which is only significantly greater than zero for parameter values close to zero. This
will be very useful to prevent the vanishing gradient problem, which will be discussed
in the next section.
The output of the network as function of the input layer is then defined as the chaining
of the activation functions of the different layers. On the output layer a loss function
will be defined. This loss function is used to optimize the network for the task at hand.
The loss function J(θ) of a general neural network is given by:
J(θ) = (1/n) ∑_{i=1}^{n} L( g(x^{(i)}; θ), y^{(i)} ).    (2.2)
In an autoencoder, the network will try to reconstruct its own inputs: x^{(i)} = y^{(i)}. In
this equation, g is the chaining of the activation functions over the different layers.
The function L is called the cost function. The cost function determines how the
deviations from the target values are penalized. The two cost functions explored in the thesis
are:
• The least squares function: L(ŷ, y) = ∑_{j=1}^{p} (ŷ_j − y_j)²
• The cross entropy function: L(ŷ, y) = −∑_{j=1}^{p} [ y_j log ŷ_j + (1 − y_j) log(1 − ŷ_j) ]
The least squares function is one of the most used cost functions in machine learning; many algorithms optimize this cost function. However, when the outcome is restricted to the interval [0, 1] by the use of a sigmoid activation function, it makes much more sense to use the cross entropy. With a least squares cost function, there will not be much difference in cost between an output of 0.01 and 0.0001, while this is a big difference for the arguments of the sigmoid function. The cross entropy will penalize these differences much more heavily, giving much more information to learn from when optimizing the network. This will be important for the vanishing gradient problem, which is discussed in the next section.
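As a small numeric illustration (a sketch, not code from the thesis): for a target value of one, lowering the output from 0.01 to 0.0001 barely changes the least squares cost, but roughly doubles the cross entropy cost.

import math

def least_squares(y_hat, y):
    return (y_hat - y) ** 2

def cross_entropy(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

for y_hat in (0.01, 0.0001):
    print(least_squares(y_hat, 1.0), cross_entropy(y_hat, 1.0))
# least squares: 0.98 versus 1.00; cross entropy: about 4.6 versus about 9.2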
2.2.2 Singular value decomposition for dimension reduction
As stated above, when a bottleneck layer with a low number of neurons is introduced
in the autoencoder, the reconstruction will in general not be perfect anymore. The
autoencoder can then be used to reduce the dimensions of the data while maintaining
as much information as possible. Within machine learning, there is another set of
techniques famous for being able to reduce data: principal component analysis (PCA),
which is a version of singular value decomposition (SVD). The SVD on the data matrix
M with n rows representing the observations and p columns representing the features
is given by the equation:
M = U D V^T,    (2.3)
with D a diagonal matrix with the non-negative singular values on the diagonal, ordered
from large to small. The singular values represent the amount of variation of the data
in their corresponding direction. The matrices U and V are called the left-singular and
right-singular matrices of M respectively. The columns of the matrix V span the space
of the decomposed features. If we denote the matrix containing the first k columns of V by V_k, we can reduce our dataset by projecting the features onto the subspace spanned by V_k using the equation:
Z_k = M V_k.    (2.4)
Because the first k columns of V correspond to the largest singular values of the decomposition, the reduced features Z_k will be the features that capture the most variation in the dataset. This variation is the variation observed from the zero point. The data is reconstructed by projecting the reduced features Z_k back onto the original p-dimensional feature space using the equation:

M_k = Z_k V_k^T.    (2.5)
This reconstruction will be incomplete: the data points will now lie in a k-dimensional subspace of the p-dimensional feature space. PCA is defined as the eigenvalue decomposition of the data covariance matrix. PCA will transform the features to (linearly) uncorrelated features. The decomposed features will be ranked in decreasing order of variation. This variation is the variation observed within the data. When the data is centered, all the data will vary around zero. In this case, SVD and PCA will lead to the same principal component directions. Aside from centring, the SVD/PCA algorithms are also very sensitive to the normalisation of the data.
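As a sketch of Equations 2.3 to 2.5 with NumPy (the random binary matrix below merely stands in for the real recipe data of Section 1.3):

import numpy as np

# Stand-in for the recipe matrix: n recipes as rows, p ingredients as columns
M = np.random.binomial(1, 0.02, size=(1000, 381)).astype(float)

U, d, Vt = np.linalg.svd(M, full_matrices=False)  # Equation 2.3: M = U D V^T
k = 2
Vk = Vt[:k].T     # the first k right-singular vectors, a p x k matrix
Zk = M @ Vk       # Equation 2.4: the reduced features
Mk = Zk @ Vk.T    # Equation 2.5: the rank-k reconstruction

print(np.mean((M - Mk) ** 2))  # mean squared reconstruction error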
2.2.3 Deep autoencoders for dimension reduction
If we choose to use a linear activation function in Equation 2.1, the next layer in
our network is defined as a linear transformation of the previous layer. Equations 2.4
and 2.5 of singular value decomposition are also both linear transformations. Furthermore, the projection parameters V_k are optimized to retain the most variation possible. This is equivalent to minimizing the Frobenius norm of M − M_k for a fixed k, with
the Frobenius norm given by:
||M||_F = √( ∑_{i=1}^{n} ∑_{j=1}^{p} M_{ij}² ).    (2.6)
Minimizing this norm is the same as minimizing the loss defined by Equation 2.2 with a least squares cost function, apart from a square root and a constant factor, neither of which changes the minimum. This shows that an autoencoder
with a least squares cost function, linear activation functions, no bias terms and a
bottleneck hidden layer with k neurons will do exactly the same thing as an SVD with
k dimensions!
Singular value decomposition can thus be seen as a special case of a shallow autoen-
coder. Using autoencoders to perform dimension reduction has the benefit that it can
be generalised to multiple non-linear layers and learn deep features. Where singular
value decomposition will project the data on linear manifolds, autoencoders will be
able to extend this to curved manifolds. This extension will be especially beneficial for
complex, non-linear data.
Figure 2.5: Dimension reduction on 20×20 images of digits using 30-dimensional deep autoencoders and principal component analysis [1].
In Figure 2.5 an example of reduction on the MNIST digit data set is shown. The MNIST dataset has 70 000 images of digits with dimensions 20 × 20, thus 400 pixels in total.
The first row of the figure shows an example of each digit in the dataset. The second
and third rows show the reconstruction of each digit using a deep autoencoder and
PCA respectively. For both, the data was reduced from 400 pixels to 30 dimensions
that will contain the most important features of the digits. From these 30 dimensions,
the original 400 pixels will be reconstructed. The PCA does a reasonable job of
reconstructing the dataset, but it is not very spectacular. The autoencoder does very
well in reconstructing the original dataset. In fact, it does arguably better than the
original digits! For example, the upper loop of the digit eight has been fixed, and both ends of the digit zero have been connected together more cleanly. In order to preserve
as much information as possible while reducing the dimensions of the dataset, the
autoencoder will have to learn complex features of the data. An unclosed loop in the
digit eight is a rarity in the data set and as a result, the autoencoder did not learn this
feature when only having access to 30 dimensions. Instead, it knows the digit looks a
lot like the other eights and will reconstruct a more general eight.
Dimensionality reduction has a lot of applications. One of them is data visualisation.
When the data is reduced to two dimensions, those dimensions can be visualised on
a biplot. Latent semantic analysis (LSA) is a domain that often makes use of such
biplots. LSA is a natural language processing technique that analyses the link between
words and the documents they originate from. The reasoning behind this is that
words that are similar in meaning will be used in similar contexts. A matrix can be
constructed using a collection of the documents with the frequencies of their most
important words as features. Figure 2.6 shows the biplot of the reduction of such a
matrix to two dimensions using SVD (B) and a deep autoencoder (C). After the data
reduction, the documents are coloured by the type of their content. Compared to the
SVD biplot, the deep autoencoder contains much more structure: documents that are
similar to each other are closer together. When the cosine of the angle between two
codes was used to measure similarity, the autoencoder clearly outperformed SVD (A).
The autoencoder for LSA can also be used in another way: instead of reducing to two
continuous features, one could build an autoencoder that has 32 binary bottleneck
neurons. Each document can then be compressed to a bit sequence of length 32. As
a result, documents that are very similar in content will be very similar in the bit
sequence. Each document then has a hash given by its bit sequence. Such a hash can
be used for fast retrieval of documents with similar content. Using neural networks
to hash and retrieve documents, called semantic hashing, is much faster than other
hashing algorithms [13].
2.2.4 Deep autoencoders for representation learning
Reducing features of a dataset to more usable features for supervised machine learning
is a form of representation learning. Although data reduction will throw away some
Figure 2.6: Reduction of words using singular value decomposition (B) and deep autoencoders (C) and the document retrieval accuracy for these methods (A) [1].
information about the data, the reduced features might be more usable for the pre-
diction task. Features that are useless for the prediction task might also be reduced,
which might prevent overfitting. In the thesis, deep autoencoders will be explored to
reduce the ingredients of cooking recipes. With these reduced features, supervised
machine learning will be performed to predict the region of origin of the recipes. Two
commonly used supervised machine learning algorithms will be used for this purpose:
k-nearest neighbors and quadratic discriminant analysis.
k-nearest neighbors (KNN) is one of the simplest supervised machine learning algorithms. For each point in the parameter space, the k nearest neighbors in the training
dataset are determined. For classification, the prediction of the outcome variable for
a given observation will then be given by the majority vote of the outcome variables
of the nearest neighbors. To determine the observations nearest to a given point, a
distance measure must be defined. Very often this distance measure will just be the
Euclidean distance. The algorithm is very sensitive to how the features are scaled, since
features that are much larger than others will easily dominate the distance measure.
The features produced by dimension reduction with the SVD or autoencoder all have
the same dimensions, scaling the features before use will thus not be important. The
Figure 2.7: An example of the predictions of three supervised machine learning algorithms: LDA (left), QDA (right) and KNN (below).
value k of the KNN algorithm will be determined by a 5-fold cross-validation on the
validation data.
Linear discriminant analysis (LDA) is another common supervised machine learning
algorithm. LDA is closely related to PCA: where PCA is an unsupervised machine
learning algorithm that tries to find the directions in the data with the most variation,
LDA is a supervised algorithm that will try to find the directions in the data with
the most variation in the outcome variable. In these directions of greatest variation,
the middle points will be determined. The outcome variable will then be predicted
depending on which side of the middle points the data lies. Quadratic discriminant
analysis (QDA) is an extension of this method that does allow for the classes to have
different covariances, as will be the case for the regions of the recipes. For this reason,
QDA will be used instead of LDA.
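As a hypothetical sketch of this prediction step (the thesis does not name the implementation it used; scikit-learn is assumed here, and the random arrays merely stand in for the reduced features and region labels):

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

Z = np.random.randn(500, 2)        # stand-in for the reduced features
r = np.random.randint(0, 10, 500)  # stand-in for the region labels

Z_train, Z_test, r_train, r_test = train_test_split(Z, r, test_size=0.3)

# KNN, with the value of k chosen by 5-fold cross-validation
knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15, 50]}, cv=5)
knn.fit(Z_train, r_train)
print("KNN accuracy:", knn.score(Z_test, r_test))

qda = QuadraticDiscriminantAnalysis()
qda.fit(Z_train, r_train)
print("QDA accuracy:", qda.score(Z_test, r_test))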
2.2.5 Deep autoencoders for collaborative filtering
Data reduction methods can also be used for the purpose of collaborative filtering
to build recommender systems. An example of recommender systems can be found
in the services Netflix and Amazon, where products will be recommended that might
be interesting to the specific user. Collaborative filtering algorithms try to solve this
problem from the following perspective: using the preferences of a customer, what
other products can be recommended to that customer based on other customers with
similar preferences? Data reduction methods can here be used to learn what the preferences of the customers look like: in order to have a good reconstruction, meaningful
features will have to be extracted from the data, while unimportant features will be
thrown away. The observations will be reconstructed to match better with the other
observations. An example of this was already seen in the reconstruction of the digits
zero and eight in Figure 2.5.
Collaborative filtering can be used on the recipe dataset to create a good recipe from
a selection of ingredients that do not necessarily form a good recipe to start with.
A deep autoencoder must first be trained on a training dataset containing a lot of
recipes. The autoencoder will learn from this dataset how ingredients are combined
in the recipes. The trained network can then be used to recommend adaptations to
the selection of ingredients. These adaptations can simply be done by putting the
selection of ingredients through the network and checking the output. The output
of the autoencoder was forced to have values between zero and one by using the
sigmoid function. Therefore, if an ingredient does not match well with the rest of
the ingredients, this ingredient will have an output close to zero. On the other hand,
ingredients that were not present in the selection but would match well will
have an output more towards one.
The performance of the autoencoder can be tested on the test dataset that has been
set aside for this purpose. Each recipe in the test data will be modified by randomly
adding or removing one ingredient. Adding an ingredient is done by changing the
value in the recipe from zero to one, while removing an ingredient is done by changing
the value from one to zero. The deep autoencoder can then be used to determine
which ingredient has been modified by examining the reconstruction of the ingredi-
ents. If an ingredient was removed, the ingredients that were not included in the
modified recipe were ranked in decreasing order on their reconstruction values. If an
ingredient was added, the ingredients that were included in the modified recipe were
ranked in increasing order. In both cases, the rank of the changed ingredient will be
determined. A low rank implies that the recommender system did well in finding back
which ingredient was changed.
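A sketch of how the rank of the changed ingredient can be computed from the autoencoder output (a hypothetical helper, not code from the thesis):

import numpy as np

def rank_of_change(modified, reconstruction, changed, removed=True):
    # modified: binary vector of the modified recipe;
    # reconstruction: autoencoder output in [0, 1];
    # changed: index of the ingredient that was removed or added.
    if removed:
        # Candidates: ingredients absent from the modified recipe,
        # ranked in decreasing order of their reconstruction value.
        candidates = np.where(modified == 0)[0]
        order = candidates[np.argsort(-reconstruction[candidates])]
    else:
        # Candidates: ingredients present in the modified recipe,
        # ranked in increasing order of their reconstruction value.
        candidates = np.where(modified == 1)[0]
        order = candidates[np.argsort(reconstruction[candidates])]
    return int(np.where(order == changed)[0][0]) + 1  # ranks start at 1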
From the ranks of the changed ingredients, several performance measures can be
extracted. For the ranks of the removed ingredients, the mean rank, median rank and
the percentage of recipes with a reconstruction ranking in the top 10 have been used
as performance measures. These performance measures are the same measures that
have been used in De Clercq et al. [2], to enable comparison with their recommender
systems. For the ranks of the added ingredients, the mean rank, median rank and the
percentage of ingredients on the first rank are used.
2.3 Training the network with gradient descent
In the previous section, the different elements of the network architecture have been
explained. In this section, the optimization of the network parameters will be discussed.
Unlike some other machine learning algorithms, it is generally not possible to find an
analytical solution for the parameters of a neural network. However, the network can be
optimized using the gradient descent algorithm. The gradient descent algorithm starts
from a certain initialisation of the parameters. From this start point, the direction of
steepest descent of the loss function will be determined. This direction will be given
by the opposite of the gradient of the loss function. A small step will be taken in
the direction of the steepest descent. If this step is small enough, the loss should
be smaller for this new set of parameters. This procedure will be repeated until the
algorithm stops improving the loss, around the global minimum if all goes well. The
exact equation of how to update the network parameters is given by:
θ ← θ − δ ∇_θ J(θ).    (2.7)
The gradient symbol ∇_θ represents the vector of partial derivatives of the loss function with respect to the parameters θ. This equation also introduces a new parameter δ, which is called the
learning rate. The learning rate determines the size of the steps that will be taken
in the direction of steepest descent. The gradient, which determines the direction of
steepest descent, is also responsible for the size of the steps. This will also help to
converge to a global minimum: if the loss function behaves well enough (has a con-
tinuous derivative) around the convergence point, the gradient will diminish around
this point, slowing down the gradient descent algorithm. Figure 2.8 shows how the
gradient descent algorithm can converge from a certain initial position to the global
minimum.
Figure 2.8: A visualisation of the gradient descent algorithm on a loss function J with two parameters θ_1 and θ_2 [14].
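A minimal sketch of the update rule of Equation 2.7 (the quadratic loss below is only an illustration):

import numpy as np

def gradient_descent(grad, theta0, delta=0.04, n_steps=500):
    # Equation 2.7: theta <- theta - delta * grad(theta)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - delta * grad(theta)
    return theta

# Example: J(theta) = theta_1^2 + 10 * theta_2^2, so grad = (2 theta_1, 20 theta_2)
grad = lambda t: np.array([2 * t[0], 20 * t[1]])
print(gradient_descent(grad, [3.0, 2.0]))  # converges towards (0, 0)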
2.3.1 Local minima
When using the gradient descent algorithm to find the optimal solution of a neural
network, there are a number of potential problems that can arise. When the activation
functions of the network are non-linear, the loss function will in general be non-convex.
This means that there will be several local minima in the loss function. It is very possible
for the gradient descent algorithm to converge to one of the local minima instead of
the global minimum by starting from a different position, as depicted in Figure 2.9.
In practice, most local minima do not play a big role in the training and application
of neural networks. These local minima are only abundant in regions of the parameter
space which have a loss close to the loss of the global minimum. For practical appli-
cations, it does not matter if the solution is a local or global minimum, as long as the
loss is close enough to the global minimum.
Figure 2.9: A different initial position can lead to convergence to a local minimum instead of the global minimum [14].
2.3.2 The vanishing gradient problem
There are however some special local minima that can be detrimental to proper con-
vergence of the gradient descent algorithm. As mentioned before, the output of the
network is given by the chaining of the activation functions of the different layers. To
find the derivative of a chained function, the chain rule can be used as defined in the
equation:
(f_1(f_2(x)))′ = f_2′(x) · f_1′(f_2(x)).    (2.8)
As can be seen in the equation, the derivatives of the parameters of a certain layer
((f_1(f_2(x)))′) will depend on the values of the derivatives of the parameters of the next layer (f_1′(f_2(x))). If the derivatives of all the parameters of a certain layer have very
low values, the derivatives of all the preceding layers will also have very low values. As
a result, the network parameters will hardly update in those layers and the gradient
descent algorithm will not be able to converge to a good solution in a reasonable
amount of time.
This situation is especially likely to occur in deep autoencoders [1], for two reasons:
a high number of layers and a small number of neurons in the bottleneck layer. A lot
of layers can be problematic if the derivatives of the activation functions are smaller
than one, as is the case for the sigmoid function for example. Since the derivatives of
the parameters of a certain layer will be the product of derivatives of the activation
functions of all the next layers, the gradient for the parameters in the first layers can
become very small. This problem is known as the vanishing gradient problem [15].
The bottleneck layer of autoencoders makes the problem even worse. Due to the low
number of neurons in this layer, it is much more likely for all the derivatives of the
bottleneck neurons to become very small, preventing the gradient descent algorithm
from working properly in the layers before the bottleneck layer.
A similar situation occurs when all but one of the bottleneck neurons have gradients close
to zero. The gradient descent algorithm will still be able to do some learning through
that one neuron, but the learning will be limited. It is possible that the learning through
this one neuron is not enough to start activating the other neurons during the process.
There are several solutions for these problems. One of them is using pretraining to find
good initial values of the parameters, after which the gradient descent algorithm will
easily converge to a good value [16]. This pretraining is usually done layer by layer,
using restricted Boltzmann machines, an unsupervised deep learning algorithm. In the
thesis, pretraining of difficult network architectures was done using the neuron values
of trained networks with the same number of layers and neurons, but with easier-to-train activation functions. Although the author has no obvious explanation for why this worked, it was a solution that worked for those network architectures where the normal
initialisation procedure (discussed in the next subsection) failed.
Another solution to the problem of plateaus and the vanishing gradient problem is the
use of appropriate activation functions. The rectifier function derivatives are either zero
or one, while the sigmoid function only has derivatives smaller than one. The gradients
of deep neural networks using rectifier activation functions will as a result not diminish
towards zero unlike the sigmoid activation functions, preventing the vanishing gradient
problem that occurs with many layers of sigmoid functions. The rectifier does have
another problem however: having a zero derivative for all negative input values. For
autoencoders, this can make it likely to have a bottleneck layer where all neurons have
a gradient of zero. To solve this, some modifications of the rectifier function which
have a small but non-zero gradient for the negative input values have been invented [17].
This problem only occurred for the activation function to the bottleneck layer: using
a linear or sigmoid activation function for this layer prevented the problem. For the
architectures with a rectifier activation function to the bottleneck layer, choosing a
proper initialisation size of the parameters helped to make proper convergence much
more likely.
2.3.3 Initialisation of the network parameters
One could naively initialise all the parameters to the same value; however, this approach
would not work: if all the parameters have the same value, their gradients will also
have the same value. After every gradient descent step, they will still have the same
value. The gradient descent algorithm will therefore not be able to work properly.
To solve this problem, the parameters of the network can be initialised with random
values.
The size of the range of these random values is important. If the range of the random
values is too large or too small, the gradient descent algorithm will have to do a lot of
work to adjust the parameters to their appropriate sizes. This will take a lot of time to
compute and the algorithm is more likely to get stuck in one of the plateaus discussed
in the previous subsection.
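A minimal sketch of such a random initialisation (assuming NumPy; the initialisation range is one of the hyperparameters discussed in Section 2.4):

import numpy as np

def init_weights(m, n, scale=0.1):
    # Small random values in [-scale, scale] break the symmetry between
    # the n neurons of the next layer (m neurons in the previous layer).
    return np.random.uniform(-scale, scale, size=(n, m))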
2.3.4 Minibatch gradient descent
Although the gradient descent algorithm generally works well to converge to a good
solution, it can often take a very long time. One way to improve the algorithm
is to use minibatch gradient descent instead of deterministic gradient descent. In
deterministic gradient descent, the gradient is calculated over the whole dataset, before the
network parameters are updated with this gradient. This is often not very efficient:
in large datasets, the calculation of the gradient can take a very long time. However,
the accuracy of the gradient estimate only increases with a factor √n, with
n being the number of training samples in the dataset. There is also often a lot of
redundancy in the dataset: different data samples give very similar contributions to
the gradient.
Minibatch gradient descent will calculate the gradient only over a small batch of
training samples, typically 10 to 100 samples, before doing a gradient descent step.
For each iteration through the dataset, the order of the training samples will be randomized and the data split into batches. In order to compensate for the reduced accuracy of the gradient, the learning
rate will have to be reduced. After the gradient descent step, the gradient will be
calculated over the next minibatch, and so on. The gradients calculated over the
minibatches are good enough for the gradient descent algorithm to work, but will
be calculated much faster, allowing the gradient descent algorithm to converge in a
fraction of the time it would take with deterministic gradient descent. In minibatch
gradient descent with a very large training dataset, it is possible for the algorithm to
converge before the end of the dataset is even reached! In general, several epochs
through the dataset will be needed to converge to the best value.
Minibatch gradient descent also has another benefit. The randomness introduced
by the minibatches will have a regularising effect, lowering the degree of overfitting,
making the network generalise better [18]. This regularising effect is the strongest for
batches of size one, but training the network with batches of size one can also take a
very long time. When the batches have size one, the algorithm is called stochastic
gradient descent.
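A sketch of minibatch gradient descent (grad(theta, batch) is assumed to return the gradient of the loss over the given batch; not code from the thesis):

import numpy as np

def minibatch_gd(grad, X, theta0, delta=0.01, batch_size=50, n_epochs=10):
    theta = np.asarray(theta0, dtype=float)
    n = X.shape[0]
    for _ in range(n_epochs):
        order = np.random.permutation(n)  # randomise the order every epoch
        for start in range(0, n, batch_size):
            batch = X[order[start:start + batch_size]]
            theta = theta - delta * grad(theta, batch)  # one update per batch
    return theta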
2.3.5 Momentum
Another improvement to the gradient descent algorithm can be made by adding a
momentum to the direction of descent. This method of momentum [19] can make
the algorithm converge faster and helps to prevent getting stuck in local minima.
In Equation 2.7 the changes in the network parameters θ are directly related to the
gradient of the loss function. With momentum, the gradient will be used to update a
momentum term for each network parameter as follows:
v ← α v − δ ∇_θ J(θ).    (2.9)
This momentum term can be seen as an exponentially decaying moving average of
the past gradients. The momentum term will then be used to update the parameters
using:
θ ← θ + v. (2.10)
The momentum method introduces a new parameter α, the inertia parameter. This
parameter can have values in the range [0, 1). It is used to determine the fraction of
the previous momentum step that remains in the next step. If α is equal to zero, we
would have no momentum. If α were one, the contributions of the past gradients would just keep adding up in the momentum term, possibly leading to
a diverging momentum. By using a value for α smaller than one, a decay is added to
the momentum. The momentum term will be the largest when all the past gradients
are oriented in the same direction in the parameter space, in which case they
will amplify each other. The inertia will determine the upper bound for the size of
the momentum in relation to the size of the gradients. This factor can be found by
substituting Equation 2.9 repeatedly in Equation 2.10, yielding
δ + δα + δα² + δα³ + … = δ / (1 − α).    (2.11)
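A sketch of the momentum update of Equations 2.9 and 2.10 (same conventions as the earlier gradient descent sketch):

import numpy as np

def gd_momentum(grad, theta0, delta=0.04, alpha=0.5, n_steps=500):
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = alpha * v - delta * grad(theta)  # Equation 2.9
        theta = theta + v                    # Equation 2.10
    return theta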
One can imagine the regions of equal loss in the parameter space of a neural network
to look similar to concentric hyperellipses. Some axes of those hyperellipses will be
very short and other axes will be very long. The direction of the gradient in such a
configuration can then be almost perpendicular to the direction of the centre of the
ellipses. This will make the gradient descent algorithm very inefficient. In Figure 2.10
an example is shown of the gradient descent algorithm without momentum in an
elliptical loss function with a long and short axis.
Figure 2.10: Gradient descent without momentum.
Momentum solves this problem: along the axes where the gradient changes direction
often (short axes), the momentum will be diminished, while along the axes where the
gradient does not change often in direction (long axes), the momentum will grow in
size. Figure 2.11 shows the gradient descent algorithm with momentum with α = 0.5.
Figure 2.11: Gradient descent with momentum with an inertia α = 0.5.
2.4 Optimisation of the hyperparameters
The previous section discussed how the gradient descent algorithm can be used to
train neural networks. While the gradient descent algorithm works generally very well
for this purpose, a lot depends on properly chosen parameters for the gradient descent
algorithm. These parameters, different from the actual network parameters, are called
hyperparameters. The choice of the network architecture can also be viewed as being
part of the hyperparameters: the number of layers and neurons in each layer, the
choice of activation functions and loss function and the choice to add a bias term
or not. As mentioned in the subsection about machine learning, in order to test
the true performance of an algorithm, a separate test set needs to be used. The
hyperparameters will be optimized against a measure of the performance. However,
the test set cannot be used for this purpose: optimizing the hyperparameters is also
a form of training of the algorithm. Tuning the hyperparameters against the test
performance could result in overfitting towards this dataset. In such a case, the test
set would not give a realistic measure of the performance anymore. To solve this
problem, a second dataset will be separated from the training data, on which the
hyperparameters will be optimized. This dataset is called the validation dataset.
2.4.1 The gradient descent hyperparameters
The optimal values for the gradient descent algorithm were first tuned manually
until a reasonable solution was found. After this, an algorithm was used to find the
optimal solution in the region of the solution found by hand.
Originally, grid search was used for this purpose. In grid search, for each parameter a
grid of several values is defined. After this, each parameter combination is tried out to
find the best combination. While this approach worked well initially, it was not feasible
in the long term: the training of the more complex autoencoders could take up to
an hour. If for example we picked five values for each of the four gradient descent
parameters, it would take around 26 days to try them all out! Also, the possible values
of the parameters in a grid search are fixed to only five values, while the optimal value
might lie somewhere in between.
In this situation, a random search for hyperparameter optimisation works much better.
A random search will pick the values of parameters randomly in a predefined range of
possible values and will try to find the best set of parameters within a certain number
of attempts. Random search has the obvious benefit of allowing a continuous range
of parameters to try out. But random search has an even bigger benefit: it does
much better in finding a good solution in a multidimensional hyperparameter space
compared to grid search, within the same amount of time. This is because some
hyperparameters will have less effect than others. Grid search will try out several
combinations of these less important parameters while keeping the other parameters
constant, whereas random search will change all hyperparameters on every draw.
In this thesis, the approach has been adapted a step further: after each draw, the
hyperparameter ranges are recentred around the best solution found so far. The sizes
of the ranges were also shrunk manually as the search closed in on the best solution
(i.e. when the random search slowed down in finding better solutions near the current
best). A minimal sketch of this adaptive random search is given below.
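The function names and the fixed shrink factor in the sketch are illustrative; in the thesis the ranges were shrunk by hand:

```python
import random

def adaptive_random_search(evaluate, ranges, n_draws=50, shrink=0.95):
    """Random search that recentres the ranges around the best draw so far.

    evaluate : function mapping a dict of hyperparameters to a loss
    ranges   : {name: (low, high)} initial search ranges
    """
    centers = {k: (lo + hi) / 2 for k, (lo, hi) in ranges.items()}
    widths = {k: hi - lo for k, (lo, hi) in ranges.items()}
    best_params, best_loss = None, float('inf')
    for _ in range(n_draws):
        params = {k: random.uniform(centers[k] - widths[k] / 2,
                                    centers[k] + widths[k] / 2)
                  for k in ranges}
        loss = evaluate(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
            centers = dict(params)                  # recentre on the best draw
        widths = {k: w * shrink for k, w in widths.items()}  # narrow the search
    return best_params, best_loss
```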
This adapted random search made it possible to find the optimal parameters for each
network architecture within a day. In the next subsections, the effect of the parameters
on the convergence of the gradient descent algorithm will be explained.
2.4.1.1 Learning rate δ
The learning rate is one of the most important parameters of the gradient descent
algorithm. If the learning rate is very high, the gradient descent algorithm can make
the loss diverge.
Figure 2.12: Divergence of the gradient descent algorithm with a too large learning rate on a parabolic loss function [20].
Figure 2.12 shows an example of such a divergence on a parabolic loss function.
Starting from a point on the parabola, taking a too large step in the direction of
steepest descent makes the parameter end up on the other side of the parabola. At
this new point, the gradient is even larger than at the starting point. Since the size
of the gradient descent steps also depends on the size of the gradient, the next step
will be even larger. The step size can continue to grow this way and lead to divergence
of the loss, as the small sketch below demonstrates.
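The mechanism is easy to reproduce on the parabolic loss J(w) = w², whose gradient step is w ← (1 − 2δ)w: any learning rate δ above 1 makes each step overshoot to a point with a larger gradient. A minimal sketch:

```python
def parabola_steps(delta, w=1.0, n=5):
    """Plain gradient descent on J(w) = w^2; the gradient of w^2 is 2w."""
    trace = [w]
    for _ in range(n):
        w = w - delta * 2 * w     # w is multiplied by (1 - 2*delta) each step
        trace.append(w)
    return trace

print(parabola_steps(0.1))   # factor 0.8: converges towards 0
print(parabola_steps(1.1))   # factor -1.2: |w| grows every step, divergence
```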
Figure 2.13: The effect of different learning rates on the convergence of the loss function with the gradient descent algorithm [21].
But even if the loss does not diverge under the gradient descent algorithm, this does
not mean that the learning rate is well chosen. In Figure 2.13 the loss curves during
gradient descent are shown for different learning rates. A learning rate that is too
large, but still small enough to converge, will initially make rapid progress towards
the global minimum, but the loss may flatten out too early. Similar to the diverging
case, such a learning rate constantly overshoots the global minimum: each step is
taken in the right direction but is too large, ending up on the other side of the
minimum. The loss will then never fully converge the way it would with a well-chosen
learning rate. A learning rate that is too small poses the opposite problem: convergence
takes a very long time, and the gradient descent algorithm is more likely to get stuck
in local minima, saddle points or other flat regions of the loss function.
2.4.1.2 Batch size
Another important parameter is the size of the minibatches for the minibatch gradient
descent. As discussed, using batches smaller than the whole dataset can speed up the
gradient descent algorithm a lot. But using a too small batch size can also slow down
the algorithm: the smaller the batch size, the more random the gradient will be. To
compensate for this added randomness, the learning rate will have to be decreased.
While smaller batches reduce the cost of calculating the gradient in each step, more
steps will be needed to converge. The optimal batch size is found at the trade-off
between these two effects; a sketch of the minibatch loop is given below.
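A minimal sketch of one epoch of the minibatch loop (grad_fn is a hypothetical function returning the gradient of the loss over a batch):

```python
import numpy as np

def minibatch_epoch(X, grad_fn, w, delta=0.01, batch_size=250):
    """One epoch of minibatch gradient descent.

    Smaller batches give cheaper but noisier gradient estimates,
    so the learning rate delta may have to be lowered to compensate.
    """
    idx = np.random.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = X[idx[start:start + batch_size]]
        w = w - delta * grad_fn(batch, w)          # one step per minibatch
    return w
```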
2.4.1.3 Inertia α
Adding momentum to the gradient descent algorithm is another method by which the
algorithm can converge faster and avoid local minima. The size of the momentum
is determined by the inertia α, from which the upper bound $\frac{1}{1-\alpha}$ of the stepsize in
relation to the gradient was derived in Equation 2.11. The momentum will increase
the stepsize by this factor in the directions that maintain the orientation of their
gradients. However, too much momentum will lead to oscillations and instabilities in
the gradient descent algorithm. Reducing the learning rate will help to prevent these
instabilities. The optimal value for the inertia will be found by the trade off between
boosting the directions that maintain orientation of their gradients and keeping a high
enough learning rate in the other directions.
2.4.1.4 Initialisation range
The network parameters need to be initialised at random. One important aspect of
this initialisation is the size of the range of these random values. This size needs to
match the size of the optimal network parameters in order for the gradient descent
algorithm to work well. If the initialisation range is too small or too large, the
gradient descent algorithm will have to work too hard, making it likely to converge
to one of the trivial local minimum solutions discussed in the previous section; a
minimal sketch of such an initialisation is given below.
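The default range of 0.1 in this sketch is purely illustrative; in the thesis the range was treated as a hyperparameter:

```python
import numpy as np

def init_weights(n_in, n_out, scale=0.1):
    """Initialise a weight matrix uniformly in [-scale, scale].

    The range 'scale' is a hyperparameter: values that are too small or
    too large make gradient descent likely to end in a trivial local minimum.
    """
    return np.random.uniform(-scale, scale, size=(n_in, n_out))
```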
2.4.2 The network architecture
For each (deep) network architecture, it takes a long time to search for the best
gradient descent parameters and train the model with these optimal parameters. The
different network architectures have been optimized manually for the different purposes
of the thesis: using an algorithmic search would not have been feasible given the time
constraints.
2.5 Python and the Theano package
The thesis has been programmed in Python. Python is open source software with a
large online community that extends the language with many user-written packages.
This makes it convenient for individual users to perform computing tasks such as
machine learning. One of the Python packages that is especially useful for deep
learning is the Theano package [22].
Theano gives the user the option to define variables symbolically, leaving the
calculations with those variables to the software. It also compiles the code to run
faster and offers the option to run the code on a GPU instead of a CPU. GPUs can
run code in parallel, and calculations such as finding the gradients for each
observation of a batch are easy to parallelize, making the algorithm run much faster
on a GPU. This option has not been explored in the thesis because of a lack of
compatible hardware. One of the most important features of the Theano package is
the efficient calculation of the gradient, the most important computational task when
training a neural network. A minimal example of the Theano workflow is given below.
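In this example a symbolic loss is defined, Theano derives its gradient, and a compiled function performs one gradient descent update (the toy loss and the learning rate of 0.1 are illustrative):

```python
import theano
import theano.tensor as T

x = T.dvector('x')                    # symbolic input vector
w = theano.shared(0.5, name='w')      # shared (trainable) parameter
loss = T.mean((w * x - 1) ** 2)       # symbolic loss expression
grad = T.grad(loss, w)                # Theano derives the gradient symbolically

# Compile a training step that updates w by one gradient descent step
train = theano.function([x], loss, updates=[(w, w - 0.1 * grad)])
```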
2.5.1 Backward propagation of the gradient
One of the limiting factors in the early years of neural networks was the very slow
calculation of the gradients. Because they were so slow to train, there was not a lot
of interest in researching them, halting a lot of the progress in this domain for several
years. To give an example of what it requires to calculate the gradients, imagine a
network with layers X, Y and Z as the last three layers. On the output layer Z a loss
function J(θ) is defined, which has to be optimized with gradient descent. As was
shown in Equation 2.8, the gradients of the network can be found with the chain rule.
The partial derivative of the loss with respect to the weight $w^X_{ij}$ of the activation function going into neuron $X_j$ can then be written as
$$\frac{\partial J(\theta)}{\partial w^X_{ij}} = \sum_k \sum_l \frac{\partial X_j}{\partial w^X_{ij}} \frac{\partial Y_k}{\partial X_j} \frac{\partial Z_l}{\partial Y_k} \frac{\partial J(\theta)}{\partial Z_l}. \qquad (2.12)$$
For deep neural networks, this equation involves calculating and summing over many
variables. A naive implementation would perform this calculation for each neuron
separately, starting from the first layers. However, the derivatives can be calculated
much faster. Equation 2.12 contains many of the same factors for weights in the same
layer, and many factors are also shared across the different layers. For example, all
the weights in the network need the terms $\frac{\partial J(\theta)}{\partial Z_l}$ to calculate their derivatives.
Instead of repeating these calculations for every weight, one can start from the last
layer and calculate its derivatives first. Subsequently, the derivatives of the
second-to-last layer can be calculated from the previously obtained values, and so on,
going back to the first layers. This approach of propagating the error backward
through the network is called the backward propagation method [23, 24]. Not using
backward propagation could easily slow down the gradient descent algorithm by
multiple orders of magnitude. The Theano package makes the implementation very
easy: the user only has to define the mathematical relation between the variables,
after which the package calculates the gradient with respect to all parameters using
backward propagation.
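To illustrate the reuse of factors, the following NumPy sketch (a hypothetical two-layer sigmoid network with a least squares loss, not the thesis architecture) computes the error term of each layer once and reuses it for all weight gradients of that layer:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, t, W1, W2):
    """Weight gradients of a least squares loss for a two-layer sigmoid network.

    The delta of the output layer is computed once and propagated backward,
    instead of being re-derived for every weight as in a naive reading of
    Equation 2.12.
    """
    # Forward pass
    y = sigmoid(W1 @ x)                        # hidden layer
    z = sigmoid(W2 @ y)                        # output layer
    # Backward pass: each layer's delta is computed once, then reused
    delta_z = (z - t) * z * (1 - z)            # error at the output layer
    delta_y = (W2.T @ delta_z) * y * (1 - y)   # error propagated backward
    grad_W2 = np.outer(delta_z, y)             # reuses delta_z for all weights
    grad_W1 = np.outer(delta_y, x)
    return grad_W1, grad_W2
```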
2.5.2 Other packages
There are many other packages for deep learning like Caffe and TensorFlow. Also many
other packages, like Lasagne and Nolearn, have been built on Theano for the purpose
of deep learning. These packages take care of the implementation of the network and
its gradient descent algorithm. While this is useful for commercial purposes, it does
not give the user much incentive to learn how and why these algorithms precisely work.
One of the purposes of the thesis was to get familiar with the concepts of the gradient
descent algorithm, implemented with two of its most important extensions: minibatches
and momentum. Because of this, it has been decided to code the thesis fully using
only the Theano package for the gradient descent.
Chapter 3
Results
3.1 Training the autoencoders
In this thesis, the autoencoder networks are trained using the gradient descent algo-
rithm. The gradient descent algorithm has no inherent endpoint: the convergence of
the algorithm has to be decided by hand or using certain convergence criteria like a
maximum run time. Keeping track of the loss of the network during the gradient de-
scent algorithm can be a useful tool to help with this task. The loss can be checked
on a validation set every fixed number of steps, and the resulting sequence of loss
values can then be plotted against the number of steps. Visual inspection of such a
loss function plot gives a very good idea of the convergence of the gradient descent
algorithm, often much better than predefined convergence criteria; a sketch of this
bookkeeping is given below. The loss function plots will be used to compare the effect
of different adaptations of the gradient descent algorithm in the following subsections.
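In this sketch, train_step and val_loss are hypothetical callables wrapping the training step and the validation loss:

```python
def track_losses(train_step, val_loss, n_steps=10000, check_every=100):
    """Run gradient descent while recording the validation loss periodically.

    The returned sequence of (step, loss) pairs can be plotted against the
    step number to inspect the convergence visually.
    """
    history = []
    for step in range(n_steps):
        train_step()
        if step % check_every == 0:
            history.append((step, val_loss()))
    return history
```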
3.1.1 Adding extensions to the gradient descent algorithm
In the first plot of Figure 3.1 the loss function is shown for a simple autoencoder as a
toy model. The autoencoder has a linear activation function and one hidden layer
containing two neurons. In other words, the autoencoder will learn to perform singular
value decomposition with two dimensions. After the initialisation of the network parameters,
the values of the bottleneck neurons will all be close to zero. Initially the gradient
descent algorithm will not be able to learn much, due to lack of structure in the
network.

Figure 3.1: The loss function of the gradient descent algorithm and some extensions: (a) normal gradient descent (544 sec), (b) double learning rate (270 sec), (c) minibatches of 250 (68 sec), (d) momentum with 0.5 inertia (296 sec). The run time is shown in brackets.

This initial plateau is visible on the loss function at the start, equal to the
loss of predicting no ingredients for each recipe. Once the gradient descent algorithm
has learned some structure of the data, the first principal component will be learned at
a fast pace, after which the algorithm slows down again on a second plateau. At this
point, the second principal component has yet to be learned: one of the neuron values
is still close to zero, or both neurons have almost the same value for all observations.
In either case, the network is not using its full learning capacity. After escaping this
plateau, the gradient descent algorithm will learn the second principal component,
converging to the same loss as the singular value decomposition.
The time it took to train this simple network was rather long: 544 seconds. In each
of the other plots of the figure the gradient descent algorithm has been improved.
The second plot has a double learning rate and converged in 270 seconds, about half
of the original time. Increasing the learning rate will generally speed up the
algorithm, but there is an upper bound above which the algorithm might
not converge fully or even start to diverge. Minibatches have been used in the third
plot, yielding a convergence time of 68 seconds. Adding minibatches to the gradient
descent algorithm will speed up the calculation of the gradient in each step. In the
fourth plot momentum was added, yielding a convergence time of 296 seconds. Adding
momentum with 0.5 inertia was almost as useful as doubling the learning rate. This is
to be expected: doubling the learning rate will increase the stepsize in all directions,
while adding momentum with 0.5 inertia will double the stepsize in the directions
where the gradients do not change orientation, but diminish for the other directions.
Combining all these optimisations with the best parameter values enables the autoencoder
to be fully trained in a few seconds. This is even faster than the analytical computation
of the singular value decomposition, which takes about 15 seconds.
Figure 3.2: Too much momentum can make the gradient descent algorithm unstable (left). A diverging loss function as a result of a too large learning rate (right).
While the tools discussed above can speed up the algorithm, they can also cause
instability when overly extreme parameter values are chosen. Figure 3.2 shows two
examples of this. The first plot shows the convergence of the gradient descent algorithm
with too much momentum (inertia = 0.95). The algorithm fluctuates up and down a lot
before settling at the convergence value. This could have ended even worse, with a
diverging loss. An example of this is shown in the second plot, where a too large
learning rate was chosen. After some fluctuation, the loss shoots up. The next values
of the loss are not shown on the plot because they reached the upper bound of the
NumPy floats within the three following steps.
In Figure 3.3 the loss function for a deep autoencoder is shown.

Figure 3.3: The loss function of a complex network for the training data and validation data.

The gradient descent algorithm starts off very fast but slows down rapidly. The loss
of the deep autoencoder almost instantly drops below 0.066, which is the loss of a
singular value decomposition with the same dimension reduction and cost function. If
the gradient descent algorithm trains for too long, the training loss might start to
dip under the validation loss for models with a lot of parameters. This did not happen
for the models in this thesis: there was a lot of data and the models did not have
enough parameters for overfitting to occur.
3.1.2 Plateaus
In a linear network with a least squares cost function, the loss is always convex,
guaranteeing that the gradient descent algorithm can always escape the learning
plateaus with enough small steps. This is not the case for a general loss function. For
a non-linear network it is possible for the gradient descent algorithm to get stuck on
certain plateaus in the loss function. In Figure 3.4 the bottleneck neurons for several
types of plateaus are shown.

Figure 3.4: The bottleneck neurons of a gradient descent algorithm getting stuck on (a) a zero-dimensional, (b) a one-dimensional or (c) a two-dimensional subspace of the bottleneck neurons.

The gradient descent algorithm can get stuck when all the bottleneck neurons have a
value of zero or close to zero, as is the case in the first plot of the figure. In the
second plot the gradient descent algorithm has learned meaningful values for the first
neuron but not for the second and is unable to escape the plateau. In the third plot,
the values of the three bottleneck neurons are stuck on a two-dimensional subspace.
The gradient descent algorithm can get stuck in these plateaus if the hyperparameters
are not properly chosen.
3.2 Comparing data reduction methods
3.2.1 Singular value decomposition
In Figure 3.5 biplots of the first two components of singular value decomposition and
its variants are shown. While SVD captures the directions of greatest variation around
zero, PCA captures the directions of greatest variation within the dataset, which is
more useful. When the data is centered around zero, SVD learns the same principal
component directions as PCA. Both SVD and PCA are also very sensitive to
normalisation. Without normalisation the algorithms will try to model the
frequencies of the ingredients and the number of ingredients per recipe. By normalising
the data, these factors are eliminated and more meaningful features of the data will
be learned. To check how well the reduced features maintain the structure of the
data, the recipes are colored by their regions. As expected from the discussion above,
PCA with normalisation seems to preserve the structure best: recipes from similar
regions lie closer together.
Figure 3.5: Biplots of different variations of singular value decomposition: (a) SVD, (b) SVD with normalisation, (c) PCA, (d) PCA with normalisation.
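The relation between SVD and PCA discussed above can be checked directly with a small NumPy sketch: projecting the (optionally centred and scaled) data onto its first two right-singular vectors gives the reduced features, and with centering the singular directions coincide with the PCA directions.

```python
import numpy as np

def reduce_2d(X, center=True, normalise=True):
    """Project X onto its first two right-singular vectors.

    Without centering this is plain SVD; with centering (and scaling)
    the singular vectors coincide with the principal component directions.
    """
    if center:
        X = X - X.mean(axis=0)
    if normalise:
        X = X / (X.std(axis=0) + 1e-8)       # guard against constant columns
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                      # two reduced features per recipe
```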
3.2.2 Autoencoders
An autoencoder can learn to perform singular value decomposition when linear acti-
vation functions and a least squares cost function are chosen. When such a model is
fully trained, the bottleneck neurons of the autoencoder should have the same values
as the reduced dimensions of SVD. In Figure 3.6 the two features of SVD are shown
as well as the two bottleneck neurons of an autoencoder that has learned to perform
SVD. The autoencoder manages to reproduce the results of SVD almost perfectly,
aside from some scaling factors.
Figure 3.6: Two-dimensional features of SVD (left) versus the two bottleneck neurons of a linear autoencoder with a least squares cost function (right).
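One way to verify this agreement numerically (a sketch, assuming arrays of bottleneck values and SVD features are available) is to fit a linear map from the SVD features to the bottleneck values; an R² close to one means both methods found the same two-dimensional subspace, up to scaling and rotation.

```python
import numpy as np

def subspace_agreement(bottleneck, svd_features):
    """R^2 of a linear map from the SVD features to the bottleneck values."""
    A, _, _, _ = np.linalg.lstsq(svd_features, bottleneck, rcond=None)
    pred = svd_features @ A
    ss_res = np.sum((bottleneck - pred) ** 2)
    ss_tot = np.sum((bottleneck - bottleneck.mean(axis=0)) ** 2)
    return 1 - ss_res / ss_tot
```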
Unlike SVD, autoencoders do not need centering and normalisation of the data to
work well. This is because the autoencoder can simply add a bias term to the layers
of its network. This bias term will model the frequencies of the ingredients and the
number of ingredients per recipe. The bottleneck neurons are then free to represent
more meaningful features of the data. Centering and normalisation of the data are in
fact undesirable for the autoencoder: the values of the input variables would no longer
be fixed at zero or one, so the sigmoid activation function with cross entropy cost
function could not be used for the output neurons. This would make the network
harder to train and would destroy the interpretation of the input and output values
as an ingredient being present or absent in the recipe.
Figure 3.7: Biplots of the bottleneck neurons of non-linear autoencoders with one hidden layer (left) and three hidden layers (right).
In a complex dataset with a lot of observations, autoencoders can potentially perform
better than SVD by modelling more structure of the data. This is done by adding
more layers and using non-linear activation functions. In Figure 3.7 the bottleneck
neurons of non-linear autoencoders with one and three hidden layers are shown.
The autoencoder with one hidden layer has a cross entropy loss of 0.062, better than
the 0.066 of SVD. This model preserves the structure better than the first three
plots in Figure 3.5, as a result of the added bias term. PCA with normalisation still
seems to outperform this autoencoder, however. Adding an extra hidden layer on both
sides of the bottleneck layer gives the autoencoder shown on the right side of the
figure. This autoencoder has a cross entropy loss of 0.055, much lower than both SVD
and the autoencoder with one hidden layer. It also seems to maintain the structure
of the data better than all the other models.
Figure 3.8: Biplots of the bottleneck neurons of autoencoders with five hidden layers. For the left plot a linear activation function was used to the bottleneck layer, while for the right plot a sigmoid activation function was used.
Adding two more hidden layers improved the performance even further. In Figure 3.8
the bottleneck neurons of autoencoders with five hidden layers are shown. For the left
plot a linear activation function was used to the bottleneck layer, which resulted in
a cross entropy loss of 0.051. For the right plot a sigmoid function was used, which
resulted in a cross entropy loss of 0.048. Both models improved the loss considerably
compared to the models with fewer layers. Models with rectifier activation functions
to the bottleneck layer were also tried out, but they were visually not as satisfying.
Adding more layers was also explored, but this did not improve the results by much
compared to five hidden layers.
3.3 Prediction of the regions
In Figure 3.9 the features of the PCA model with normalisation are shown for the
training and test data, as well as the KNN and QDA predictions.
Figure 3.9: Biplots of the features of PCA with normalisation: (a) training data, (b) test data, (c) KNN predictions, (d) QDA predictions.
The North American region takes up 73.4% of the dataset and contains recipes similar
to those from all over the world. These recipes would dominate the region prediction
of all the recipes, which would not be very useful. For this reason, the North
American recipes are removed for the prediction part, but not for the training of the
models. On the reduced features of PCA with normalisation, the KNN classifier has
a prediction accuracy of 55.4%, while the QDA classifier has a prediction accuracy of
55.2%. A sketch of this prediction setup is given below.
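In this sketch with scikit-learn, the number of neighbours is illustrative; the reduced features and region labels, with the North American recipes removed, are assumed to be available:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def region_accuracies(X_train, y_train, X_test, y_test, k=15):
    """Prediction accuracy of KNN and QDA on the reduced recipe features."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)
    return knn.score(X_test, y_test), qda.score(X_test, y_test)
```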
Figure 3.10: The datasets for region prediction on the two bottleneck neurons of the deep autoencoder model with a linear activation function to the bottleneck layer: (a) training data, (b) test data, (c) KNN predictions, (d) QDA predictions.
In Figure 3.10 the features of a deep autoencoder with a linear activation function to
the bottleneck layer are shown for the training and test data, as well as the KNN and
QDA predictions. The KNN classifier has a prediction accuracy of 65.0%, while the
QDA classifier has a prediction accuracy of 57.8%.
In Figure 3.11 the features of a deep autoencoder with a sigmoid activation function
to the bottleneck layer are shown for the training and test data, as well as the KNN
and QDA predictions.

Figure 3.11: The datasets for region prediction on the two bottleneck neurons of the deep autoencoder model with a sigmoid activation function to the bottleneck layer: (a) training data, (b) test data, (c) KNN predictions, (d) QDA predictions.

The KNN classifier has a prediction accuracy of 65.4%, while the
QDA classifier has a prediction accuracy of 58.2%. Both deep autoencoder models
outperformed the prediction accuracies of the PCA model with normalisation.
The performance on the raw dataset was also measured. KNN gave the best perfor-
mance with a prediction accuracy of 69.8%. This is very close to the KNN prediction
values of the deep autoencoder models with two bottleneck neurons, while not so
close to the KNN prediction of the PCA with normalisation. This suggests that deep
autoencoders retain much more structure of the data when reducing the dimensions.
As a side experiment, some models were tested with a higher number of bottleneck
neurons, from which a model with 100 bottleneck neurons was selected as the model
with the highest prediction accuracy on the validation dataset. The prediction accu-
racy of this model on the test data was 72.0%. This suggests that deep autoencoder
models can also be useful for representation learning, although the benefit compared
to the raw dataset was minimal.
3.4 Collaborative filtering for recipe creation
Several autoencoder architectures have been explored for collaborative filtering. It was
found that deeper models performed better. Using sigmoid activation functions instead
of rectifier activation functions also improved the performance. The author also noted
something peculiar: models that were not fully trained appeared to perform better
than models for which the gradient descent algorithm had fully converged, even though
the fully trained models were not overfitted: they had the lowest cross validation loss,
and this loss was very close to the training loss. The author sees no obvious reason
why the not fully trained models performed better. From all the different models, the
one performing best on the validation dataset was used on the test dataset.
As explained in Chapter 2, the recipes are modified by randomly either removing or
adding an ingredient as a way to measure the performance. If the autoencoder works
well for collaborative filtering, it should be able to recover the changed ingredient.
For the recipes with a removed ingredient, the ingredients not present in the adapted
recipe are ranked on how well they would fit the adapted recipe. For the recipes
with an added ingredient, the ingredients of the adapted recipe are ranked on how
badly they fit the adapted recipe. In both cases, a low rank means the autoencoder
did well in recovering the changed ingredient; a sketch of the rank computation is
given below.
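In this sketch the array names are illustrative: for a removed ingredient the candidates outside the adapted recipe are ranked by descending reconstruction value, while for an added ingredient the recipe's own ingredients are ranked by ascending value.

```python
import numpy as np

def retrieval_rank(reconstruction, recipe_mask, changed_idx, removed=True):
    """1-based rank of the changed ingredient among the relevant candidates.

    reconstruction : reconstruction values for all ingredients (network output)
    recipe_mask    : boolean array, True for ingredients in the adapted recipe
    changed_idx    : index of the removed or added ingredient
    """
    if removed:
        candidates = np.where(~recipe_mask)[0]   # ingredients not in the recipe
        order = candidates[np.argsort(-reconstruction[candidates])]  # high = fits
    else:
        candidates = np.where(recipe_mask)[0]    # ingredients in the recipe
        order = candidates[np.argsort(reconstruction[candidates])]   # low = misfit
    return int(np.where(order == changed_idx)[0][0]) + 1
```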
3.4.1 Reconstruction of the removed ingredient
In Table 3.1 an example of recipe retrieval is shown for one of the test recipes. The
original recipe contains the following ingredients: cocoa, cream cheese, eggs, milk,
wheat and vanilla. This recipe has been modified by removing vanilla. The modified
recipe has been put through the network, resulting in reconstruction values for all
ingredients. From the ingredients not included in the modified recipe, the five with the
highest reconstruction values were selected. The missing ingredient vanilla ranks second
with a reconstruction value of 55%. For this recipe, the model did very well in retrieving
the missing ingredient. Expecting a perfect retrieval is not reasonable: other ingredients
might also combine well with the adapted recipe. Indeed, the other suggestions in the
table would be good combinations with the adapted recipe.
Ingredients:         cocoa, cream cheese, eggs, milk, wheat
Suggestions to add:  cream   vanilla   butter   yeast   vegetable oil
Reconstruction %:    58      55        35       25      20

Table 3.1: The first row contains the ingredients of a recipe from which vanilla was removed. The following rows contain the top five suggestions of ingredients to add to the adapted recipe, with their reconstruction values in the last row. The removed ingredient vanilla was ranked second among the suggestions.
In Table 3.2 the performance measures for ingredient retrieval of the autoencoder are
shown and compared to the two models of De Clercq et al. [2]. The deep autoencoder
has a mean rank of 25.2 and a median rank of 8 for the removed ingredient. This
performance is very good: randomly selecting ingredients would result in ranks uniformly
distributed between one and the number of ingredients not in the recipe (there are 381
ingredients in total). Compared to the models of De Clercq et al., the autoencoder
outperforms the non-negative matrix factorisation model and comes close in performance
to the two-step kernel ridge regression model. Deep autoencoders can thus be used as
an alternative method for collaborative filtering.
Performance measure    mean rank   median rank   % with rank ≤ 10
Deep autoencoder       25.2        8             54.5
NMF                    33.0        12            48.2
Two-step KRR           23.6        7             59.1

Table 3.2: Comparing the rank of ingredient reconstruction for the deep autoencoder, non-negative matrix factorization (NMF) and two-step kernel ridge regression (two-step KRR) models.
3.4.2 Elimination of the added ingredient
In Table 3.3 an example of elimination of an added ingredient is shown for one of the
test recipes. The original recipe contains the following ingredients: milk, coffee and
cocoa. This recipe has been modified by adding mustard. The modified recipe has
been put through the network, resulting in the reconstruction values of the ingredients
shown in the table. The added ingredient mustard ranks lowest, with a reconstruction
value of 3%. The model did very well in identifying the added ingredient.
Ingredients:        mustard   cocoa   coffee   milk
Reconstruction %:   3         84      94       96

Table 3.3: A recipe for cappuccino. Mustard has been randomly added as an extra ingredient and has a very low reconstruction value, unlike the other ingredients. The model predicts mustard as the first-ranked ingredient to eliminate from the recipe.

In Table 3.4 the performance measures for ingredient elimination of the autoencoder
are shown. The model has a mean rank of 1.5 and a median rank of 1 for the added
ingredient, and eliminated the correct ingredient 78.8% of the time. Eliminating an
added ingredient is much easier than retrieving a removed ingredient, since the model
only has to pick from the ingredients that are in the adapted recipe. Nevertheless,
this performance is very good.
Performance measure    mean rank   median rank   % first rank
Deep autoencoder       1.5         1             78.8

Table 3.4: The performance measures of the elimination of a randomly added ingredient.
Chapter 4
Conclusion and discussion
4.1 Conclusion
In the thesis, it was explored how to train deep autoencoder networks on cooking
recipes using the gradient descent algorithm. Since it can be very hard to train deep
autoencoders [1], several extensions were added to improve the gradient descent algo-
rithm. Adding minibatches and momentum to the gradient descent algorithm helped
to speed up the algorithm and improved the performance. Pretraining the network
with similar, easier to train networks prevented the algorithm from getting stuck in
plateaus with a high loss.
These deep autoencoders were then compared to singular value decomposition for the
purpose of data reduction of the ingredients of recipes to two dimensions. To measure
the performance, the cross entropy loss for the reconstruction of the ingredients was
used. Singular value decomposition had a loss of 0.066, while the best deep autoencoder
performed much better with a loss of 0.048.
On the two reduced dimensions of all models, supervised machine learning was used
to predict the regions of the recipes. The supervised learning was done using two
algorithms: KNN and QDA, with KNN outperforming QDA on all models. On the SVD
model, the KNN algorithm had a prediction accuracy of 55.4%. On the deep autoen-
coder models, the KNN algorithm had a much better prediction accuracy: 65.0% for
the model with a linear activation function to the bottleneck neurons and 65.4% for
the model with a sigmoid activation function to the bottleneck neurons. Performing
KNN on the raw dataset resulted in a prediction accuracy of 69.8%, suggesting that
the two bottleneck neurons of the deep autoencoders maintained the structure of the
regions very well. Performing KNN on the reduced features of a deep autoencoder
with 100 bottleneck neurons gave a prediction accuracy of 72.0%, suggesting deep
autoencoders might have some usefulness for representation learning on the dataset.
Separate deep autoencoder models were trained for collaborative filtering with the
purpose of building recommender systems. The performance of these recommender
systems was tested by the ranks of the recommendations for the ingredients that were
randomly either removed from or added to a recipe. De Clercq et al. [2] have built
two similar recommender models on the same dataset. As can be seen in Table 4.1,
the deep autoencoder outperforms the non-negative matrix factorization and comes
close to the two-step kernel ridge regression.
Performance measure            mean rank   median rank   % first rank   % top 10
Reconstruction DAE             25.2        8             /              54.5
Reconstruction NMF             33.0        12            /              48.2
Reconstruction two-step KRR    23.6        7             /              59.1
Elimination DAE                1.5         1             78.8           /

Table 4.1: The performance measures of the reconstruction of a randomly removed ingredient for the deep autoencoder (DAE), non-negative matrix factorization (NMF) and two-step kernel ridge regression (two-step KRR) models. The elimination performance for the added ingredients is also included.
4.2 Discussion
To improve the gradient descent algorithm even further, a variant of an adaptive
learning rate could be implemented, rather than using a fixed learning rate. Also, only
a certain number of deep autoencoder architectures were explored in the thesis. All
the deep autoencoder models showed very little overfitting, since the training and validation
loss were always very close. It is possible that deep autoencoder architectures with
more parameters might fit the data better, although these types of complex models
will likely be even harder to train. The different performance measures explored in the
thesis could be further improved with those models. This could be explored in further
research.
Appendix A
Admission for circulating the work
The author, promoter and co-promoter give permission to consult this master dis-
sertation and to copy it or parts of it for personal use. Any other use falls under
the restrictions of the copyright, in particular concerning the obligation to explicitly
mention the source when using results of this master dissertation.
Bibliography
[1] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of
data with neural networks. Science, 313(5786):504–507, 2006.
[2] Marlies De Clercq, Michiel Stock, Bernard De Baets, and Willem Waegeman.
Data-driven recipe completion using machine learning methods. Trends in Food
Science & Technology, 49:1–13, 2016.
[3] Yong-Yeol Ahn, Sebastian E Ahnert, James P Bagrow, and Albert-László
Barabási. Flavor network and the principles of food pairing. Scientific Reports, 1,
2011.
[4] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George
Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershel-
vam, Marc Lanctot, et al. Mastering the game of go with deep neural networks
and tree search. Nature, 529(7587):484–489, 2016.
[5] Douglas B Lenat and Ramanathan V Guha. Building large knowledge-based sys-
tems; representation and inference in the Cyc project. Addison-Wesley Longman
Publishing Co., Inc., 1989.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016. http://www.deeplearningbook.org.
[7] Shlomo Mor-Yosef, Arnon Samueloff, Baruch Modan, Daniel Navot, and Joseph G
Schenker. Ranking the risk factors for cesarean: logistic regression analysis of a
nationwide study. Obstetrics & Gynecology, 75(6):944–947, 1990.
[8] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional
networks. In European conference on computer vision, pages 818–833. Springer,
2014.
[9] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm
for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
[10] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy
layer-wise training of deep networks. Advances in neural information processing
systems, 19:153, 2007.
[11] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using
context-dependent deep neural networks. In Interspeech, pages 437–440, 2011.
[12] Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray
Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from
scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International
Journal of Approximate Reasoning, 50(7):969–978, 2009.
[14] Andrew Ng. Machine learning course on Coursera. https://www.coursera.
org/learn/machine-learning. Accessed: 2017-01-23.
[15] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural
nets and problem solutions. International Journal of Uncertainty, Fuzziness and
Knowledge-Based Systems, 6(02):107–116, 1998.
[16] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pas-
cal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep
learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet classification. In
Proceedings of the IEEE international conference on computer vision, pages 1026–
1034, 2015.
[18] D Randall Wilson and Tony R Martinez. The general inefficiency of batch training
for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.
[19] Boris T Polyak. Some methods of speeding up the convergence of iteration
methods. USSR Computational Mathematics and Mathematical Physics, 4(5):
1–17, 1964.
[20] Divergence of the gradient descent algorithm with a too large learning rate on
a parabolic loss function. http://www.cs.cornell.edu/courses/cs4780/
2015fa/web/lecturenotes/lecturenote07.html. Accessed: 2017-01-23.
[21] The effect of the different learning rates on the convergence of
the loss function with the gradient descent algorithm. https:
//leonardoaraujosantos.gitbooks.io/artificial-inteligence/
content/more_images/learningrates.jpeg. Accessed: 2017-01-23.
[22] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pas-
canu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Ben-
gio. Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in
Science Conf, pages 1–7, 2010.
[23] Martin Fodslette Møller. A scaled conjugate gradient algorithm for fast supervised
learning. Neural networks, 6(4):525–533, 1993.
[24] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E
Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to
handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.