
Measuring image classification precision of neural networks with different architectures

    Manuel Carbonell - Autonomous University of Barcelona

    Master in Modelling for Science and Engineering

    Tutor: Ruben Tous

    May 31, 2016

    www.github.com/manucarbonell/convnet

    Abstract

In this article we study artificial neural network models applied to computer vision, and how modifications of their architecture affect training performance and the precision of the predictions.

Contents

1 Introduction and previous concepts
  1.1 Motivation and objectives
  1.2 Artificial Intelligence
    1.2.1 History
  1.3 Feedforward neural networks
    1.3.1 Perceptrons
    1.3.2 Perceptron output with step function
    1.3.3 An example
    1.3.4 Layers
    1.3.5 Network training and sigmoid neurons
  1.4 Learning with gradient descent
    1.4.1 Cost function
    1.4.2 Backpropagation algorithm equations
    1.4.3 Backpropagation algorithm steps
  1.5 Types of layers
    1.5.1 Convolutional layers
    1.5.2 An example
    1.5.3 Pooling layers
    1.5.4 Rectifier linear unit layers
    1.5.5 Local Response Normalization layers
    1.5.6 Softmax layers
  1.6 Why are CNNs so effective on image data?
  1.7 Tensorflow

2 Related work
  2.1 Universal approximation of functions
  2.2 Recurrent Neural Networks
  2.3 Deep Belief Networks (DBNs)
  2.4 Deep Dream

3 Methodology
  3.1 The dataset
  3.2 Data augmentation
  3.3 Hardware setup
  3.4 Implemented and customized programs

4 Results
  4.1 Single softmax layer network
  4.2 Convolution
  4.3 Convolution and pooling
  4.4 Two convolutional and pooling layers
  4.5 Image size augmentation
  4.6 Overfitting
  4.7 Data amount augmentation
  4.8 Batch size
  4.9 Extracted features
  4.10 Retrain ImageNet Model
    4.10.1 Bottlenecks

5 Conclusions and future work

    1 Introduction and previous concepts

    1.1 Motivation and objectives

We first give a general contextualization and introduction to ANNs (artificial neural networks) and their parts, and then describe how to train a fully connected network with the gradient descent method. We then check how essential each part of the network is by running the training algorithm with different setups. Along the way we introduce the recently launched library Tensorflow and discuss the results obtained.

    1.2 Artificial Intelligence

Artificial neural networks are one of the trending research topics in artificial intelligence (AI). Among the main goals into which the simulation of intelligence is split, namely deduction, knowledge representation, planning, natural language processing (communication), perception, motion and manipulation, and learning, neural networks belong to the last one, more specifically called machine learning (ML). In that branch of AI we study algorithms that improve automatically through experience, by approximating functions according to given data, usually called training data; that is why when running ANN algorithms we say that the network is learning or being trained.


Machine learning is split into supervised, unsupervised and reinforcement learning. In unsupervised learning the objective is to infer a function that describes hidden structure in unlabeled data, e.g. finding similarities between data points, clustering the data or reducing its dimension; examples of unsupervised learning are k-means, maximum likelihood estimation and principal component analysis. Supervised learning consists of using labeled training data to estimate a map that returns the right label when receiving new data following the same pattern as the training set. Convolutional neural networks are an example of supervised learning, where for example we classify images by their label according to a labeled training data set. In reinforcement learning, programs are rewarded for taking actions in an environment so as to maximize some notion of cumulative reward.

There is disagreement about whether biological foundations are important for continuing the development of AI, as happens with ANNs, which are inspired by biological neurons, or whether it should be a completely independent research field, the same way bird biology has contributed little to most ideas in aeronautical engineering. Until now AI research has been mostly statistical. In many specific AI tasks, e.g. recognizing a song, where a fingerprint of the audio frequencies is generated, a purely statistical method produces results similar to those a human would give. But are we then just doing classical statistics work? We can find some differences between classical statistics and AI:

The dimensions of the data: in classical statistics we have low-dimensional data sets, e.g. fewer than 100 dimensions; in AI we can have many more than that.

In classical statistics there is a lot of noise in the data, which might make it difficult to find a structure; in AI the noise is not sufficient to hide the structure of the data when it is properly processed.

In classical statistics there is not much structure in the data, and when there is, it can be represented with a simple model with few parameters; in AI the structure is too complicated to be represented by a simple model with few parameters.

The objective in classical statistics is usually to reveal a structure hidden by noise; in AI the objective is to present a complicated structure in a way that can be learned.

    1.2.1 History

The first scientific approach to artificial neural networks was made by Warren McCulloch and Walter Pitts [1] in 1943. Their objective was to mathematically formalize the behavior of brain neurons when we perform logical reasoning or read our senses' inputs. In 1949 the neuroscientist Donald Hebb proposed a theory for the adaptation of the neurons in the brain during the learning process, which was named after him, Hebbian learning. He already thought of the idea of weights in the connections between neurons, which appear in present artificial neural network (ANN) models. The model states that the weight between two neurons increases if the two neurons activate simultaneously, and decreases if they activate separately.


This is not how convolutional neural networks (CNNs) work, but it was a start. In 1948 Alan Turing suggested a model of computation called the unorganized machine, thinking of the cortex of a human infant, which is largely random initially but can be trained to perform particular tasks. In this model Turing defined A-type machines, which consisted of randomly connected networks of NAND logic gates, and B-type machines, which were built by taking A-type machines and substituting the inter-node connections with structures called connection modifiers, themselves made of A-type nodes. These connection modifiers were supposed to undergo appropriate interference, mimicking education.

Frank Rosenblatt first built an electronic device called the perceptron in 1960, which represented a neuron as a logical gate with weights and a bias. The utility of the model, though, was not observable due to the lack of computing resources, which is why the idea of the neural network did not become a trend until recent times. It was in 1975 that Paul Werbos [4] thought of applying the backpropagation algorithm to find an optimal solution for the parameters of a neural network, which greatly improved the problem-solving capability of neural networks and is now the state of the art for image recognition. After this advance, much research in this field was carried out again and the quality of the predictions gradually improved until present times, when ANNs are the state of the art for image recognition, achieving human-level precision and replacing methods such as support vector machines or random forests, as happened in the ImageNet 2012 contest, where the winning image classification team used a convolutional neural network [2] that stood out from all other methods. Next we introduce the components of a present-day convolutional neural network.

    1.3 Feedforward neural networks

    1.3.1 Perceptrons

An artificial neural network is a graph whose nodes are a special kind of logical gates called perceptrons or artificial neurons, which have parameters that allow their behavior to change so that patterns can be recognized in data sets.

Figure 1: Animal neurons gradually change with synapses

They are called neural networks because the original model was supposed to be inspired by animal neurons, which have the property of gradually changing chemically when performing synapses and, when connected, of extracting an abstract concept. The same happens in the artificial neural network model, where values of parameters are changed instead of physical and chemical properties. Perceptrons receive many inputs and compute an output by means of weights and biases. We can imagine perceptrons as decision-taking units that consider different sources of information and give different importance to each source.


Figure 2: Perceptron with 3 inputs $x_i$, weights $w_i$ and one output $a$

    1.3.2 Perceptron output with step function

We first represent the output $a$ of a perceptron with inputs $\{x_i\}_{i=1}^m$ weighted by $\{w_i\}_{i=1}^m$ and bias $b$ with the step function:

$$a(x_1, \dots, x_m, w_1, \dots, w_m, b) = \begin{cases} 0 & \text{if } \sum_{i=1}^m x_i w_i < b \\ 1 & \text{if } \sum_{i=1}^m x_i w_i \geq b \end{cases} \qquad (1)$$

    1.3.3 An example

A real-life example of a perceptron could be the decision of whether or not to enroll in a given master program. The input variables could be: whether the subjects are of interest to you, whether the references you have about job possibilities after the master are good or bad, and whether the university is close to your place. You would take a positive decision if the sum of the weighted input variables is greater than a given bias. Let us say you give a weight of 5 to the interest of the subjects, 4 to the goodness of the job opportunities and 3 to the closeness of the university. Then, if the bias is for example 5, interesting subject content alone would be enough to enroll in the master, but if the bias is for example 7, then two conditions should be positive in order to enroll.

From now on we will use the following notation for inputs, weights and biases: $w^l := (w^l_{jk})$ is the weight matrix for layer $l$, where each element $w^l_{jk}$ denotes the weight of the connection from the $k$-th neuron in the $(l-1)$-th layer to the $j$-th neuron in the $l$-th layer; $x := (x_1, \dots, x_m)$ denotes the input vector; and $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ is the weighted input to the activation function of neuron $j$ in layer $l$.
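To make the perceptron rule of equation (1) concrete, here is a minimal Python sketch of the enrollment example above; the weights (5, 4, 3) and the two bias values come from the text, while the function and variable names are illustrative.

```python
def perceptron(inputs, weights, bias):
    """Step-function perceptron: fires iff the weighted sum reaches the bias."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= bias else 0

# Enrollment decision: interesting subjects, good job prospects, university nearby.
weights = [5, 4, 3]

# With bias 5, interesting subjects alone (5 >= 5) are enough to enroll.
print(perceptron([1, 0, 0], weights, bias=5))   # 1
# With bias 7, a single positive condition is no longer enough.
print(perceptron([1, 0, 0], weights, bias=7))   # 0
print(perceptron([1, 1, 0], weights, bias=7))   # 1
```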


    Figure 3: Step function.

    1.3.4 Layers

Perceptrons are organized in different layers which represent levels of abstraction of the decision taking, from lower to higher. That means the first layers recognize concrete, non-general patterns in the data, and the last layers give an abstract classification of the data, such as whether the picture shows a 0 or a 1, or whether there is an eye in the picture. Many layers together form a neural network.

Figure 4: Neural network with 4 layers (an input layer, two hidden layers and an output layer).

The first layer is called the input layer; it receives the information that has to be processed by the network (in our case, image pixel intensities). Next come the hidden layers. In this example picture we only show two hidden layers, but we can find networks with 12 hidden layers, for example. The hidden layers process the input layer outputs to give the output layer a final result.

    1.3.5 Network training and sigmoid neurons

Our purpose is to get the network to give the output we want when we present it with a given input. For this we will proceed with a method called


network training or learning. The idea of training the network consists of giving it a large amount of input data together with the expected output results, and adapting the network parameters to fit these expected results as well as possible. Say we want the network to recognize whether faces appear in a picture; then we will give it as input many pictures that contain a face and many pictures that do not, together with labels telling whether a face actually appears. For every picture we gradually modify the weights and biases of our network so that the output more and more frequently matches the label. To do that we use the gradient descent method, which we explain below. After the training, the network should be able to recognize with high accuracy whether there is a face in a new input picture. The perceptrons described above are an intuitive, approximate and simple neuron model that was later developed into sigmoid neurons, which we define below. The output function of a sigmoid neuron is not the one described in (1), since with a step function the behavior of the network would be chaotic when modifying weights and biases. We therefore introduce sigmoid neurons, which compute their output with the sigmoid function. The output $a$ of a sigmoid neuron is defined with the sigmoid function, a smooth-shaped version of the step function:

$$\sigma(z) = \frac{1}{1 + e^{-z}} \qquad (2)$$

    Figure 5: Sigmoid function.

The reason why such a function is chosen for the perceptron output is its smooth shape and the property that makes it similar to the step function: for weighted inputs much greater than the bias (i.e. $z \to \infty$) the output is close to 1, and equivalently for $z \to -\infty$ the output is close to 0. The important difference with respect to the step function is that now, when we slightly change the weights and biases of a perceptron, the output also changes slightly, due to the continuity and smoothness of the sigmoid function. This will allow us to search, using the gradient descent method, for the optimal weights and biases of each perceptron such that we get the targeted output for a given input. Putting together the definitions of neural network and sigmoid neuron, the activation of the $j$-th neuron in layer $l$ is going to be:

$$a^l_j = \sigma\!\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) = \sigma(z^l_j) \qquad (3)$$

where $w^l_{jk}$ is the weight applied to the activation of the $k$-th neuron of the $(l-1)$-th layer entering the $j$-th neuron of the $l$-th layer, and $b^l_j$ is the bias of the $j$-th neuron in the $l$-th layer.
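Equation (3) is simply a matrix-vector product followed by an element-wise sigmoid. A minimal NumPy sketch of one layer's forward pass (the layer sizes and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_activation(w, a_prev, b):
    """Compute a^l = sigma(w^l a^{l-1} + b^l) for one layer, as in equation (3)."""
    z = w @ a_prev + b          # weighted input z^l
    return sigmoid(z)           # activation a^l

# Example: a layer with 3 inputs and 2 neurons.
rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))     # weight matrix w^l, shape (neurons, inputs)
b = rng.normal(size=2)          # biases b^l
a_prev = np.array([0.5, 0.1, 0.9])
print(layer_activation(w, a_prev, b))
```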


    1.4 Learning with gradient descent

    1.4.1 Cost function

We denote the target output or desired output of the network when $x$ is the input by $y(x)$, and the neuron output by $a(x, w, b) = a(z)$ (the desired output does not depend on the weights and biases of the neurons, but the neuron output does). We want the training algorithm to determine which weights and biases best approximate the outputs $a(x, w, b)$ to $y(x)$ for all inputs $x$. We define a cost function (also called loss function) as a measurement of the goodness of fit of a neural network with weights $w$ and biases $b$ to a target $y(x)$. First we have the quadratic cost function:

$$C(w, b) = \frac{1}{2n} \sum_x \left\| a(x, w, b) - y(x) \right\|^2 \qquad (4)$$

where $n$ is the number of training inputs and the sum is over each input $x$. In the quadratic cost function we can easily observe the main properties that any cost function should have:

It is positive over the whole domain of the function.

The more the outputs of the network differ from the label, the higher the value taken by the function.

The cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples $x$.

It can be written as a function of the outputs of the neural network.

From now on we will write $a$ instead of $a(z)$ and $y$ instead of $y(x)$ to ease the reading.

Cost functions are defined with the objective of finding some weight values $w$ and biases $b$ such that the output $a$ is as frequently as possible the same as the target $y$; equivalently, of finding a minimum of the function $C$ by varying $w$ and $b$. We could use an analytic method, solving the equation that sets the gradient to zero to find local minima and checking which one is the lowest, but since the number of variables is going to be very large and the shape of the function tends to be quite complicated, that method would be too costly and we would probably not get close to the real minimum; that is why we use the gradient descent method instead.

The gradient descent method consists of gradually getting closer to a local or absolute minimum $(w_0, b_0)$ of the function by subtracting the gradient scaled by a small value called the learning rate, based on the fact that in a multidimensional scalar function the gradient vector indicates the direction of maximum growth, so the opposite vector indicates the direction of maximum descent. In each step of the gradient descent method the weights and biases change following

$$(w, b)_{n+1} = (w, b)_n - \eta \, \nabla C_x(w, b) \qquad (5)$$

where $w$ is the weight vector, $b$ the biases, $\eta$ the learning rate, $x$ a fixed input and $C_x(w, b)$ the cost function.
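Equation (5) translates directly into a parameter-update loop. A toy NumPy sketch on a two-variable quadratic cost, just to illustrate the update rule; the cost and its gradient here are illustrative, not the network's:

```python
import numpy as np

def gradient_descent(grad, theta0, learning_rate=0.1, steps=100):
    """Repeat theta <- theta - eta * grad(theta), as in equation (5)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy cost C(w, b) = (w - 3)^2 + (b + 1)^2, with gradient (2(w - 3), 2(b + 1)).
grad = lambda t: np.array([2 * (t[0] - 3), 2 * (t[1] + 1)])
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # approaches (3, -1)
```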


    1.4.2 Backpropagation algorithm equations

Now the question is: how do we calculate the gradient of this cost function, of which we do not even know the concrete expression? The answer is to use the backpropagation algorithm. Before describing it we need to define a couple of equations. We define the error of a neuron $j$ in layer $l$ as the variation of the cost function with respect to the weighted input plus bias of that neuron:

$$\delta^l_j := \frac{\partial C}{\partial z^l_j} \qquad (6)$$

For two matrices $A, B$ of the same dimensions $m \times n$, the Hadamard product, or element-wise matrix product, $A \odot B$ is a matrix of the same dimension as the operands, with elements given by

$$(A \odot B)_{i,j} = (A)_{i,j} (B)_{i,j} \qquad (7)$$

Claim 1. Let $L$ be the number of layers of a neural network and $j$ one of the neurons of its output layer; then we have the following equality for the neuron error:

$$\delta^L_j = \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j) \qquad (8)$$

or, in matrix form,

$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$

Proof. Let us first apply the definition of error to a neuron in layer $L$:

$$\delta^L_j = \frac{\partial C}{\partial z^L_j} \qquad (9)$$

We have to develop the right-hand side of the equality until we reach the right-hand side of equation (8). Differentiating the cost function with the chain rule, and taking into account that the activations $a^L_k$ of the neurons in layer $L$ depend on $z^L_j$, we get the intermediate step

$$\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \qquad (10)$$

where the sum is over all neurons in the output layer. The activation of a neuron in a given layer only depends on the input received by that same neuron, not on the other neurons of the layer, so the term $\partial a^L_k / \partial z^L_j$ is equal to zero when $k \neq j$. In consequence we have

$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \qquad (11)$$

but from (3) we know that $\partial a^L_j / \partial z^L_j = \sigma'(z^L_j)$, which finishes the proof.

Claim 2. The errors of two consecutive neural network layers are related by the following equality:

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \qquad (12)$$

where $(w^{l+1})^T$ is the transpose of the weight matrix for layer $l+1$.


Proof. Taking into account the relation between the weighted input of a neuron in a layer and the weighted inputs of the following layer, we can write

$$\delta^l_j = \frac{\partial C}{\partial z^l_j} \qquad (13)$$

$$= \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \qquad (14)$$

$$= \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j}\, \delta^{l+1}_k \qquad (15)$$

and from the definitions of $z$ and the activations,

$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k \qquad (16)$$

Differentiating we get

$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj}\, \sigma'(z^l_j) \qquad (17)$$

and substituting back into the previous expression we get

$$\delta^l_j = \sum_k w^{l+1}_{kj}\, \delta^{l+1}_k\, \sigma'(z^l_j) \qquad (18)$$

or, in matrix form,

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \qquad (19)$$

Claim 3. We have the following equalities for the gradient components:

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j; \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j$$

Proof. For the first equality we differentiate the expression $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$ with respect to $b^l_j$:

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial b^l_j} \qquad (20)$$

$$= \frac{\partial C}{\partial z^l_j} \cdot 1 = \delta^l_j \qquad (21)$$


and for the second,

$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}} \qquad (22)$$

$$= \delta^l_j\, \frac{\partial z^l_j}{\partial w^l_{jk}} \qquad (23)$$

$$= \delta^l_j\, \frac{\partial}{\partial w^l_{jk}} \left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) \qquad (24)$$

$$= a^{l-1}_k\, \delta^l_j \qquad (25)$$

    1.4.3 Backpropagation algorithm steps

Now that we have shown all the necessary equations, we can list the steps of the backpropagation algorithm to calculate the gradient of the cost function. We denote by $a^{x,l} = (a^l_1, \dots, a^l_n)$ the vector of neuron activations in layer $l$ when $x$ is the input, where $a^l_j$ is defined in (3).

Input: for all neurons in the input layer, set the values $a^1$ to the corresponding pixel intensities of the example image.

Feedforward: for each layer $l \in \{2, \dots, L\}$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.

Output error $\delta^L$: calculate the output error $\delta^L = \nabla_a C \odot \sigma'(z^L)$.

Backpropagation: after calculating the last layer error, backpropagate it towards the first layers: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ for $l \in \{L-1, \dots, 2\}$.

Output gradient components: compute the components of the cost function gradient as given in Claim 3, $\partial C / \partial b^l_j = \delta^l_j$ and $\partial C / \partial w^l_{jk} = a^{l-1}_k \delta^l_j$.

This process is done for all examples $x$ in a given subset of the training set, usually called a batch, and then the weights are updated (gradient descent step):

$$w^l \to w^l - \frac{\eta}{m} \sum_x \delta^{x,l} \left( a^{x,l-1} \right)^T \qquad (26)$$


$$b^l \to b^l - \frac{\eta}{m} \sum_x \delta^{x,l} \qquad (27)$$

Notice that the output error is very simple to calculate when the cost function is quadratic:

$$\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma(z^L)\left(1 - \sigma(z^L)\right) \qquad (28)$$

since

$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \sigma(z)\left(1 - \sigma(z)\right) \qquad (29)$$
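The steps above map almost line by line onto NumPy code. The following sketch computes the gradients for a single training example on a small fully connected network with quadratic cost; the layer sizes and names are illustrative, not the ones used later in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Return (dC/db, dC/dw) per layer for one example, quadratic cost."""
    # Feedforward: store all z^l and a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output error, equation (28): delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b = [delta]
    grad_w = [np.outer(delta, activations[-2])]
    # Backpropagate, equation (12): delta^l = (w^{l+1}.T delta^{l+1}) * sigma'(z^l).
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b.insert(0, delta)
        grad_w.insert(0, np.outer(delta, activations[-l - 1]))
    return grad_b, grad_w

# Tiny 3-4-2 network on a random example.
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
gb, gw = backprop(rng.normal(size=3), np.array([1.0, 0.0]), weights, biases)
print([g.shape for g in gw])   # [(4, 3), (2, 4)]
```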

The backpropagation algorithm, concretely equation (8), tells us that the variation of the weights and biases in every step depends on the derivative of the activation function, which in general will be the sigmoid. Looking at its limits we see that $\lim_{z \to \infty} \sigma'(z) = \lim_{z \to -\infty} \sigma'(z) = 0$, so for very large or very small values of $z$ the variation of the cost will be very small. When this happens the learning process gets increasingly slower, and we say that the neuron is saturated. To avoid the learning slowdown caused by neuron saturation, we use the cross-entropy cost function

$$C = -\frac{1}{n} \sum_x \sum_j \left[ y_j \ln a^L_j + (1 - y_j) \ln\!\left(1 - a^L_j\right) \right] \qquad (30)$$

The motivation to use such a function is that, in addition to fulfilling the desired properties of a cost function mentioned before, its partial derivatives do not depend on the derivative of the activation function $\sigma'(z)$, which as we said causes the saturation. Indeed, if we calculate the derivative of the cross-entropy cost function with respect to the weights (for a single sigmoid neuron with weighted input $z = \sum_j w_j x_j + b$):

$$\frac{\partial C}{\partial w_j} = -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \qquad (31)$$

$$= -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z)\, x_j \qquad (32)$$

$$= \frac{1}{n} \sum_x \frac{\sigma'(z)\, x_j}{\sigma(z)\left(1 - \sigma(z)\right)} \left( \sigma(z) - y \right) \qquad (33)$$

$$= \frac{1}{n} \sum_x x_j \left( \sigma(z) - y \right) \qquad (34)$$

and with respect to the bias:

$$\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x \left( \sigma(z) - y \right) \qquad (35)$$

In any case, when we use linear neurons, that is, neurons with a (non-constant) linear activation function, neuron saturation does not happen, because their derivative is never zero nor asymptotically close to it, so in that case we could use the quadratic cost function.
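A small sketch of the point of the derivation above: with cross-entropy, the gradient for a single sigmoid neuron depends only on the prediction error $\sigma(z) - y$, not on $\sigma'(z)$, so it does not shrink when the neuron saturates. The inputs, weights and function names below are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_single_neuron(x, y, w, b):
    """Compare dC/dw for quadratic vs cross-entropy cost, single sigmoid neuron."""
    z = np.dot(w, x) + b
    a = sigmoid(z)
    grad_quadratic = (a - y) * a * (1 - a) * x      # contains sigma'(z) = a(1 - a)
    grad_cross_entropy = (a - y) * x                # equation (34): no sigma'(z) factor
    return grad_quadratic, grad_cross_entropy

# A saturated neuron (z very negative) with target y = 1:
x, y = np.array([1.0, 1.0]), 1.0
w, b = np.array([-4.0, -4.0]), -2.0
gq, gce = gradients_single_neuron(x, y, w, b)
print(gq)    # almost zero: the quadratic cost learns very slowly here
print(gce)   # about -1 per component: cross-entropy keeps learning
```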


    1.5 Types of layers

We now have a general idea of how a feedforward neural network works, so we focus on some types of layers found in feedforward networks. We studied several of them, and all of them are implemented in the library we used, but there was not enough time to test them all.

    1.5.1 Convolutional layers

In the previous description we assumed that each perceptron of a layer was connected to all perceptrons of the next layer, but that is not how convolutional networks are generally built. For image recognition, the training performance and the accuracy of the network's predictions improve if we use so-called local receptive fields. The relationship between two pixels that are next to each other is clearly more informative than the relationship between two pixels on opposite sides of the image: when we recognize a pattern in an image we scan the image looking for concrete shapes in parts of it. This is what characterizes convolutional layers, where a reduced-size region of the input layer is connected to a single neuron of the following hidden layer.

Sliding the local receptive field, also called the kernel, to the right by one neuron, or by any number of neurons defined as the stride, we connect the new region obtained with the next hidden neuron, saving its activation, and do so for the whole layer.

So if, for example, we choose a region of 5x5 neurons, the activation described in equation (3) becomes

$$a^{l+1}_{jk} = \sigma\!\left( b + \sum_{m=0}^{4} \sum_{n=0}^{4} w_{m,n}\, a^l_{j+m,\,k+n} \right) \qquad (36)$$


The previous operation is called a convolution, which gives its name to the network model. Notice that the weights and biases are shared by all local receptive fields, so with this process we are checking to what degree a feature is present all across the image, and slightly modifying it during the learning process. This also greatly reduces the number of parameters compared with the fully connected network, and gives more meaningful information to each neuron in the hidden layer. We call the map from local receptive fields to a hidden layer a feature map. A convolutional layer can have many feature maps, so that different shapes can be recognized in the images; roughly speaking, each feature map tells us whether a given pattern is present in a region or not, with a real value between 0 and 1. With the kernel defined as above, the output of a convolutional layer has smaller dimensions than the input, but this is not always the case: sometimes the edges of the layer are extended (padding) so that the filtered output has the same size and shape as the input layer.
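A direct and deliberately naive NumPy implementation of equation (36) for a single 5x5 kernel with stride 1 and no padding; it is meant only to make the sliding of the local receptive field explicit, not to be efficient, and the array sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d(image, kernel, bias):
    """Slide a k x k local receptive field over the image, as in equation (36)."""
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.empty((out_h, out_w))
    for j in range(out_h):
        for i in range(out_w):
            patch = image[j:j + k, i:i + k]          # local receptive field
            out[j, i] = sigmoid(bias + np.sum(kernel * patch))
    return out

image = np.random.default_rng(0).random((24, 24))    # one 24x24 grayscale input
kernel = np.random.default_rng(1).normal(size=(5, 5))
print(conv2d(image, kernel, bias=0.1).shape)          # (20, 20): smaller than the input
```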

    1.5.2 An example

Let us say, for example, that we want a network to recognize whether an eye is present in a picture. A possible feature map with a 4-neuron local receptive field, telling whether the following features are present in the right position, would indicate a case in which an eye is present.

The previous layer would have other feature maps to recognize whether the above features are present or not,


    and so on for every lower level of abstraction until we get to the input layerwith the image data.

    1.5.3 Pooling Layers

After a convolutional layer we usually have pooling layers, which simplify the information of the previous layer. A commonly used one is the max-pooling layer, which takes the maximum value of the activation in a given region, say 2x2 neurons, of the previous layer:

$$a^{l+1}_{jk} = \max \left\{ a^l_{2j+m,\,2k+n} \right\}_{m,n \in \{0,1\}} \qquad (37)$$

The idea of max-pooling is to summarize in one layer the most relevant information of the feature maps, namely whether they appear or not in an approximate part of the image, since we do not care about the exact position of a feature when we are looking for patterns.
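Equation (37) in NumPy, for a 2x2 max-pooling over one feature map; a minimal sketch assuming the input height and width are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum activation of every non-overlapping 2x2 block (eq. 37)."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fm))
# [[ 5.  7.]
#  [13. 15.]]
```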


    1.5.4 Rectifier linear units layers

As we explained before, when using a sigmoid output for all the neurons in a layer, the state of many neurons can become saturated, due to the shape of this output function. Rectifier layers are characterized by having their neurons' activation function defined as

$$f(x) = \max(0, x) \qquad (38)$$

Using this kind of output we avoid saturation, so these layers are usually combined with convolutional layers that use the sigmoid function. A smooth approximation to the rectifier is the analytic function

$$f(x) = \ln(1 + e^x) \qquad (39)$$

which is called the softplus function. These layers have the property of accelerating the learning process, that is, of reaching a lower cost value in fewer steps.
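The two activations of this subsection side by side (equations (38) and (39)); the softplus curve approaches the rectifier away from zero. The sample inputs are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # equation (38)

def softplus(x):
    return np.log1p(np.exp(x))           # equation (39), smooth approximation

xs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(xs))       # [0.   0.   0.   0.5  3.  ]
print(softplus(xs))   # close to relu away from 0, smooth around 0
```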

    1.5.5 Local Response Normalization layers

Local response normalization layers perform a kind of lateral inhibition by normalizing over local input regions. After Krizhevsky's work [3] we know that they should improve the quality of the classification. Their activation function is given by

$$a^i_{x,y} = z^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( z^j_{x,y} \right)^2 \right)^{\beta} \qquad (40)$$

where $z^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, and $a^i_{x,y}$ is its response-normalized activity. The sum runs over $n$ adjacent kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer. Details about the other parameters can be found in [3].

    1.5.6 Softmax layers

Softmax layers transform the activations of the previous layer into a probability distribution, keeping the same information as the activations. Each neuron of the softmax layer has the following activation function:

$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}} \qquad (41)$$

The weighted inputs $z^L_j$ of the previous layer are not necessarily between 0 and 1, nor do they sum to 1 over the whole layer, so with the softmax layer we make sure to have a better representation of the probability that the image belongs to a particular class. We will actually not count softmax as a layer that helps train the network, but as a layer that makes the classification results human readable.
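Equation (41) in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result. The sample vector is illustrative.

```python
import numpy as np

def softmax(z):
    """Turn the output-layer weighted inputs z^L into a probability distribution."""
    e = np.exp(z - np.max(z))     # shift for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, -1.0, 0.5])
p = softmax(z)
print(p, p.sum())                 # probabilities that sum to 1
```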


1.6 Why are CNNs so effective on image data?

We studied convolutional networks because they have been a major advance for working with image, sound and video data. But why are they mostly used only on these kinds of data? The most plausible reason is that such data share a fundamental property: local stationarity and a multi-scale compositional structure, which allows expressing long-range interactions in terms of shorter, localized interactions. That is, if we look at video or image data closely enough there is usually a smoothness of values: in any part of an image, if we zoom in far enough we see that the color changes gradually from one value to another. This smoothness of values might be an advantage in making gradient descent training work well. By multi-scale compositional structure we mean that at different scales of observation there are always certain patterns that let us identify concepts: looking closely we find types of edges and points, looking from farther away we find different complex geometrical shapes.

    1.7 Tensorflow

To run the operations needed for training a neural network we used Google's recently launched open source deep learning library Tensorflow, a library for numerical computation using data flow graphs. It replaces previous libraries with similar purposes such as Theano or SciKit-Learn. In each graph, nodes represent mathematical operations, from simple ones like matrix multiplication and addition to more complex ones like convolution or softmax. Graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows deploying the computation to one or more CPUs or GPUs. Tensorflow already ships with some implemented convolutional networks for image classification and other prediction tasks.
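As a flavour of the graph model described above, here is a minimal sketch written against the TensorFlow 1.x-style graph API (accessed through tf.compat.v1, so it also runs on recent TensorFlow versions): nodes are operations, edges carry tensors, and nothing is computed until the graph is run in a session. The layer sizes and the random input batch are illustrative, not the project's actual model.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Graph definition: a single fully connected softmax layer on flattened 24x24x3 images.
x = tf.placeholder(tf.float32, shape=[None, 24 * 24 * 3])
w = tf.Variable(tf.zeros([24 * 24 * 3, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, w) + b)   # matmul, add and softmax are graph nodes

# Graph execution: feed a batch of fake images and fetch the class probabilities.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    probs = sess.run(y, feed_dict={x: np.random.rand(4, 24 * 24 * 3)})
    print(probs.shape)   # (4, 10)
```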

    2 Related work

    2.1 Universal approximation of functions

The way artificial neural networks have evolved until today, where they solve many classification problems with good results, was heuristic: it was not determined analytically, with deductive steps, that neural networks could properly model certain types of data such as audio and video. It can be proven analytically, however, that linear combinations of sigmoid functions can uniformly approximate any continuous function, which tells us that we could approximate any data set with neural networks; details can be found in [6]. In spite of this formalization of the function approximation capability of ANNs, it is accepted that they have a black-box nature in terms of feature extraction: the exact interpretation of the learned weights and biases is not known, although we could observe that basic shapes that might be present in images are identified as features.


    2.2 Recurrent Neural Networks

In our work we used feedforward neural networks throughout, which propagate the activations in one direction, but it is also important to remark that there are other commonly used types of ANNs, such as recurrent neural networks, in which the connections form a directed cyclic graph.

    2.3 Deep Belief Networks (DBNs)

The main condition for using CNNs is to have labeled data, which in most real cases we do not have. Sometimes we have similar kinds of problems that need to be solved in an unsupervised way, and for this we can use deep belief networks, which are the unsupervised learning version of artificial neural networks. Another inconvenience of CNNs and backpropagation is that the weights and biases can get stuck in a poor local optimum, keeping the model far from good prediction results. To overcome these limitations Smolensky [7] thought of a network that learns hidden patterns in the data. The idea is to have only one visible layer and many hidden ones, to infer the states of the hidden variables for some visible variable states, and later to be able to generate new samples of visible variables. In the case of images, we would learn the probability of some features appearing in a given image, without it being labeled.

DBNs are composed of Restricted Boltzmann Machines (RBMs). RBMs are a version of ANNs, simpler than CNNs, that learn a probability distribution over a set of inputs. In the case of image sets, the network learns a set of features given an input image dataset; this can be used to initialize the feature values of deep neural networks. RBMs only have 2 layers, a visible one with $m$ neurons (in this method also called units) and a hidden one with $n$ units, with binary boolean values. As in CNNs, there is a weight matrix $W = (w_{i,j})$ of size $m \times n$, where $w_{i,j}$ determines the weight of the connection between visible unit $v_i$ and hidden unit $h_j$, and also biases, $a_i$ for the visible units and $b_j$ for the hidden units. In RBMs we have a function that associates a scalar value called energy to each configuration of the variables:

$$E(v, h) = -\sum_{i=1}^{m} a_i v_i - \sum_{j=1}^{n} b_j h_j - \sum_{i=1}^{m} \sum_{j=1}^{n} v_i w_{i,j} h_j \qquad (42)$$

or, in matrix notation,

$$E(v, h) = -a^T v - b^T h - v^T W h \qquad (43)$$

Learning corresponds to modifying that energy function so that its shape has desirable properties: we would like plausible or desirable configurations to have low energy. We also have a probability distribution over the configurations of the network, which depends on the energy function:

$$P(v, h) = \frac{1}{Z} e^{-E(v, h)} \qquad (44)$$

where $Z = \sum_{(v,h)} e^{-E(v,h)}$ is a normalizing constant that ensures the probability distribution sums to 1; the sum is over all possible configurations of visible and hidden units. Plausible configurations should have a higher probability value, that is, an energy value as low as possible. In a similar way we define


the probability of a given visible unit vector as the normalized sum of the exponentials of the energy over all possible hidden unit configurations:

$$P(v) = \frac{1}{Z} \sum_h e^{-E(v, h)} \qquad (45)$$

Visible as well as hidden unit activations are independent within their own layer, which is why these machines are called restricted. For this reason the conditional probability of a configuration of the visible units $v$, given a configuration of the hidden units $h$, is

$$P(v \mid h) = \prod_{i=1}^{m} P(v_i \mid h) \qquad (46)$$

and, the other way around, the conditional probability of $h$ given $v$ is

$$P(h \mid v) = \prod_{j=1}^{n} P(h_j \mid v) \qquad (47)$$

The individual activation probabilities are given by

$$P(h_j = 1 \mid v) = \sigma\!\left( b_j + \sum_{i=1}^{m} w_{i,j} v_i \right) \qquad (48)$$

and

$$P(v_i = 1 \mid h) = \sigma\!\left( a_i + \sum_{j=1}^{n} w_{i,j} h_j \right) \qquad (49)$$

where $\sigma$ is the sigmoid function described in the introduction. Given a training set $V$, the idea is to maximize the product of the configuration probabilities $P(v)$ by varying the weights, that is, to find

$$\arg\max_W \prod_{v \in V} P(v) \qquad (50)$$

or, equivalently, to maximize the expected log probability of $V$:

$$\arg\max_W \; \mathbb{E}\!\left[ \sum_{v \in V} \log P(v) \right] \qquad (51)$$

To do that, the Stochastic Maximum Likelihood (SML) or Persistent Contrastive Divergence (PCD) algorithm is commonly used. It plays the role that the backpropagation algorithm plays inside gradient descent, namely finding the optimal weights. The algorithm computes a negative and a positive gradient to calculate the gradient descent step. The procedure for a sample of values of the visible layer is as follows:

Take a training sample $v$, compute the probabilities of the hidden units and sample a hidden activation vector $h$ from this probability distribution.

Compute the positive gradient, which is the outer product between $v$ and $h$.


From $h$, sample a reconstruction $v'$ of the visible units, then resample the hidden activations $h'$ from this (Gibbs sampling step).

Compute the negative gradient $v' h'^T$.

Let the weight update to $w_{i,j}$ be the positive gradient minus the negative gradient, times some learning rate: $\Delta w_{i,j} = \epsilon\,(v h^T - v' h'^T)$.

In a similar way we update the biases $a$ and $b$.

This algorithm is implemented in the scikit-learn class BernoulliRBM and, as an example, we can see the extracted features $W$ when running it with the MNIST data set as input.
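A minimal sketch of the scikit-learn BernoulliRBM mentioned above, fit on the small 8x8 digits dataset that ships with scikit-learn (the text uses MNIST; the digits set just keeps the example self-contained). The hyperparameter values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

# Scale pixel intensities to [0, 1] so they behave like (approximately) binary units.
X = load_digits().data
X = (X - X.min()) / (X.max() - X.min())

# Train an RBM with 64 hidden units; scikit-learn uses persistent contrastive divergence.
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

print(rbm.components_.shape)        # (64, 64): one learned feature (row of W) per hidden unit
hidden = rbm.transform(X[:5])       # P(h = 1 | v) for the first five images
print(hidden.shape)                 # (5, 64)
```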

Composing many RBMs together, making each hidden layer the visible layer of another RBM, we form a deep belief network, which is able to extract features of different levels of abstraction from the data. To train a deep belief network we would proceed as follows:

Given an input data sample $X$, train a restricted Boltzmann machine on $X$ to obtain its weight matrix $W$, and use it as the weight matrix between the lowest two layers of the network.

Then transform $X$ with the RBM to produce a new data sample $X'$, either by sampling or by computing the mean activation of the hidden units.

Repeat the procedure with $X \leftarrow X'$ for the next pair of layers, until the top two layers of the network are reached.

Finally, fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).


    2.4 Deep Dream

Have you ever thought about something or someone for so long that for a second you have the feeling of seeing it even when it is not there? This is another curious application of deep neural networks: the generation of images reminiscent of hallucinations. The idea in Deep Dream is to maximize the activations of certain layer features in a network that is already trained, and to mix the detected features with an input image. To do this, gradient ascent is used, which is the opposite idea of gradient descent: instead of subtracting the gradient from the weights and biases in each step, we add it, to reach a higher activation value. The corresponding figure shows the result of doing this on an image of Barcelona's skyline from Parc Guell with a layer of the Inception network that detects the presence of canines and other animals.

    3 Methodology

Our goal was to build our own CNN, taking inspiration from examples that already perform prediction effectively, and to understand their architecture. To do that we


use the Tensorflow library and take ideas from its available examples.

    3.1 The dataset

The first proposed objective was to train a convolutional network to classify a dataset of food pictures, extracted from Instagram and Google Images, into the following 10 classes:

    Beer (0)

    Burger (1)

    Coffee (2)

    Croissant (3)

    Fried Eggs (4)

    Other (5)

    Paella (6)

    Pizza (7)

    Sushi (8)

    Wine (9)

    Figure 8: A sample of our initial data set

The aim of the class "Other" is to make the model able to tell when a picture does not belong to any of the food classes. The dataset is composed of both Instagram photos and web images. The Instagram photos were obtained from the Instagram API, filtered by user-defined tags and manually purged; as user-defined tags are very noisy, this method proved to be inefficient and very time-consuming. In order to facilitate the generation of more ground-truth annotations and a larger training dataset, we also obtained images from Google Images through the Custom Google Search API. This method, which allowed us to automatically annotate a bigger set of images, turned out to be very useful, as almost all of the retrieved images showed the desired food category and minimal manual purging was required.


The first model that we are going to build is a single-layer neural network. The images of the dataset have no specific size or format; we store them in a .bin file with records holding 32x32 pixels with 3 RGB channels. Then, to feed the network, we randomly crop them into 24x24 pixel images, which also expands the size of the data set.
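The record layout described here and in section 3.4 (one label byte followed by width x height x channels intensity bytes, in the style of the CIFAR-10 binary format) can be read back with a few lines of NumPy. This is only a sketch of the idea, not the project's actual folder2bin.py or ManuNet_input.py code; the record sizes come from the text, while the channel ordering and the file name in the commented usage are assumptions.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 32, 32, 3
RECORD_BYTES = 1 + HEIGHT * WIDTH * CHANNELS   # label byte + pixel intensities

def read_bin_records(path):
    """Return (labels, images) from a fixed-length-record .bin file."""
    raw = np.fromfile(path, dtype=np.uint8)
    records = raw.reshape(-1, RECORD_BYTES)
    labels = records[:, 0]
    # Assumed CIFAR-10-style channel-major layout; transpose to height x width x channels.
    images = records[:, 1:].reshape(-1, CHANNELS, HEIGHT, WIDTH).transpose(0, 2, 3, 1)
    return labels, images

def random_crop(image, size=24):
    """Random 24x24 crop used to enlarge the data set, as described above."""
    top = np.random.randint(0, image.shape[0] - size + 1)
    left = np.random.randint(0, image.shape[1] - size + 1)
    return image[top:top + size, left:left + size, :]

# labels, images = read_bin_records("train.bin")   # hypothetical file name
# crop = random_crop(images[0])
```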

    3.2 Data augmentation

    3.3 Hardware setup

The models were trained on a high-end server with a quad-core Intel i7-3820 at 3.6 GHz with 64 GB of DDR3 RAM, and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (containing two PCIe switches). The machine runs a GNU/Linux system, with Linux kernel 3.12 and NVIDIA driver 340.24. We performed experiments with different configurations (downscaling sizes, data augmentation, different number and composition of layers, different layer geometries, etc.). We also tested aspects with no impact on the classification accuracy but with practical implications, such as different input formats (TFRecords, compressed numpy arrays, etc.) or different hardware configurations (one or more CPUs and GPUs, etc.). Our runs include an extensive set of configurations; for brevity, when parameters were shown to be either irrelevant or to have a negligible effect, we use default values. Each experimental configuration was repeated at least 5 times. Unless otherwise stated, we report median values in seconds.

    3.4 Implemented and customized programs

    Our program contains the following parts:

Data format transformer - folder2bin.py: This script reads images located in a folder containing as many subfolders as classes of pictures (10) and returns a .bin file with all the image data stored in records of length n = picture width x picture height x number of color channels + 1 bytes, where the first byte is the class label of the picture and the rest of the bytes are the pixel intensities. We also implemented the opposite step in bin2folder.py.

Input reader - ManuNet input.py: This script contains functions to read .bin data files using a queue of image examples and to return tensors of a given batch size, containing image arrays and, separately, labels. It also contains a function to distort images (randomly crop, flip and whiten them) to enlarge the data set.

Model - ManuNet.py: The network implementation of ManuNet, a customized version of the CIFAR10 network [2]. This program allows us to extract our data set from our github repository (www.github.com/manucarbonell/datasets) and then build a convolutional network with different architectures to perform experiments. Depending on the value of mode we build a network with:

1 fully connected sigmoid layer


1 convolutional layer followed by pooling

2 convolutional layers with pooling and normalization

2 convolutional layers with pooling and normalization followed by 2 fully connected layers

It contains a function to save summaries of the cost (the value of the cost function) during the training process, which can be inspected later with TensorBoard, a platform to visualize the learning interactively. In the end we did not use TensorBoard, since we preferred to generate our own graphs and training and evaluation outputs. It also contains the function, called during the training process, that builds the model and performs the backpropagation algorithm with exponential learning rate decay.

As we said before, in each step of the backpropagation algorithm we update the weights and biases in the direction of the cost function gradient multiplied by a scalar, the learning rate. If we used a constant learning rate, at some point we would stop getting closer to the cost function minimum, as we would keep overshooting it, in the same way that it would be difficult to get the ball into the hole using only the driver when playing golf. So after a given number of epochs NUM_EPOCHS_PER_DECAY we decay the learning rate exponentially using the factor LEARNING_RATE_DECAY_FACTOR (a small sketch of this schedule is shown after this list). In the function train() the training step is defined, calculating the loss and applying the computed gradient. The different kinds of layer ops used are described in the Tensorflow library documentation: https://www.tensorflow.org/versions/r0.8/api_docs/python/nn.html

Network training - ManuNet train.py: This script calls the input reader, builds the network graph, calls the training step from the model program and iterates the process, saving the results to a file. The graph is saved in a .ckpt file so it can be read when evaluating the model. During the training, the loss value, the step count and the execution time are saved.

Train network and track precision - ManuNet train eval.py: A modified version of the previous program that saves the prediction precision using both train and test data separately. Model inferences are grouped in a scope to allow reusing variables.

Evaluate model - ManuNet eval.py: Returns the precision of our model's inference over test data.

Evaluation sample - ManuNet eval sample.py: Performs model inference over the desired number of batches and saves the images with their predicted and correct labels.

Generate confusion matrix - ManuNet eval byclasses.py: This program performs inference of the images' labels over the desired number of batches and returns a matrix where each entry a_ij is the portion of examples that were labeled as class i and predicted as class j; this way the values on the diagonal a_ii give the precision for class i.
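The exponential decay schedule mentioned in the Model item above can be written in a few lines; the sketch below reproduces the staircase behaviour of TensorFlow's tf.train.exponential_decay, with illustrative values for the constants (the project's actual values are not stated in the text).

```python
INITIAL_LEARNING_RATE = 0.1          # illustrative values, not the project's
LEARNING_RATE_DECAY_FACTOR = 0.1
NUM_EPOCHS_PER_DECAY = 350
BATCHES_PER_EPOCH = 1000

def learning_rate(global_step):
    """Decay the learning rate exponentially every NUM_EPOCHS_PER_DECAY epochs."""
    decay_steps = NUM_EPOCHS_PER_DECAY * BATCHES_PER_EPOCH
    num_decays = global_step // decay_steps          # staircase: decay in discrete jumps
    return INITIAL_LEARNING_RATE * LEARNING_RATE_DECAY_FACTOR ** num_decays

for step in (0, 200_000, 400_000, 800_000):
    print(step, learning_rate(step))   # 0.1, 0.1, 0.01, 0.001
```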


Extract features - ManuNet get features.py: Performs inference and saves the features extracted by the convolutional layer kernels as images.

    4 Results

    4.1 Single softmax layer network

The first and simplest possible approach we took was to train a network with a single softmax layer, with a single matrix product of the image data with the weights plus the addition of biases.

    Figure 9: 1 softmax layer ANN

Since our dataset is quite noisy and not very large, without extracting features at different levels of abstraction this first result is not going to be very good; if the model learns anything at all, it can already be considered an achievement.

After running the experiment we can see the results of stochastic gradient descent with a single-layer network in Figure 10. Clearly we are not getting close to a minimum of the cost function, since after some steps the cost is barely decreasing.

When checking the precision of the predictions, feeding the network with 600 test images and dividing the correct classifications by the total number of classifications, we get a disastrous score of 44% (44 out of 100 images are classified correctly).

    Figure 10: Training loss in each step for a network with no hidden layers.

So with only a softmax layer, receiving the product of the weights with the input layer plus biases, the model does learn something, since a random classification would


get around 10% precision, but it does not get much better than that. Given the evolution of the loss value, we leave this "hello world" experiment, do not spend time on experiments with more steps, and move on to the layers that give their name to the networks we are studying. So let us see how the classification improves with a convolutional and, later, a pooling layer.

    4.2 Convolution

After the simplest approach of a network with no hidden layers, we see how the model does with a single hidden convolutional layer.

Figure 11: Simple convolutional network structure

The layer is going to extract 64 kernels of 5x5 neurons, with a stride of a single neuron in each direction and a padding that makes the output of the convolutional layer keep the same size as the input layer. We can think of the convolutional layer as a prism, since it extracts the presence information of 64 features for every part of the image; we save this information in a 3-d array (24x24x64).

The loss value continues decreasing over a longer training (in the simplest model it almost stopped decreasing within the first 10,000 steps) and ends up with an average value of 0.3. Training time was 2 hours 56 minutes to complete 50,000 steps, which we chose as a training length sufficient for the loss to stabilize. The predictions, as expected, jump in precision to 85.5% accuracy, in contrast with the previous model's result.

    Figure 12: Loss function values during training, in steps and time

    4.3 Convolution and pooling

    Now we check the effect of adding a pooling layer after the convolutional layer.

Figure 13: Convolutional network with 1 convolutional layer and pooling

This time the cost decreases considerably faster and gets close to a value of 0.16 after 50,000 training steps, in contrast with the 0.3 of the previous model. In terms of precision we get up to an 88% score, which is actually not bad taking into account that


the images are not as simple as, for example, handwritten digits, and our network has only 2 layers. If we randomly choose a sample of the classifications we see that indeed more than 8 out of 10 images are classified in the right group. The real label of each picture is shown after "L:" and the prediction computed by the network after "P:".

[Figure: sample of classified test images with labels and predictions: L: 0, P: 0; L: 1, P: 1; L: 2, P: 2; L: 3, P: 3; L: 4, P: 8; L: 5, P: 5; L: 6, P: 6; L: 7, P: 6; L: 8, P: 8; L: 9, P: 9]

We can also see the precision results in the confusion matrix, which tells us, for each class, which portion of the predictions were correct and which fell into wrong classes; rows are real class labels and columns are class predictions. For example, we can observe that 15% of the fried eggs were classified as sushi. That might be caused by the similarity of colors and shapes (white ovals surrounded by black are present in both classes). So maybe we should raise the number of layers and features detected in our network, or change other parameters, to let the network tell the difference between those classes. Apart from this, there is no other notable confusion (greater than 10%) between classes.

Table 1: Confusion matrix (rows: real class, columns: predicted class)

       0     1     2     3     4     5     6     7     8     9
 0  0.87  0.02  0.00  0.02  0.00  0.08  0.02  0.00  0.00  0.00
 1  0.00  0.85  0.02  0.04  0.00  0.04  0.00  0.02  0.02  0.00
 2  0.08  0.00  0.77  0.00  0.00  0.11  0.02  0.00  0.03  0.00
 3  0.02  0.02  0.00  0.78  0.06  0.04  0.00  0.06  0.02  0.00
 4  0.02  0.00  0.02  0.05  0.72  0.02  0.00  0.02  0.15  0.00
 5  0.01  0.01  0.00  0.00  0.00  0.94  0.01  0.01  0.01  0.00
 6  0.00  0.02  0.00  0.03  0.02  0.00  0.90  0.02  0.01  0.00
 7  0.00  0.02  0.00  0.02  0.02  0.00  0.05  0.85  0.05  0.00
 8  0.01  0.00  0.01  0.01  0.01  0.06  0.01  0.04  0.83  0.00
 9  0.00  0.00  0.04  0.00  0.00  0.10  0.00  0.00  0.00  0.86


    Figure 15: Training cost depending on time and steps

So, looking at the classifications, we could say that with one convolutional and pooling layer the network can already classify images with a clear color and shape pattern, but has difficulties with classes that contain many colors and complicated shapes; clearly the idea of convolution is the main key to the learning process of the network. In terms of time, it took us 2 hours and 24 minutes to train the network using the GPU cluster.

It is worth remarking that in this model we are estimating a function with a total of 32x32+5x5x64=2624 parameters, so we can imagine the complexity of the computation; this is what we meant by the differences between classical statistics and machine learning.

    4.4 Two convolutional and pooling layers

Now that we have seen that the main improvement comes from adding a convolutional layer, let us see whether we get another big improvement after adding a second convolutional layer.

    Figure 16: Convolution network with 2 convolution and pooling layers

Again we will extract 64 features, which turned out to be a good amount in our reference work [3]. Comparing the cost during training with that of the network with a single convolutional layer, we can see that this time the learning curve is not as steep at the beginning: it decreases at a more continuous rhythm during the whole training, and with fewer oscillations. So we could say that adding a convolutional layer gives us a more stable training process. In terms of precision, in the first 10,000 steps we get almost the same value (0.87). If we run the training process with the GPU version of Tensorflow we can raise the number of steps to 50,000 to see the difference between 1 and 2 convolutional layers.


    Figure 17: Training cost depending on time and steps

Here we have the confusion matrix results when training a network with 2 convolutional layers.

    Table 2: Confusion matrix for network with 2 convolution and normalization layers

         0     1     2     3     4     5     6     7     8     9
    0  0.86  0.00  0.02  0.00  0.00  0.08  0.02  0.00  0.00  0.02
    1  0.00  0.79  0.00  0.04  0.00  0.05  0.02  0.04  0.05  0.00
    2  0.10  0.02  0.70  0.03  0.03  0.07  0.00  0.00  0.00  0.05
    3  0.02  0.04  0.00  0.82  0.04  0.00  0.02  0.04  0.02  0.00
    4  0.02  0.00  0.00  0.04  0.85  0.02  0.00  0.02  0.04  0.00
    5  0.01  0.01  0.00  0.00  0.00  0.93  0.02  0.00  0.01  0.00
    6  0.00  0.00  0.00  0.05  0.02  0.00  0.91  0.00  0.02  0.00
    7  0.02  0.03  0.00  0.01  0.00  0.00  0.06  0.85  0.03  0.00
    8  0.00  0.00  0.00  0.02  0.00  0.02  0.00  0.01  0.94  0.00
    9  0.00  0.00  0.00  0.00  0.00  0.09  0.05  0.00  0.00  0.86

It takes 2 hours 51 minutes to train the network with two convolutional + pooling layers, but the precision results are almost the same. There are some changes in the confusion, but the general precision score is still 88%, which makes us wonder why many widely used models repeat convolutional layers to improve their classifications. It is possible that our dataset is not big enough for the difference to be noticeable when adding repeated layers. Another parameter we took from previous works is the cropping size, which was 32x32 for all images, but maybe using such small versions of the pictures prevents our network from recognizing features that need more resolution.

    4.5 Image size augmentation

So the next approach will be to train the network with two convolutional, pooling and normalization layers on a higher resolution version of the same dataset. We choose a size that does not make the computation too slow but whose difference in resolution is quite noticeable: 48x48 after the cropping step. With this size, the precision of the predictions rises to 89.99%, which is quite a significant improvement. So we can state that higher resolution with more convolutional layers gives better classification results. It also takes a long computation time to run these experiments: with the last one (two convolutional layers, 48x48 pixels) it took 8 hours 21 minutes to complete the 50,000 steps of training.


The confusion matrix shows that some class predictions are now highly accurate, but for others the network still lacks a complete understanding of the patterns.

         0     1     2     3     4     5     6     7     8     9
    0  0.83  0.00  0.04  0.02  0.02  0.07  0.02  0.00  0.00  0.00
    1  0.00  0.82  0.00  0.06  0.00  0.04  0.00  0.04  0.04  0.00
    2  0.02  0.02  0.88  0.00  0.00  0.08  0.00  0.00  0.00  0.00
    3  0.00  0.00  0.02  0.83  0.04  0.00  0.04  0.02  0.04  0.00
    4  0.00  0.00  0.05  0.02  0.81  0.02  0.00  0.02  0.07  0.00
    5  0.01  0.00  0.00  0.00  0.00  0.96  0.01  0.00  0.01  0.00
    6  0.00  0.02  0.02  0.02  0.02  0.02  0.86  0.05  0.00  0.00
    7  0.02  0.03  0.00  0.02  0.00  0.02  0.00  0.90  0.02  0.00
    8  0.00  0.00  0.01  0.01  0.00  0.00  0.01  0.00  0.96  0.00
    9  0.00  0.00  0.04  0.00  0.00  0.09  0.00  0.00  0.00  0.87

Table 3: Confusion matrix of the model with 2 convolution layers trained with 48x48 images

    4.6 Overfitting

The plotted loss values are calculated with training data, but it would be good to see how the classification precision evolves during the training process, to know whether the network is actually learning the concepts behind the data or only memorizing the training data set.

Figure 18: Precision of classifications with the training data set in green and the test data set in red

To do that we take a look at the evolution of the precision with training and test data. In figure 18 we can see that there is a point where the precision of the classifications stops improving; that means that our network is overfitting, or that some of the neurons are saturated. To avoid this we can try adding a normalization layer.
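One simple way to act on this observation is early stopping: evaluate the precision on both sets every few thousand steps and stop once the test precision has stopped improving. The loop below is only a sketch; train_step() and evaluate() are hypothetical helpers standing in for the actual training and evaluation code.

    def train_with_early_stopping(train_step, evaluate, max_steps=50000,
                                  eval_every=1000, patience=5):
        """Stop once test precision has not improved for `patience` evaluations."""
        best_test, evals_since_best = 0.0, 0
        for step in range(1, max_steps + 1):
            train_step()
            if step % eval_every == 0:
                train_prec = evaluate("train")
                test_prec = evaluate("test")
                print(step, "train:", train_prec, "test:", test_prec)
                if test_prec > best_test:
                    best_test, evals_since_best = test_prec, 0
                else:
                    evals_since_best += 1
                    if evals_since_best >= patience:
                        print("test precision stopped improving; stopping at step", step)
                        break
        return best_test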


    4.7 Data amount augmentation

In all those training procedures we used a routine implemented in the CIFAR-10 model which enlarges the data set by randomly cropping a part of each image to get a new one. First we cropped 32x32 images into 24x24, and then 64x64 into 24x24. We can see this has a positive effect, because if we train the network with the 64x64 dataset without cropping we drop again to 87% precision, so distorting the original images to obtain more examples did indeed help.
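The cropping distortion itself is a one-liner with Tensorflow's image ops. This is a minimal sketch, assuming 64x64 source images cropped to 24x24 as quoted above; the placeholder setup is only illustrative.

    import tensorflow as tf

    # Each time an image is read, a random 24x24 window is taken out of the
    # larger picture, so the network sees a slightly different example on
    # every pass over the data.
    image_64 = tf.placeholder(tf.float32, [64, 64, 3])
    distorted = tf.random_crop(image_64, [24, 24, 3])  # shape [24, 24, 3]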

    4.8 Batch size

Initially we set a commonly used batch size, 128 examples per batch, but we wanted to see how this affects training time and classification precision. We compare the training evolution and results for the last network model, which consisted of two convolutional layers each followed by pooling and normalization. In figure 19 we see that the loss takes slightly less oscillating and lower values from step 30,000 on. We can see it more clearly in a close-up of the last 1,000 steps of training in figure 20.

    Figure 19: Training loss with batch size of 128 examples in blue and 64 in red.


Figure 20: Training loss on the last 1,000 steps for batch sizes of 128 examples in blue and 64 in red.

The big difference is in training time, which is roughly halved with the smaller batch size; we can see the comparison in minutes in figure 21.

    Figure 21: Training loss depending on time for batch sizes 128 in blue and 64 in red.

The precision of the predictions stays at 89% in both cases, so the conclusion is that for our data set it is clearly better to set a smaller batch size such as 64 examples per batch. In future work we could try other batch sizes, but this one seems to produce a good enough result.
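For completeness, the batch size is just a parameter of the input queue that groups single examples into batches. A sketch along the lines of the CIFAR-10 queue-based input pipeline is shown below; the function name and queue capacities are illustrative assumptions.

    import tensorflow as tf

    def make_batch(image, label, batch_size=64):
        """Group single (image, label) examples into shuffled batches of `batch_size`."""
        images, labels = tf.train.shuffle_batch(
            [image, label], batch_size=batch_size,
            capacity=3 * batch_size, min_after_dequeue=batch_size)
        return images, labels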

    4.9 Extracted features

Here are some features extracted by our network, in this case the 64 kernels of the first convolutional layer (5x5 weights). As often happens with CNNs, we cannot know why a particular shape or kind of feature is learned during the process, but intuitively it looks like in some of them the network learns basic shapes that help recognize the boundaries of the objects in the images.

Figure 22: Extracted kernels in the first convolutional layer, in a model with 2 convolutional layers.

For each of the previous kernels, we get a tensor with the shape of the image containing the activations for that kernel, as explained in section 1.5.1, since we use padding to keep the output the same shape as the input.
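The kernels and their activation maps can be pulled out of the graph and inspected with a few lines like the following. This is an illustrative, self-contained sketch (random weights, a zero image); in practice the trained conv1 weights are fetched from the saved model instead.

    import numpy as np
    import tensorflow as tf

    # A single 5x5 convolution with SAME padding: the activation maps keep the
    # 32x32 spatial shape of the input, one map per kernel.
    images = tf.placeholder(tf.float32, [None, 32, 32, 3])
    kernels = tf.Variable(tf.truncated_normal([5, 5, 3, 64], stddev=0.1))
    activations = tf.nn.conv2d(images, kernels, strides=[1, 1, 1, 1], padding="SAME")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        k, a = sess.run([kernels, activations],
                        feed_dict={images: np.zeros((1, 32, 32, 3), dtype=np.float32)})
        print(k.shape)  # (5, 5, 3, 64): 64 kernels of 5x5 weights per color channel
        print(a.shape)  # (1, 32, 32, 64): same spatial shape as the input (SAME padding)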


Figure 23: Output of the first convolution layer for the model of section 4.4

For example, in figure 23 we can see the output of some images after convolution with the first extracted feature, for each color channel.

4.10 Retrain ImageNet Model

One may ask: is it normal to need to see 1000 images of an elephant before being able to recognize one the next time? Maybe in the very strange case in which you had just been given your sight and an elephant is the first thing you ever saw; otherwise it should not be necessary. Apparently the same happens with CNN learning: once the network has learned many visual concepts, it becomes easier every time to learn new ones. So, after seeing some results with a self-trained CNN, we move to this approach, which is to retrain a large ImageNet model to recognize the pictures of our data set. This technique is called transfer learning or convolutional network fine tuning. After learning about the work of Donahue et al. [5], and the option to load an Inception neural network with Tensorflow to use the learned features on your own data set, we checked whether this is a better approach than training a net only with our own data.

    4.10.1 Bottlenecks

The idea consists of loading the graph of an ImageNet Inception network which is already trained (concretely, for 1000 classes) and, using the learned features, performing a training over only the new classes to recognize, avoiding this way a long training process. To do that we have to adapt the last layer of the trained graph, before the softmax, to the newly added classes. Doing this with the ImageNet Inception v3 model provided by Tensorflow, we achieve a precision of 91.2% in only 17 minutes. So we can definitely state that knowing the patterns of many image classes greatly helps to learn new classes faster and better, as happens with biological neural networks.
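The new layer itself is small. Assuming the 2048-dimensional bottleneck vectors have already been computed with the frozen Inception graph (the feature extraction step is omitted here), retraining reduces to fitting a single softmax layer on top of them, sketched below; the class count and learning rate are illustrative assumptions.

    import tensorflow as tf

    BOTTLENECK_SIZE = 2048  # size of Inception v3's penultimate-layer output
    NUM_CLASSES = 10        # the new classes to recognize (assumed here)

    bottlenecks = tf.placeholder(tf.float32, [None, BOTTLENECK_SIZE])
    labels = tf.placeholder(tf.int64, [None])

    # Only these weights are trained; the pretrained Inception graph stays frozen.
    weights = tf.Variable(tf.truncated_normal([BOTTLENECK_SIZE, NUM_CLASSES], stddev=0.001))
    biases = tf.Variable(tf.zeros([NUM_CLASSES]))
    logits = tf.matmul(bottlenecks, weights) + biases

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), labels), tf.float32))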

    5 Conclusions and future work

After several unexpected accuracy values and training times, we can say that although some insights about convolutional neural networks are still not clear, they do work quite well for image classification. For example, we cannot state that deeper networks are always going to give better classification results, or that higher image resolution always will either; it is like this sometimes but not always, as we saw in the results. Also, after looking for documentation on this aspect, we saw that many of the parameters such as batch size or learning rate used in state-of-the-art networks are determined as we proceeded, by trial and error, so it remains an open question why a specific number of layers, and of which depth, works better. We also observed that, as expected, using a previously trained net that already classifies 1000 classes gives much better accuracy for our dataset than the networks trained from scratch only with our dataset, while also consuming much less time. So, now that object classification and recognition in images with CNNs is close to being solved, we would continue by exploring other applications such as online user behavior prediction or improving speech recognition, or by looking into which other research areas this machine learning technique can also be useful for.


http://image-net.org/challenges/LSVRC/2014/browse-synsets

    References

[1] McCulloch, Warren; Pitts, Walter (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics.

[2] Krizhevsky, Alex (2009). Learning Multiple Layers of Features from Tiny Images.

[3] Krizhevsky, A.; Sutskever, I.; Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks.

[4] Werbos, P. J. (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.

[5] Donahue, Jeff; Jia, Yangqing; Vinyals, Oriol; Hoffman, Judy; Zhang, Ning; Tzeng, Eric; Darrell, Trevor (2013). A Deep Convolutional Activation Feature for Generic Visual Recognition.

[6] Cybenko, George (1989). Approximation by Superpositions of a Sigmoidal Function.

[7] Smolensky, Paul (1986). Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory.
