
U.U.D.M. Project Report 2017:38

Degree project in mathematics, 30 credits
Supervisor: Kaj Nyström
Examiner: Erik Ekström
October 2017

Department of Mathematics, Uppsala University

Deep neural networks and fraud detection

Yifei Lu

Page 2: Deep neural networks and fraud detection1150344/...The purpose of this thesis is to detect credit card fraud transactions by applying deep neural networks. Applying back-propagation
Page 3: Deep neural networks and fraud detection1150344/...The purpose of this thesis is to detect credit card fraud transactions by applying deep neural networks. Applying back-propagation

Deep neural networks and fraud

detection

Yifei Lu

Page 4: Deep neural networks and fraud detection1150344/...The purpose of this thesis is to detect credit card fraud transactions by applying deep neural networks. Applying back-propagation

Contents

1 Introduction
2 Deep neural network
   2.1 The artificial neuron
   2.2 Activation Function
   2.3 Single-layer feedforward network
   2.4 Perceptron
   2.5 Deep neural network
3 Estimation of deep neural network
4 Implementation
   4.1 Deep learning software libraries
   4.2 The TensorFlow programming model
      4.2.1 Computational graph structure
      4.2.2 Execution model
      4.2.3 Optimizations
      4.2.4 Gradient Computation
      4.2.5 Programming interface
      4.2.6 TensorBoard
   4.3 Comparison with other deep learning libraries
5 Experiments
   5.1 Logistic regression model
   5.2 Neural network using Scikit-learn
   5.3 TensorFlow implementation
6 Conclusion
A Scikit-learn code
B TensorFlow code


1 Introduction

Machine learning is a subfield of artificial intelligence, in which human-like intelligence is exhibited by machines. Deep learning, a subset of machine learning, is currently one of the most popular research topics. It employs artificial neural networks (ANNs), information-processing machines modeled on the structure and action of biological neural networks in the brain [1][2]. ANNs are flexible and self-adaptive and can solve complex problems that are difficult to describe with a mathematical model, such as pattern recognition and classification, function approximation and control [3]. In recent years the increasing interest in deep neural networks (DNNs), which employ many layers, has heightened the need for ANNs in both industry and academia. Deep neural networks learn from data to approximate arbitrary nonlinear relations between the input information and the final output. A well-trained deep neural network has the ability to capture abstract features over the entire data set.

The purpose of this thesis is to detect fraudulent credit card transactions by applying deep neural networks. Using the back-propagation algorithm to find optimal parameters, the network is trained until it is stable and optimal, so that an appropriate model can be found to detect whether a given transaction is normal or fraudulent. The problem of detecting fraudulent transactions can thus be regarded as a classification problem. The optimal network for performing classification on the credit card data set is explored and implemented in two open source machine learning libraries, namely TensorFlow, released by Google, and Scikit-learn.

This thesis is organized as follows: the second chapter presents the theory of the neural network's architecture, from the basic processing unit to the deep neural network with many layers. The third chapter introduces the theory of training the network and related concepts. The fourth chapter presents the implementation tool TensorFlow, from what TensorFlow is to how it works; in addition, we compare TensorFlow with other open source machine learning libraries. The fifth chapter describes the experiments conducted on the data set, implemented in the two machine learning libraries, as well as the experimental results.

2 Deep neural network

The mechanism of the artificial neuron is modeled on the neuron in the brain[4]. The neuron is the basic element for processing information in the brain. The brain consists of about 10 billion neurons that are interconnected to form a network. A biological neuron receives signals from other neurons and processes this information[5][6]. If the processed input signal exceeds a threshold value, the neuron fires and produces an electric pulse to transfer the signal to other neurons; if the input signal is below the threshold value, the neuron does not fire and no output signal is produced[7].

The remainder of this chapter, based on the book by Simon Haykin (2009)[4], presents the structure of artificial neurons and the building blocks of neural networks. We then introduce the theory of different topologies of neural networks, starting with single-layer neural networks and then the perceptron. The chapter ends with the theory of the deep neural network.

2.1 The artificial neuron

The topology of an artificial neural network is determined by the manner in which the basic processing units, called neurons, are interconnected. An artificial neuron is the basic computational unit, which performs a nonlinear function on the input signal[4]. Figure 1 shows the model of a neuron. The neuron is fed by input signals x_1, x_2, ..., x_m through a set of modifiable weights that can either strengthen or dampen the specific input. A signal x_j at the input of link j connecting to neuron k is multiplied by the weight w_kj. The input-weight products are summed to form the net input of the activation function, which produces an output restricted to some finite range. In addition, a bias denoted by b_k is added to the neuron as a component of the net input of the activation function. The bias plays an important role when designing the network: it allows a parallel shift of the activation function, which makes it possible to solve a wider range of problems.

Figure 1. Nonlinear model of a neuron k. Source: Simon Haykin (2009, p. 41)

The mathematical expression for the model of a neuron k is given by

u_k = ∑_{j=1}^{m} w_{kj} x_j    (1)

and

y_k = ϕ(u_k + b_k)    (2)

where w_k1, w_k2, ..., w_km are the connecting weights of neuron k; u_k is the summation of the input-weight products; b_k is the bias; ϕ(·) is the activation function; and y_k is the output signal of neuron k.

The bias b_k can also be regarded as a modifiable weight equal to w_k0 with a fixed input of x_0 = +1. Thus Eq. (1) can be written as follows:

v_k = ∑_{j=0}^{m} w_{kj} x_j    (3)


then

y_k = ϕ(v_k)    (4)
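As an illustration (not part of the original derivation), the computation of Eqs. (1)-(2) for a single neuron can be sketched in Python with NumPy; the weight and input values below are arbitrary example numbers, and the hyperbolic tangent is used as an example activation.

import numpy as np

def neuron_output(x, w, b, phi=np.tanh):
    # net input v_k = sum_j w_kj * x_j + b_k, as in Eqs. (1)-(2)
    v = np.dot(w, x) + b
    # output y_k = phi(v_k), Eq. (4)
    return phi(v)

x = np.array([0.1, 0.5])   # input signals x_1, ..., x_m
w = np.array([0.2, -0.3])  # weights w_k1, ..., w_km (arbitrary values)
b = 0.1                    # bias b_k
print(neuron_output(x, w, b))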

2.2 Activation Function

The activation function performs a nonlinear transformation on the input signal in order to control the activation passed to the output; hence an infinite range of inputs can be transferred into a specific range of output values[8]. The most common activation functions are presented below:

1. Threshold Function    A threshold function describing the output of neuron k is given by

y_k = 1, if v_k ≥ 0
y_k = 0, if v_k < 0

where v_k is the net input of neuron k:

v_k = ∑_{j=1}^{m} w_{kj} x_j + b_k

For the threshold function, shown in Fig. 2, the output of the neuron takes the value 1 if the neuron's net input is nonnegative and 0 otherwise.

Figure 2. Threshold function. Source: en.wikibooks.org

2. Sigmoid Function    The sigmoid function is a commonly used activation function with an "S" shape. It is a strictly monotonically increasing function that shows a good balance between linear and nonlinear behaviour[4]. One example of a sigmoid function is the logistic function, shown in Fig. 3 and defined by

f(v) = 1 / (1 + e^(−av))    (5)

where a is the slope parameter of the sigmoid function; the slope at the origin equals a/4. The logistic sigmoid function restricts the output to a continuous range from 0 to 1, which can represent the probability in a binary classification. It is an acceptable mathematical representation of a biological neuron model, showing whether a neuron fires or not[8]. As a approaches infinity, the sigmoid function becomes a threshold function. Moreover, the sigmoid function is differentiable, which is an essential property for the theory of training the neural network[4]. One fatal flaw of the sigmoid function, however, is that it saturates and "kills" the gradient: when the output is near 0 or 1 (saturation), the gradient of the neuron approaches zero, which makes the network hard to train. The details are given in the next chapter.

Figure 3. Sigmoid function. Source: http://www.pirobot.org

3. Hyperbolic tangent function    The hyperbolic tangent is similar to the logistic sigmoid function except that its output ranges from -1 to 1. As shown in Fig. 4, it is bounded and differentiable. The mathematical definition of the hyperbolic tangent function is given by

f(v) = tanh(v) = (e^v − e^(−v)) / (e^v + e^(−v)) = (e^(2v) − 1) / (e^(2v) + 1)    (6)

Figure 4. Hyperbolic tangent activation function. Source: http://computational.neutrino.xyz/

4. Rectified Linear Units (ReLU)    Rectified linear units [9] have become a popular choice for training deep neural networks in recent years[10]. The rectifier is linear when the input is positive and zero otherwise, defined as

f(v) = max(0, v)

where v is the input to a neuron. Compared with the sigmoid activation function, ReLU speeds up the training procedure due to its non-saturating nonlinearity[11]. Furthermore, ReLU can be implemented by simply thresholding a matrix of activations at zero, whereas the sigmoid function requires more expensive computations.¹

A smooth approximation to the rectifier is the softplus function, defined by f(v) = ln(1 + e^v).

Figure 5. Rectified Linear Units (ReLU) activation function. Source: https://en.wikipedia.org
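For concreteness, the activation functions above can be written as short NumPy functions; this sketch is illustrative and not taken from the thesis code.

import numpy as np

def threshold(v):
    # 1 if v >= 0, else 0
    return np.where(v >= 0.0, 1.0, 0.0)

def sigmoid(v, a=1.0):
    # logistic function of Eq. (5) with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * v))

def tanh(v):
    # hyperbolic tangent of Eq. (6)
    return np.tanh(v)

def relu(v):
    # rectifier f(v) = max(0, v)
    return np.maximum(0.0, v)

def softplus(v):
    # smooth approximation to the rectifier
    return np.log(1.0 + np.exp(v))

v = np.linspace(-3.0, 3.0, 7)
print(sigmoid(v), relu(v))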

2.3 Single-layer feedforward network

According to the topological structure of the connections between neurons, neural networks can be divided into different types. One type is the feedforward network, in which data enter at the input layer and are transferred in one direction towards the output layer, with no cycles between neurons. The single-layer neural network is the simplest feedforward network; it consists of a single output layer, meaning data go from the input layer directly to the output layer through the weights. Figure 6 shows a single-layer feedforward network.

Figure 6. Single-layer feedforward network. Source: http://lib.csdn.net

¹ CS231n Convolutional Neural Networks for Visual Recognition, http://cs231n.stanford.edu


2.4 Perceptron

The perceptron, proposed by Rosenblatt (1958), is the simplest neural network for linearly separable classification. The perceptron is a single neuron with adjustable weights and bias. Rosenblatt proved that the perceptron learning algorithm converges and finds a hyperplane as a decision boundary separating training examples that belong to two linearly separable classes[4]. This result is known as the perceptron convergence theorem. By including more than one neuron in the output layer, the perceptron can classify more classes, provided they are linearly separable.

The structure of the perceptron is shown in Figure 6. The neuron is connected to a set of inputs by corresponding adjustable weights. The summed input-weight products are taken as the input of a threshold activation function. The bias adds one more free parameter, which makes it easier for the network output to reach the desired target. The output of the perceptron is classified according to the net input: if the net input v_k = ∑_{i=1}^{m} w_i x_i + b is positive the neuron produces an output of +1, and −1 if it is negative. Therefore the classification decision can be represented by a hyperplane, defined by

∑_{i=1}^{m} w_i x_i + b = 0    (7)

When the parameters of the network are determined, the set of points satisfying ∑_{i=1}^{m} w_i x_i + b = 0 can be drawn as a classification boundary in the space spanned by the input variables x_1, x_2, ..., x_m. Any given input vector, mapped through the weights and bias, lies either above or below the hyperplane and is classified accordingly. Figure 7 shows a two-dimensional input space in which the hyperplane is a straight line. The points above the boundary line belong to Class 1 and the points below it belong to Class 2. The boundary line can be shifted away from the origin by the bias.

Figure 6. Signal-flow graph of the perceptron. Source: http://sebastianraschka.com

To produce the desired classification of the input vectors, the perceptron's optimal weights and bias can be learned by training iteratively on the data set. As mentioned above, Rosenblatt proved that if the training examples are drawn from two linearly separable classes, the weight-update algorithm converges. Denote the time step of the algorithm by n. The weights of the perceptron are adjusted according to the error-correction rule, which is defined by

w(n + 1) = w(n) + η[d(n) − y(n)]x(n)    (8)

where η is the learning rate, a fixed parameter between 0 and 1, d(n) is the desired output and y(n) is the actual output. A single perceptron can only perform simple pattern classification for two classes that are completely separable by a hyperplane[4]. Nonlinearly separable pattern classification is beyond the computational ability of the perceptron.

Figure 7. Hyperplane as decision boundary for a two-dimensional, two-class pattern-classification problem
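A minimal sketch of the error-correction rule of Eq. (8) in Python follows; the data set and parameter values are hypothetical and only illustrate the update step.

import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=50):
    # Eq. (8): w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n); the bias is updated likewise
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = 1.0 if np.dot(w, x) + b >= 0.0 else -1.0   # threshold output
            w += eta * (target - y) * x
            b += eta * (target - y)
    return w, b

# toy linearly separable example (hypothetical values)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, -1.5]])
d = np.array([1, 1, -1, -1])
print(train_perceptron(X, d))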

2.5 Deep neural network

Due to the practical limitation of the single-layer network to linearly separable problems, the deep neural network was introduced to solve arbitrary classification problems[4]. It contains one or more hidden layers whose computational nodes are called hidden nodes. The depth of the model refers to the number of hidden layers. Figure 8 shows the topology of a deep neural network with two hidden layers and an output layer. The input information enters the first hidden layer, and the outputs of this layer are transferred as inputs to the second hidden layer, and so on. Each layer receives the outputs of the previous layer as inputs; thus the input signal propagates forward layer by layer until the output layer. An error signal is produced at the neurons in the output layer and is propagated backward through the network.

Figure 8. Structure of deep neural network with two hidden layers. Source: www.mql5.com


Each neuron in the hidden and output layers performs two operations. The first is to compute the differentiable activation function of the weighted inputs and to propagate the result forward through the network. The second is to compute the gradient of the error with respect to the weights connected to the inputs of that neuron, which flows backward through the network.

Deep neural networks can be used for pattern recognition and classification. Several key concepts are introduced below, based on the website deeplearning4j.org. The input layer consists of the components of a feature vector to be classified. Each hidden layer of neurons processes a different set of features, namely the outputs of the previous layer. The more hidden layers the network has, the more complicated the features that can be detected by the neurons, since they collect and combine information generated by the previous layer. The nonlinearity applied to the training data gives the network greater computational power, allowing it to implement more complex functions than a network without hidden neurons.

3 Estimation of deep neural network

This chapter introduces the estimation of deep neural networks, based on Simon Haykin (2009). The learning process of a neural network consists of estimating the optimal weight parameters by training on the data set to obtain the expected output. Training begins by initializing the weights with small random values; to ensure that the network learns properly, the weights need to be initialized with different values, and choosing large weights would result in saturation of the network. In general, an effective way to initialize the weights is to pick values randomly from a uniform distribution. Learning is then carried out by feeding input data into the network to produce an output. The performance of the neural network is measured by the difference between the actual output and the desired output. This difference is described by a loss function, which is minimized by adjusting the weights.

When the network learns from a data set whose category labels are known, it employs supervised learning. Each training example consists of a pair of data: an input vector and the corresponding expected output. The network learns a function from the labeled training set so that it can map unlabeled inputs to the correct outputs in order to perform classification or regression[12][13]. Training on an unlabeled data set is referred to as unsupervised learning, which is not discussed here.

Consider a training set D = {(x_1, d_1), ..., (x_N, d_N)} used to train a network with one or more hidden layers. Denote by y_j(n) the output of neuron j in the output layer produced by the example x(n) presented at the input layer; the corresponding loss produced at neuron j is

e_j(n) = d_j(n) − y_j(n)    (9)

For convenience when taking derivatives of the loss function we add a factor of 1/2 to e_j²(n); summing the losses of all neurons in the output layer, the total loss of the whole network is expressed by

E_n(w) = (1/2) ∑_{j∈C} e_j²(n)    (10)


where the set C contains all the neurons in the output layer. With the training set consisting of N examples, the average loss over the entire training set is defined by

E_av(w) = (1/N) ∑_{n=1}^{N} E_n(w) = (1/(2N)) ∑_{n=1}^{N} ∑_{j∈C} e_j²(n)    (11)

One important point is that E is a function of all the modifiable weights, i.e., the free parameters of the network.

Batch learning and on-line learning
There are two different methods for implementing supervised learning of a neural network, namely batch learning and on-line learning[4]. Batch learning, also called standard gradient descent, performs an adjustment in weight space after presenting all the examples in the training set, which constitutes one epoch of training. In other words, the loss function of batch learning is E_av(w) and the gradient is calculated over the entire training set. In the on-line method of supervised learning, examples are chosen randomly from the training set and a step of adjustment in weight space is performed for each example presented to the network. Therefore the loss function to be minimized is E_n(w).

In practice, batch learning leads to redundant computations for larger data sets, since similar examples yield similar gradients (described in the following section)[14]. Therefore stochastic gradient descent (SGD) was proposed, which computes the gradient on a subset of the whole training set. At each step we take M random samples from the N training examples, compute the gradient and then update the weights. The loss function to be optimized therefore becomes:

E_av(w) = (1/(2M)) ∑_{m=1}^{M} ∑_{j∈C} e_j²(m)    (12)

where M is an integer between 1 and N. SGD is the most commonly used optimization method; it results in a faster training procedure and can also be used for on-line learning[14].
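A schematic mini-batch loop corresponding to Eq. (12) might look as follows; grad_E stands for the gradient computation described in the remainder of this chapter, and the linear-model example at the end is purely illustrative.

import numpy as np

def sgd(X, D, grad_E, w, eta=0.01, batch_size=32, epochs=10):
    # at each step draw M random examples, compute the gradient of the
    # average loss over that mini-batch, and update w <- w - eta * gradient
    N = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            w = w - eta * grad_E(w, X[batch], D[batch])
    return w

# illustrative gradient of the squared loss for a linear model y = X w
def grad_E(w, Xb, Db):
    return Xb.T.dot(Xb.dot(w) - Db) / len(Db)

X = np.random.randn(100, 3)
D = X.dot(np.array([1.0, -2.0, 0.5]))
print(sgd(X, D, grad_E, w=np.zeros(3)))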

Back-propagation algorithm
The back-propagation algorithm is one of the simplest and most widely used methods for training supervised neural networks, used to calculate the gradient of the loss function. The gradient is the slope of the loss function in weight space, showing how the loss varies as the weights change. The loss (error) produced by the current weights is propagated backward through the network to update the weights. As the network learns from its mistakes in every training iteration, it adjusts the weights iteratively to reduce the loss.

As shown in Fig. 9, we start searching on the loss surface from the initial weights, then move a step in the direction opposite to the gradient; the size of the step is determined by both the learning rate and the slope of the gradient[14]. In mathematical terms, the rule for updating the weights is defined by

w = w − η ∇E_av(w)    (13)


where η is the learning rate, a fixed parameter between 0 and 1. A smaller learning rate gives smaller changes in weight space and a longer training time to reach the minimum of the loss function; a larger learning rate, however, can give an unstable result. The gradient ∇E = (∂E/∂w_0, ..., ∂E/∂w_n) is calculated using the back-propagation algorithm.

Consider a neuron j fed by a set of inputs y_1(n), y_2(n), ..., y_m(n); the net input of the activation function at neuron j is

v_j(n) = ∑_{i=0}^{m} w_{ji}(n) y_i(n)    (14)

where w_j0, corresponding to the fixed input y_0 = +1, equals the bias b_j connected to neuron j. The output of the activation function f at neuron j is therefore

y_j(n) = f_j(v_j(n))    (15)

According to the chain rule, the gradient of the loss with respect to the weight w_ji(n) can be written as

∂E(n)/∂w_ji(n) = (∂E(n)/∂e_j(n)) (∂e_j(n)/∂y_j(n)) (∂y_j(n)/∂v_j(n)) (∂v_j(n)/∂w_ji(n)) = −e_j(n) f'_j(v_j(n)) y_i(n)    (16)

where the factors are the derivatives of the loss of the whole network E(n) = (1/2) e_j²(n) with respect to the error signal, of the error signal e_j(n) = d_j(n) − y_j(n) with respect to the output, of the activation output y_j(n) with respect to the net input, and of the net input v_j(n) with respect to the weight.

Figure 9. Schematic of gradient descent with two weights. Source: http://www.webpages.ttu.edu

The adjustment Δw_ji(n) to w_ji(n) is defined by the delta rule:

Δw_ji(n) = −η ∂E(n)/∂w_ji(n)    (17)

where again η denotes the learning rate, and the minus sign indicates that the weights are changed in a direction that reduces the loss. Plugging Eq. (16) into Eq. (17) gives

Δw_ji(n) = η δ_j(n) y_i(n)    (18)


where the local gradient δ_j(n) is given by

δ_j(n) = e_j(n) f'_j(v_j(n))    (19)

To calculate the weight adjustment Δw_ji(n), the error signal e_j(n) at the output of neuron j is needed. There are two different cases, depending on whether neuron j is in the output layer or in a hidden layer.
Case 1: When neuron j is an output node, the error e_j(n) of neuron j calculated by Eq. (9) gives an explicit computation of the local gradient δ_j(n) using Eq. (19).
Case 2: When neuron j is in a hidden layer, the local gradient for the output of neuron j is defined by

δ_j(n) = −(∂E(n)/∂y_j(n)) (∂y_j(n)/∂v_j(n)) = −(∂E(n)/∂y_j(n)) f'_j(v_j(n))    (20)

where the second equality uses Eq. (15). A hidden neuron j contributes to the errors of all neurons in the next layer that are connected to j; the loss is thus given by E(n) = (1/2) ∑_{k∈C} e_k²(n), where k runs over the output neurons. Differentiating the loss function with respect to y_j(n) gives

∂E(n)/∂y_j(n) = ∑_k e_k(n) (∂e_k(n)/∂y_j(n)) = ∑_k e_k(n) (∂e_k(n)/∂v_k(n)) (∂v_k(n)/∂y_j(n))    (21)

The error at neuron k is

e_k(n) = d_k(n) − y_k(n) = d_k(n) − f_k(v_k(n))    (22)

Thus,

∂e_k(n)/∂v_k(n) = −f'_k(v_k(n))    (23)

The net input at neuron k is

v_k(n) = ∑_{j=0}^{m} w_{kj}(n) y_j(n)    (24)

where m is the number of inputs (excluding the bias) entering neuron k, and w_k0 = b_k with a fixed input of value +1. Thus

∂v_k(n)/∂y_j(n) = w_kj(n)    (25)

Using Eq. (23) and Eq. (25) in Eq. (21), we get

∂E(n)/∂y_j(n) = −∑_k e_k(n) f'_k(v_k(n)) w_kj(n) = −∑_k δ_k(n) w_kj(n)    (26)


Using Eq. (26) in Eq. (20), the back-propagation formula for the local gradient δ_j(n) at hidden neuron j is given by

δ_j(n) = f'_j(v_j(n)) ∑_k δ_k(n) w_kj(n)    (27)

To summarize the relationships derived for the back-propagation algorithm, we first note that the adjustment Δw_ji(n) of a weight is defined by the delta rule:

Δw_ji(n) = η δ_j(n) y_i(n)    (28)

The local gradient δ_j(n) differs depending on whether neuron j is in the output layer or in a hidden layer. If neuron j is an output node, the local gradient is calculated by δ_j(n) = e_j(n) f'_j(v_j(n)); if neuron j is a hidden node, δ_j(n) is given by δ_j(n) = f'_j(v_j(n)) ∑_k δ_k(n) w_kj(n).

Forward and backward passes
Haykin (2009) describes two different flows of information when applying the back-propagation algorithm, namely the forward and backward passes. In the forward pass, the input vector combined with the weights enters the first hidden layer; the output is then transferred to the second layer as input.

The output of neuron j is calculated as y_j(n) = f(∑_{i=0}^{m} w_ji(n) y_i(n)), where m is the number of neurons in the previous layer, f is the activation function of neuron j, w_ji is the weight connecting neuron i to neuron j, and y_i is an input of neuron j. If neuron j is in the first hidden layer, then y_i represents the ith element of the input vector; if j is in the output layer, then y_j is the jth output of the network. The input information is propagated layer by layer up to the output layer, where an error signal is produced for each output neuron, calculated as the difference between the desired value and the output value. Note that the weights are kept fixed during the forward pass.

The backward pass starts from the output layer by propagating the loss backward through the network and calculating the local gradient δ for each neuron recursively. The weights are adjusted according to the delta rule of Eq. (28) recursively during the backward pass. If neuron j is in the output layer, the local gradient equals the product of the derivative of the activation function of neuron j and the error signal at neuron j; the updates for the weights connecting to output-layer neurons can then be calculated directly by Eq. (28). Next we determine the local gradients for the neurons in the second-to-last layer according to Eq. (27) and update all weights feeding into those neurons. The changes in the weights are computed recursively and propagated backward until all weights in the network have been updated. Note that the original weights are used throughout the back-propagation computation before all weights in the network are updated.

Numerical example of back-propagation
We use the numerical example shown in Figure 10, with two layers, to illustrate the back-propagation algorithm. The input vector is x = [0.1, 0.5] with desired output d = 1. The sigmoid activation function is applied in the hidden and output layers, and the learning rate η is 0.1. Denote by w^(l)_ji the weight connecting neuron i to neuron j in layer l.


Figure 10. Illustration of back-propagation with two layers

The initial weight values are as follows: w^(1)_11 = 0.2, w^(1)_12 = 0.3, w^(1)_10 = 0.1 and w^(1)_21 = −0.5, w^(1)_22 = 0.4, w^(1)_20 = −0.1 in the hidden layer, and w^(2)_11 = −0.2, w^(2)_12 = 0.1, w^(2)_10 = 0.3 in the output layer.

The net input of the first neuron in the hidden layer is

net^(1)_1 = w^(1)_11 · x_1 + w^(1)_12 · x_2 + w^(1)_10 · 1 = 0.2 · 0.1 + 0.3 · 0.5 + 0.1 · 1 = 0.27

We then apply the sigmoid function to get the output:

out^(1)_1 = 1/(1 + e^(−0.27)) = 0.567

Carrying out the same process for the second neuron in the hidden layer we get

out^(1)_2 = 1/(1 + e^(−0.05)) = 0.512

We repeat the process for the output layer neuron, using the hidden-layer outputs as inputs:

net^(2) = w^(2)_11 · out^(1)_1 + w^(2)_12 · out^(1)_2 + w^(2)_10 · 1 = −0.2 · 0.567 + 0.1 · 0.512 + 0.3 · 1 = 0.2378

out^(2) = 1/(1 + e^(−0.2378)) = 0.56

The error at the output layer neuron is d − out^(2) = 0.44. The derivative of the sigmoid function σ is σ(1 − σ). The local gradient for the output layer neuron is

δ^(2) = σ'(net^(2)) · (d − out^(2)) = σ(net^(2)) · (1 − σ(net^(2))) · (d − out^(2)) = 0.56 · (1 − 0.56) · 0.44 = 0.108

The error derivative for the output layer neuron is ∂E_n/∂w^(2)_kj = −δ^(2) · out^(1)_j with k = 1, so by the delta rule of Eq. (28), Δw^(2)_kj = η δ^(2) out^(1)_j, the updated weights connecting to the output layer neuron are

w^(2)_11 = −0.2 + (0.1)(0.108)(0.567) = −0.194
w^(2)_12 = 0.1 + (0.1)(0.108)(0.512) = 0.106
w^(2)_10 = 0.3 + (0.1)(0.108) = 0.311

The local gradients for the hidden layer neurons are given by

δ^(1)_j = σ'(net^(1)_j) ∑_k w^(2)_kj δ_k = [σ(net^(1)_j)(1 − σ(net^(1)_j))] ∑_k w^(2)_kj δ_k

δ^(1)_1 = (0.567)(1 − 0.567)(−0.2)(0.108) = −0.0053
δ^(1)_2 = (0.512)(1 − 0.512)(0.1)(0.108) = 0.0027

The error gradient for neurons in the hidden layer is ∂E_n/∂w^(1)_ji = −δ^(1)_j x_i, so Δw^(1)_ji = η δ^(1)_j x_i. The updated weights of the connections feeding into the hidden layer are

w^(1)_11 = 0.2 + (0.1)(−0.0053)(0.1) = 0.19995
w^(1)_12 = 0.3 + (0.1)(−0.0053)(0.5) = 0.29974
w^(1)_21 = −0.5 + (0.1)(0.0027)(0.1) = −0.49997
w^(1)_22 = 0.4 + (0.1)(0.0027)(0.5) = 0.40014
w^(1)_10 = 0.1 + (0.1)(−0.0053) = 0.09947
w^(1)_20 = −0.1 + (0.1)(0.0027) = −0.09973
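The example can be checked numerically with a few lines of NumPy; this sketch mirrors the forward pass, the local gradients and the delta-rule updates above and is not taken from the thesis appendices.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

eta = 0.1
x = np.array([0.1, 0.5]); d = 1.0
W1 = np.array([[0.2, 0.3], [-0.5, 0.4]]); b1 = np.array([0.1, -0.1])  # hidden layer
W2 = np.array([[-0.2, 0.1]]);             b2 = np.array([0.3])        # output layer

# forward pass
h = sigmoid(W1.dot(x) + b1)      # approx. [0.567, 0.512]
y = sigmoid(W2.dot(h) + b2)      # approx. 0.56

# backward pass: local gradients of Eqs. (19) and (27)
delta2 = (d - y) * y * (1.0 - y)                     # approx. 0.108
delta1 = h * (1.0 - h) * W2.T.dot(delta2).ravel()    # approx. [-0.0053, 0.0027]

# delta rule of Eq. (28)
W2 += eta * np.outer(delta2, h); b2 += eta * delta2
W1 += eta * np.outer(delta1, x); b1 += eta * delta1
print(W1, b1, W2, b2)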

Rate of learning
As seen from the delta rule, the learning rate determines the size of the changes in the weights. A smaller learning rate gives smaller weight changes between iterations, which results in a smoother trajectory in weight space; this, however, means a longer training time to reach the local minimum of the error surface. A larger learning rate can make the behaviour of the network unstable. In order to avoid instability while increasing the learning rate, the delta rule for updating the weights is modified into the generalized delta rule as follows:

Δw_ji(n) = α Δw_ji(n − 1) + η δ_j(n) y_i(n)    (29)

where α is a positive number called the momentum constant, which relates the adjustment of weight w_ji at iteration n to its adjustment at iteration n − 1. Eq. (29) can be seen as a time series with index t running from the initial time to the current time n. Solving this equation we get:

Δw_ji(n) = η ∑_{t=0}^{n} α^(n−t) δ_j(t) y_i(t)    (30)

With the help of the error derivatives ∂E(t)/∂w_ji(t) from the previous equations we get

Δw_ji(n) = −η ∑_{t=0}^{n} α^(n−t) ∂E(t)/∂w_ji(t)    (31)
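As a small illustration of Eq. (29), a momentum update can be written as a reusable helper; the gradient value passed in below is an arbitrary placeholder.

import numpy as np

def momentum_update(w, prev_dw, grad, eta=0.01, alpha=0.9):
    # generalized delta rule, Eq. (29): the new weight change adds alpha times
    # the previous change to the usual gradient-descent step -eta * grad
    dw = alpha * prev_dw - eta * grad
    return w + dw, dw

w = np.zeros(3)
dw = np.zeros(3)
w, dw = momentum_update(w, dw, grad=np.array([0.2, -0.1, 0.05]))
print(w, dw)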

Early stopping of training
The goal of training a neural network for a classification problem is to obtain good performance on unseen data that has the same distribution as the training data[15][16]. When a neural network learns the training data excessively, it concentrates on finding features that only exist in the training data rather than the true underlying pattern[4]. This leads to poor generalization, which is referred to as overfitting[17].

To improve generalization and avoid overfitting, an optimal stopping time for training is required. Because the weights are initialized with random values, the error on the training set is initially large. As the number of training iterations increases, the error first decreases dramatically and then continues to decrease slowly before the network reaches the required minimum error for stopping training. However, as the training error gradually decreases to a stable value, the error on unseen data increases. Therefore the best generalization usually appears before the training error reaches its local or global minimum. To find the optimal stopping time, the data are divided into disjoint sets: a training set, a validation set not used for training, and a test set. After every period of training, for instance every three epochs, the weights and biases are fixed and the network is tested on the validation set. The error over the examples in the validation set is evaluated as the validation error. Training is stopped when a minimum of the validation error is reached, and the weights of the network at that point are used as the solution. This procedure is called the early-stopping method of training and is widely used in practice. Figure 11 shows the estimation (training) learning curve and the validation learning curve, the latter presenting the average error on the validation set.

Figure 11. Early-stopping training method. Source: Simon Haykin (2009, p. 204)
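A minimal sketch of the early-stopping procedure is given below; the model object and its methods (train_one_epoch, error, get_weights, set_weights) are hypothetical placeholders for whatever training routine is used.

def train_with_early_stopping(model, X_train, d_train, X_val, d_val,
                              max_epochs=200, check_every=3, patience=5):
    # `model` is a hypothetical wrapper around the training routine.
    # Stop when the validation error has not improved for `patience`
    # consecutive checks, and keep the best weights seen so far.
    best_error = float("inf")
    best_weights = model.get_weights()
    waited = 0
    for epoch in range(max_epochs):
        model.train_one_epoch(X_train, d_train)
        if (epoch + 1) % check_every == 0:
            val_error = model.error(X_val, d_val)
            if val_error < best_error:
                best_error, best_weights, waited = val_error, model.get_weights(), 0
            else:
                waited += 1
                if waited >= patience:
                    break
    model.set_weights(best_weights)
    return model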

L2 regularization
An excessively complex neural network, especially one with too many parameters, will lead to overfitting [18]. To simplify the network for better generalization, a regularization term is added to the loss (error) function to penalize the complexity[19]. One widely used method is to penalize the sum of the squares of the weights; the loss (error) function with the regularization term added is then defined as[18]:

L(w) = (1/2) ∑_{i=1}^{m} (d_i − y_i)² + (λ/(2m)) ∑_w w²    (32)

where w denotes the neural network parameters, m is the number of training examples in the training set, and λ is a regularization parameter that controls the trade-off between the two terms in the loss function. This regularization approach is referred to as L2 regularization (weight decay). Adding the regularization term keeps the parameters from becoming too large and can therefore reduce overfitting to a certain extent.
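A direct transcription of Eq. (32) into NumPy is sketched below; the numbers fed to it are arbitrary illustrative values.

import numpy as np

def l2_regularized_loss(d, y, weight_matrices, lam):
    # squared-error term plus the weight-decay penalty of Eq. (32)
    m = len(d)
    data_term = 0.5 * np.sum((d - y) ** 2)
    penalty = lam / (2.0 * m) * sum(np.sum(W ** 2) for W in weight_matrices)
    return data_term + penalty

d = np.array([1.0, 0.0])
y = np.array([0.8, 0.2])
weights = [np.array([[0.2, -0.3], [0.1, 0.4]])]
print(l2_regularized_loss(d, y, weights, lam=0.01))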

4 Implementation

The neural network learning algorithms and model structures expressed mathematically need to be realized by computer programs for real-world use[20]. Therefore a number of open source machine learning software libraries have been developed, in particular for deep learning, which employs deep neural networks with large numbers of hidden layers. We present several such libraries, namely Torch, Theano, Caffe and TensorFlow, based on the article by Peter Goldsborough (2016).

4.1 Deep learning software libraries

Torch, released in 2002[21], is the oldest of these machine learning libraries for training deep neural networks. It can implement a wide range of advanced learning algorithms in one integrated framework. Torch was originally implemented in C++; today it uses Lua as its frontend. It includes four basic classes: dataset, trainer, machine and measure. The Trainer feeds input produced by the Dataset to the Machine, which produces an output that is used to modify the Machine. During the training process, Measures evaluate the performance of the model.

Theano was released under the BSD license in 2008 and is developed by the LISA group (now MILA) at the University of Montreal, Canada[22]. It is specifically designed for the large-scale computations required by deep neural networks. Theano is a mathematical compiler implemented as a Python library. It defines mathematical expressions symbolically, as a computational graph, which allows symbolic differentiation of intricate expressions. The expressions are optimized and translated into C++, then compiled and executed efficiently on CPU or GPU devices. It is one of the pioneering and most popular deep learning libraries.

Caffe, released in 2014 under a BSD license, is an open source framework for deep learning algorithms maintained by the Berkeley Vision and Learning Center[23]. The code is implemented in C++ with CUDA for computation on GPU devices. Its computational tasks are organized around network layers. Caffe is especially applied to training and deploying convolutional neural networks (CNNs) and is widely used for image recognition.

4.2 The TensorFlow programming model

One emerging open source deep learning library is TensorFlow, released by Google in 2015, which is used for defining, training and deploying deep neural networks. In this section we present the basic structure of TensorFlow and explain how a machine learning algorithm is expressed as a computational graph in TensorFlow. Next we discuss the execution model, that is, how the computational graph is realized on the underlying processing devices. Then we investigate several optimizations built into TensorFlow with respect to both software and hardware, and subsequently we discuss an extension of the basic programming model. Lastly, the programming interface and the visualization tool of TensorFlow are introduced.


4.2.1 Computational graph structure

TensorFlow is a programming system that uses a computational or dataflow graph to represent computation tasks (learning algorithms). The computational graph consists of nodes and directed edges connecting the nodes. The nodes represent operations and the edges represent the data flowing between operations. The principal components of the computational graph, namely operations, tensors, variables and sessions, are presented below.

1. Operations: In TensorFlow, nodes represent operations. More precisely, nodes describe how input data flow through them in the directed graph. An operation can take zero or more inputs and produce zero or more outputs. Such an operation can be a mathematical function, a constant or a variable. Figure 3.1 shows examples of operations in TensorFlow. A constant is an operation that takes no inputs and produces as output the corresponding constant value. Similarly, a variable is an operation that takes no inputs and produces the current value of that variable. Every operation is implemented by a kernel that can be executed on a hardware device such as a CPU or GPU.

Figure 3.1. Examples of operations in TensorFlow. Source: Peter Goldsborough (2016)

2. Tensors: In TensorFlow, data are represented by tensors flowing between nodes in the computational graph. A tensor is a multi-dimensional array with a static type and dynamic dimensions. The number of dimensions of a tensor is referred to as its rank, and a tensor's shape describes the number of components in each dimension. When an operation is created in the computational graph, a tensor is returned, which is sent along a directed edge as input to the connected operation.

3. Variables: When performing stochastic gradient descent, the computational graph of the neural network is executed iteratively for a single experiment. Most tensors do not survive across these evaluations, whereas the state of the model, such as the weights and biases, needs to be maintained. Hence variables are added to the computational graph as special operations. The tensors held by variables are stored persistently in in-memory buffers, and their values can be loaded while training and evaluating the model. When a variable is created, it needs to be supplied with a tensor as its initial value upon execution; the shape and data type of that tensor automatically become the variable's shape and type. Variables must be initialized before training, which can be done by adding an operation that initializes all variables and executing it before training the network.

4. Sessions: The execution of operations and the evaluation of tensors are performed in the context of a session. The session provides a Run routine as the entry point for executing the computational graph: with an invocation of Run, the inputs are fed through the entire graph and the outputs are returned according to the graph definition. The computational graph is executed repeatedly, by invoking the Run routine, while training the network. The session distributes the operations of the graph to devices such as CPUs or GPUs, possibly on different machines, according to TensorFlow's placement algorithm, which is presented later. In addition, the ordering of node executions can be defined explicitly through control dependencies; evaluating the model ensures that these control dependencies are respected.
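The following sketch, written against the TensorFlow 1.x API that was current when this thesis was written, illustrates the four components above: operations, tensors, a variable and a session (the shapes and initial values are arbitrary).

import tensorflow as tf

# construction phase: nodes (operations) produce and consume tensors
x = tf.placeholder(tf.float32, shape=[None, 2], name="x")          # input tensor
W = tf.Variable(tf.random_uniform([2, 1], -0.1, 0.1), name="W")    # stateful variable
b = tf.Variable(tf.zeros([1]), name="b")
y = tf.sigmoid(tf.matmul(x, W) + b)                                 # operations

init = tf.global_variables_initializer()   # operation initializing all variables

# execution phase: a session runs the graph
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(y, feed_dict={x: [[0.1, 0.5]]}))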

4.2.2 Execution model

The task of executing the computational graph is divided among four kinds of components: the client, the master, workers and a set of devices. The client sends a request to execute the graph via the Run routine to the master process, which is responsible for assigning tasks to a set of workers. Each worker is responsible for monitoring one or more devices, the physical entities that implement the kernels of the operations. Based on the number of machines, there are two versions of TensorFlow, the local and the distributed implementation: in a local system the nodes are executed on a single machine, whereas a distributed system implements the nodes on many machines with many devices.

Devices are the basic and smallest physical units in TensorFlow that execute the operations. The kernels of the nodes are assigned to available devices such as CPUs or GPUs to be executed. Furthermore, TensorFlow allows users to register new physical implementation units. To monitor the operations on a device, each worker process is responsible for one or more devices on a single machine. A device's name is determined by its type and the index of the worker group in which it is located.

The placement algorithm determines which nodes are assigned to which device. The algorithm simulates execution of the graph from the input tensors to the output tensors. It adopts a cost model C_v(d) to determine on which device d from the set D = {d_1, ..., d_n} to execute a particular node v. The optimal device is determined by the minimum of the cost model, d* = argmin_{d∈D} C_v(d), which is used to place a given node during training.

Cross-device execution: TensorFlow usually assigns nodes to different devices as long as the user's system provides multiple devices. This process classifies the nodes and assigns nodes belonging to the same class to one device. It is therefore necessary to deal with node dependencies that are split across devices. Consider two such devices, where node v is on device A. If the tensor produced by operation v is transmitted as input to two different operations α and β on device B, then there exist edges v → α and v → β crossing from device A to B, as shown in Figure 3.2(a):


Figure 3.2. Three stages of cross-device communication between graph nodes in TensorFlow. Source: Peter Goldsborough (2016)

In practice, the transmission of the output tensor of v from device A, for instance a GPU, to device B, for instance a CPU, is accomplished by inserting two kinds of nodes, namely send and recv, as shown in Figure 3.2(b). Finally, TensorFlow uses "canonicalization" to optimize the (send, recv) pairs into an equivalent but more efficient form: the two recv nodes connecting to α and β on device B are replaced by a single recv node that receives the output of v and passes it to the two dependent nodes α and β.

4.2.3 Optimizations

To ensure the efficiency and performance of the TensorFlow execution model, several optimizations are built into the library. Common subgraph elimination, performed while traversing the computational graph, canonicalizes operations of the same type applied to an identical input tensor into a single operation; the output tensor is then passed to all dependent nodes. Another optimization, scheduling, executes nodes as late as possible, which ensures that the results of operations remain in memory only for the minimum required time; this effectively reduces memory consumption and improves the performance of the model. Lossy compression refers to adding conversion nodes to the graph: a robust model should not change its output because of small variations in noise, so the required arithmetic precision of the algorithm can be reduced. Based on this principle, when data are communicated across devices or machines, a 32-bit floating-point representation is truncated to a 16-bit representation by a conversion node at the sender, and converted back to 32 bits at the receiving end by simply filling in zeros.


4.2.4 Gradient Computation

In this section we describe one advanced feature that extends the basic TensorFlow programming model presented in Section 4.2.1.

Machine learning algorithms require the gradient of specific nodes with respect to one or more other nodes in the computational graph. In a neural network, the gradient of the cost with respect to the weights must be computed for the training examples fed into the network. The back-propagation algorithm discussed in Chapter 3 is used to compute the gradient backwards, starting from the end of the graph.

Two methods for back-propagating gradients through a computational graph are described in [24]. The first is referred to as symbol-to-number differentiation: inputs are propagated forward through the graph to compute the cost function, and the gradient is then explicitly calculated via the chain rule in the opposite direction through the graph. The second method, which is the one adopted in TensorFlow, is symbol-to-symbol differentiation; it computes the gradient automatically rather than by explicitly applying the back-propagation algorithm. In this approach, special nodes are added to the computational graph that compute the gradient of each operation in the chain. To realize the back-propagation algorithm, the gradient nodes are executed just like all other nodes by the graph evaluation engine. This method provides a symbolic handle for computing derivatives instead of calculating derivatives as numerical values. It can be explained more specifically as follows:

The gradient of a particular node v with respect to some other tensor α is computed backward from v to α through the graph. The graph is extended by adding a gradient node for each operation o that is a function of α encountered in the chain (v ◦ · · · ◦ o ◦ · · · )(α) producing the output tensor. For each such operation o, TensorFlow adds a gradient node that multiplies the derivative of its outer function with o's own derivative. The procedure proceeds backwards to the end node, producing a symbolic handle to the desired gradient dv/dα, which implicitly performs the back-propagation method. Note that a symbol-to-symbol derivative is just another operation, with no special status.
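A small sketch of symbol-to-symbol differentiation with the TensorFlow 1.x API follows; tf.gradients extends the graph with gradient nodes and returns a symbolic handle rather than a numerical value (the constants here are arbitrary).

import tensorflow as tf

w = tf.Variable(2.0)
x = tf.constant(3.0)
loss = tf.square(w * x - 1.0)

# extends the graph with gradient nodes and returns d(loss)/dw as a tensor
grad = tf.gradients(loss, [w])[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))   # 2 * (w*x - 1) * x = 30 for these values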

According to [25], symbol-to-symbol differentiation may incur considerable computational cost as well as increased memory overhead. The reason is that there are two different ways of applying the chain rule. The first reuses previous computations, which therefore have to be stored for longer than they would be for the forward propagation alone. The chain rule is then applied as shown below:

df/dw = f'(y) · g'(x) · h'(w)    (33)

with y = g(x) and x = h(w). The alternative way of expressing the chain rule is given by Eq. (34):

df/dw = f'(g(h(w))) · g'(h(w)) · h'(w)    (34)

It shows that each function recalculates all of its arguments and invokes every inner function it depends on. Currently TensorFlow applies the first approach, according to [25]. Considering a chain with thousands of nodes, recomputing the innermost function for almost every operation in the chain does not seem sensible. On the other hand, it is also not optimal to store tensors for a long time on a device, especially a GPU, where memory is scarce. For Eq. (34), in theory a tensor stored on the device can be freed as soon as its dependent operations have been processed. Therefore, according to the TensorFlow development team [25], recomputing some tensors rather than storing them on the device may be adopted in future work.

4.2.5 Programming interface

After discussing TensorFlow's computational model, we turn to the more practical programming interface. We discuss the available language interfaces and summarize the high-level abstractions of the TensorFlow API that allow machine learning prototypes to be created quickly.

TensorFlow supports the C++ and Python programming languages, which allow users to call backend functionality. Currently, TensorFlow's Python programming interface is easier to use, since it provides a variety of functions that simplify and complete the construction and execution of the computational graph and that are not yet supported in C++. It is important to note that TensorFlow's API is well integrated with NumPy, the library for scientific computing with Python; TensorFlow tensors and NumPy ndarrays are therefore interchangeable in many situations.

A TensorFlow program consists of two distinct phases: a construction phase and an execution phase. In the construction phase, the operations of the computational graph are created to represent the structure and learning algorithm of the neural network. In the execution phase, the operations in the graph are executed repeatedly to train the network.

For a deep neural network with a large number of hidden layers, manually creating the weights and biases, computing the matrix multiplications and additions, and applying the nonlinear activation functions layer by layer is not efficient. Therefore a number of open source libraries have been proposed that abstract and package these steps and build higher-level blocks, such as entire layers, at one time. One abstraction library is PrettyTensor, developed by Google, which provides a high-level interface to the TensorFlow API. It allows the user to wrap TensorFlow operations and tensors in a cleaner form and then quickly chain layers in series. TFLearn is another abstraction library built on top of TensorFlow, which can be mixed with plain TensorFlow code. It provides highly packaged network layers to quickly construct the computational graph and allows network layers to be chained rapidly. Moreover, compared with PrettyTensor, which uses tf.Session to train the model, TFLearn can directly take training examples and corresponding labels to train a model easily.


4.2.6 TensorBoard

Deep learning usually employs complex networks. In order to apply and debug such a complicated network, a strong visualization tool is required. TensorBoard is a web interface that visualizes and manipulates the computational graphs built in TensorFlow. Its main feature is to construct a clear and organized visual representation of a computational graph with a complicated structure and many layers, making it easy to understand how the data flow through the graph. One important grouping mechanism in TensorFlow is name scopes: the operations, inputs, outputs and relationships belonging to one name scope can be grouped into one block, such as a single network layer, which can be expanded interactively to show the details of the block.

Furthermore, TensorBoard can trace the changes of a single tensor during training. Summary operations are added to nodes of the computational graph to produce reports of the tensor's values. The TensorBoard web interface is interactive: once the computational graph has been uploaded, one can observe the model as well as monitor the operations.
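A minimal sketch of how summaries are produced with the TensorFlow 1.x API is given below; the log directory and the scalar being tracked are arbitrary choices for illustration.

import tensorflow as tf

loss = tf.Variable(1.0, name="loss")
tf.summary.scalar("loss", loss)          # summary operation reporting one tensor
merged = tf.summary.merge_all()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter("/tmp/logs", sess.graph)   # also saves the graph
    for step in range(3):
        writer.add_summary(sess.run(merged), step)
    writer.close()
# inspect with: tensorboard --logdir /tmp/logs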

4.3 Comparison with other deep learning libraries

In this section we compare TensorFlow with the other machine learning frameworks discussed in Section 4.1.

Theano: Among the three alternative libraries, Theano is the most similar to TensorFlow. Like TensorFlow, its programming model is declarative rather than imperative and based on a computational graph, and it also adopts symbolic differentiation. Since it applies more advanced optimization algorithms to the graph, whereas TensorFlow only performs common subgraph elimination, Theano has longer graph compile times. Moreover, Theano's visualization interface is poorer than TensorBoard. Lastly, Theano lacks the ability to execute the computational graph in a distributed fashion, which TensorFlow supports.

Torch: The main difference between Torch and TensorFlow is that Torch has a C/CUDA backend and uses Lua as its frontend, which is not a mainstream programming language compared to Python. This makes industrial adoption harder than for Python-based TensorFlow. In addition, Torch's programming model differs from TensorFlow's in that it is imperative rather than declarative, which implies that Torch performs symbol-to-number differentiation to compute the gradients used to optimize the model.

Caffe: Caffe differs significantly from TensorFlow in many respects. It uses MATLAB and Python as frontends for constructing models. The basic building unit of Caffe is the entire network layer, rather than the operation as in TensorFlow. Similarly to Torch, Caffe does not focus on the construction of a computational graph, and consequently it computes gradients using the symbol-to-number method. Moreover, Caffe is especially suited to convolutional neural networks and image recognition, but it is not as versatile as TensorFlow across other kinds of neural networks. Lastly, Caffe does not support distributed execution.


5 Experiments

In this section we conduct experiments for detecting fraudulent credit card transactions and present the results. The data set consists of credit card transactions made by European cardholders during two days in September 2013. It contains 284,807 transactions, of which 492 are frauds, accounting for 0.172% of all transactions.

A description of the credit card data set is given on www.kaggle.com. The data set consists of thirty numerical features, most of which have already been transformed by principal component analysis (PCA) to reduce the dimension of the feature space; for reasons of information security, the original features of the consumers are not given. Features V1, V2, ..., V28 are the principal components obtained with PCA. Two features, 'Time' and 'Amount', are not PCA-transformed: 'Time' is the time elapsed between each transaction and the first transaction in the data set, and 'Amount' is the amount of the transaction. 'Class' is the target label, with value 1 representing a fraudulent transaction and 0 a normal one.

Standardization of the feature space is a common requirement for training the network effectively, since neural networks are sensitive to the way the input vectors are scaled [26]. In many datasets the features are scaled differently and have different ranges of values, which results in longer training times before the network converges. It is therefore helpful to standardize all the features, transforming each one towards a standard normal distribution by removing its mean and then dividing non-constant features by their standard deviations.
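A minimal sketch of this standardization step with Scikit-learn's preprocessing module (the same call used in Appendix A):

    import pandas as pd
    from sklearn import preprocessing

    data = pd.read_csv('creditcard.csv')
    features = data.drop('Class', axis=1)
    # Each feature column is centred to zero mean and scaled to unit standard deviation.
    X_scaled = preprocessing.scale(features)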

In the data set we consider fraud transactions as the positive class and normal transactions as the negative class. There are four possible prediction outcomes, which can be arranged in a 2 × 2 confusion matrix as shown in Table 5.1. The columns represent the predicted label and the rows the true label. TN is the number of normal transactions correctly classified (True Negatives), FP the number of normal transactions incorrectly classified as fraud (False Positives), FN the number of fraud transactions incorrectly classified as normal (False Negatives) and TP the number of fraud transactions correctly classified (True Positives).

The performance of a neural network learning algorithm is normally evaluated using predictive accuracy [27], defined as Accuracy = (TP + TN)/(TP + FP + TN + FN). Unfortunately, such a simple predictive accuracy is not an appropriate measure here, since the distribution of normal and fraud transactions in the data set is extremely imbalanced. The trivial strategy of classifying every transaction as normal already yields a very high accuracy, yet despite this high accuracy such a model cannot detect any fraud transactions at all. Our goal in training the network is to obtain a fairly high rate of correctly detected fraud transactions, even at the cost of a somewhat higher error rate on the normal transactions [27]. Simple predictive accuracy is therefore not an effective metric in this case. The Receiver Operating Characteristic (ROC) curve is a standard technique to evaluate the performance of a model for binary classification.

Table 5.1 Confusion matrix of classifying normal and fraud transactions

A ROC curve describes the performance of a model by the relationship between the true positive rate (TPR) and the false positive rate (FPR) over a range of decision thresholds [28]. In ROC space the X-axis is the false positive rate, FPR = FP/(TN + FP) (equal to 1 − specificity), and the Y-axis is the true positive rate, TPR = TP/(TP + FN) (the sensitivity). The best prediction occurs at the point (0, 1) in ROC space, representing the case where all positive classes are classified correctly and no negative classes are classified as positive. A random prediction produces points along the diagonal line y = x. The closer the ROC curve is to the upper left corner, the more accurate the model. ROC curves therefore make it easy to compare the performance of different models. The Area Under the ROC Curve (AUC) is an effective metric for summarizing a ROC curve; generally, a ROC curve with larger AUC indicates a better-performing model.

Other evaluation metrics can be derived from the confusion matrix in Table 5.1. Here we choose to present one particular evaluation metric, recall, defined as TP/(TP + FN). Recall is the percentage of positive cases that the model detects correctly among all positive cases. It is used to compare models when performance on the positive (minority) class is what matters most [29].
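As a hedged sketch of how these quantities can be computed with Scikit-learn (the labels and scores below are toy placeholders, with 1 = fraud and 0 = normal):

    import numpy as np
    from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score

    y_true = np.array([0, 0, 1, 1, 0, 1])                    # toy true labels
    y_score = np.array([0.1, 0.4, 0.8, 0.3, 0.2, 0.9])       # toy predicted fraud probabilities
    y_pred = (y_score >= 0.5).astype(int)                    # decision threshold of 0.5

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = (tp + tn) / float(tp + fp + tn + fn)          # simple predictive accuracy
    recall = recall_score(y_true, y_pred)                    # TP / (TP + FN)
    auc = roc_auc_score(y_true, y_score)                     # area under the ROC curve
    print(accuracy, recall, auc)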

Logistic regression is chosen as the benchmark model; it can be regarded as the simplest neural network when the sigmoid activation function is used [30]. We then construct neural networks with more complex structures to investigate whether they can outperform the logistic regression model. The experiments are implemented in Scikit-learn and TensorFlow respectively.

5.1 Logistic regression model

Logistic regression is a regression method for predicting a binary dependent variable which takes value 0 or 1. Considering an input vector with n independent variables x = (x_1, x_2, ..., x_n), the conditional probability of the dependent variable being class 1 is defined as

\[
P(y = 1 \mid x) = \pi(x) = \frac{1}{1 + e^{-g(x)}} \qquad (35)
\]

where 1/(1 + e^{-g(x)}) is the logistic or sigmoid function and g(x) = w_0 + w_1 x_1 + \cdots + w_n x_n. Assume there are m observed dependent variables y_1, y_2, ..., y_m. The probability distribution of the dependent variables is given by

\[
P(Y_i = y_i) =
\begin{cases}
\pi(x_i)^{y_i}\,(1 - \pi(x_i))^{1 - y_i}, & y_i = 0 \text{ or } 1,\\
0, & \text{otherwise.}
\end{cases}
\]

The likelihood function is the product of these probabilities:

\[
L(w) = \prod_{i=1}^{m} \pi(x_i)^{y_i}\,(1 - \pi(x_i))^{1 - y_i} \qquad (36)
\]

The optimal parameters w_0, w_1, ..., w_n are estimated by maximizing the likelihood function, or equivalently its logarithm \log L(w). The loss function of logistic regression to be minimized then simplifies to

\[
C(w) = -\frac{1}{m} \sum_{i=1}^{m} \Big( y_i \log[\pi(x_i)] + (1 - y_i)\log[1 - \pi(x_i)] \Big) \qquad (37)
\]
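A minimal Scikit-learn sketch of fitting such a model on the standardized credit card features is given below. It is an illustration of the benchmark setup, not the exact script behind the reported figures; the solver defaults, the train/test split and the regularization strength are assumptions made for the example.

    import pandas as pd
    from sklearn import preprocessing
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, recall_score

    data = pd.read_csv('creditcard.csv')
    X = preprocessing.scale(data.drop('Class', axis=1))
    y = data['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = LogisticRegression()                    # sigmoid model whose training minimizes the loss in (37)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))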

In logistic regression the model complexity is already low, which makes the model less prone to overfitting during training [30]. The result of the Scikit-learn implementation is shown below:

5.2 Neural network using Scikit-learn

In this section deep neural networks are constructed and implemented in Scikit-learn, an open-source machine learning library. The data set is split into a training set and a test set. The network is trained on the training set and its performance is evaluated on the test set. Because the data are imbalanced, we train the network both without and with re-sampling of the training set, to investigate the impact of re-sampling on the performance of the network.

The design of the neural network topology is a critical factor for the accuracy of the classification system. Adding hidden nodes can increase the accuracy of the network; however, an excessive number of hidden nodes causes overfitting, which harms generalization and leads to deviations in prediction. Improving both accuracy and generalization therefore requires a suitable number of hidden nodes [31]. There is no formal theory for determining the number of hidden nodes; recommendations are based on previous experience and repeated experiments.

Larochelle et al. (2009) [32] state that the best-performing networks are those with the same number of nodes in each hidden layer. In our experiments we tested different numbers of nodes in the hidden layers and likewise found that unequal structures give results that are worse than, or at best as good as, structures with an equal number of nodes in every hidden layer. We therefore use the same number of nodes, such as 5, 15 and 20, in all hidden layers, and run experiments starting from a small network with one hidden layer, expanding the network layer by layer up to 5 hidden layers. After testing we found that the network with 4 hidden layers and 20 nodes in each hidden layer produced the best result. It was trained with learning rate 0.001 for 400 iterations and an L2 regularization parameter of 0.1.
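A minimal Scikit-learn sketch of this configuration is shown below; the hyperparameters mirror the ones just described, while the full script used for the experiments, with its own settings, is listed in Appendix A.

    from sklearn.neural_network import MLPClassifier

    clf = MLPClassifier(solver='adam',
                        hidden_layer_sizes=(20, 20, 20, 20),   # 4 hidden layers with 20 nodes each
                        learning_rate_init=0.001,
                        alpha=0.1,                             # L2 regularization parameter
                        max_iter=400,
                        random_state=1)
    # clf.fit(features_train, labels_train) would then train on the (optionally oversampled) training set.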

The result for prediction on the test set without oversampling is shown below:

To balance the data, oversampling was performed on the training set to increase the minority class until it matches the majority class in size. One approach, proposed in [33], is the Synthetic Minority Over-sampling Technique (SMOTE), which oversamples the minority class by creating "synthetic" examples. For each fraud sample x, synthetic examples are introduced along the line segments joining x to its k nearest fraud-sample neighbours. Based on the imbalance ratio of the data set we decide the amount of required over-sampling N; only N neighbours x̂ are then randomly chosen from the k nearest neighbours, and one sample is generated in the direction of each. Each chosen neighbour x̂ is combined with the original sample x to generate a new sample according to the following formula:

\[
x_{\text{new}} = x + \text{rand}(0, 1) \times (\hat{x} - x) \qquad (38)
\]

This creates a new sample along the line segment between x and its chosen neighbour x̂.
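For illustration only, the interpolation step in equation (38) can be written directly in numpy; this is a sketch of the idea, not the imblearn SMOTE implementation actually used in the experiments (see Appendix A).

    import numpy as np

    def smote_sample(x, x_neighbour):
        # One synthetic sample on the segment between x and a chosen fraud neighbour, as in (38).
        gap = np.random.uniform(0.0, 1.0)
        return x + gap * (x_neighbour - x)

    x = np.array([1.0, 2.0])
    x_hat = np.array([2.0, 3.0])
    print(smote_sample(x, x_hat))    # lies somewhere on the line segment between x and x_hat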

The prediction result with oversampling is given below:

We also found that another neural network, with 5 hidden layers and 20 nodes in each layer, obtains good results, given learning rate 0.1 and a maximum of 400 iterations. The result without oversampling is shown below:

The result with oversampling is shown below:

As seen from the two networks' results above, the results obtained with oversampling of the training set are better than those obtained without oversampling. We conclude that oversampling the imbalanced training set with SMOTE can improve the overall ability of the network to detect fraud transactions correctly. Comparing the two networks' results with oversampling, the second network obtains a higher recall on the test set, but it also classifies more normal transactions as fraud than the first network does.


Since the cost of processing fraud transactions is considerable, classifying normal transactions as fraud leads to unnecessary cost. We therefore choose the first neural network, with four hidden layers, for the following experiments.

After fixing the network topology we inspect how the learning rate affects the performance of the network. We found that L2 = 0.1 gave the best results with respect to recall; we therefore kept this value constant throughout the experiments. We tried different learning rates with a maximum of 80 iterations. The results are given below:

Figure 4.1 The impact of learning rates on results

In Scikit-learn, when the training loss does not improve by at least the tolerance for two consecutive epochs, the network is considered to have converged and training stops. This is why some curves in Figure 4.1 are shorter than others. As seen from the graph, larger learning rates increase the speed of convergence but cause instability of the network, while small learning rates lead to slow convergence but make the convergence path smoother. Furthermore, the loss is smaller when small learning rates are used.

5.3 TensorFlow implementation

The TensorFlow code was inspired by code published by a developer on www.kaggle.com [34]; I used part of the original code and modified it.

The dataset is split into a training set, a validation set and a test set. The training set is balanced by duplicating fraud transactions until the two classes are equal in number. The structure of the network is the same as in the Scikit-learn implementation: 4 hidden layers with 20 nodes in each layer. We choose the learning parameters by testing against the validation accuracy and found that the network converges at epoch = 200 and learning rate = 0.005 in this respect. Figure 4.2 shows the training accuracy, validation accuracy, training cost and validation cost. The average training accuracy is 0.99765, the average validation accuracy 0.99695, the average training cost 18226.50195 and the average validation cost 850.31555. We then predicted on the test set and obtained a recall of 86.89%.² Compared with the Scikit-learn implementation, both implementations obtain recall in the same range.

Figure 4.2 Results for training accuracy (blue), validation accuracy (green), training cost and validation cost obtained from the neural network, given epoch = 1000 and learning rate = 0.005

Next we try different learning rates to investigate their impact on the results, keeping epoch = 500 throughout the experiments. The results are shown in Figure 4.3 and Table 4.1.

Table 4.1 Average validation accuracy for different learning rates

Figure 4.3 Validation accuracy for different learning rates

As seen from Figure 4.3, a larger learning rate can speed up convergence; however, it easily overshoots the local minimum of the loss function, which leads to oscillations. This instability of the network has a negative impact on the accuracy. In practice, when approaching the optimal solution, we can use small learning rates, since they make the convergence path smoother and help ensure convergence to a local minimum.

² This is the result of one calibration. In fact, each run of the optimization routine may find a different local minimum and hence different weights, which leads to changes in the results.

Next we try different weight initializations to investigate how they affect the results. The weights should be random values that are small and close to zero, but not identically zero. In the previous experiments the weights were initialized using tf.truncated_normal, which returns values from a normal distribution with a specified mean and standard deviation, except that values further than two standard deviations from the mean are dropped and re-drawn.³ The purpose of using a truncated normal is to avoid saturating neurons that use the sigmoid activation function.

Another, similar option used in the experiments is tf.random_normal, which generates random values from a normal distribution without truncation. The weight vector of each neuron is initialized to a random vector drawn from a multivariate Gaussian distribution, so that all neurons start with small random weights in the input space. With this method we chose standard deviations of 0.01 and 1 for the experiments.
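The two initializers compared here can be written as follows; this is a sketch in the TensorFlow 1.x API, with placeholder layer sizes.

    import tensorflow as tf

    n_in, n_out = 30, 20
    # Truncated normal: draws further than two standard deviations from the mean are re-drawn.
    W_trunc = tf.Variable(tf.truncated_normal([n_in, n_out], stddev=0.01))
    # Plain normal: no truncation of the tails.
    W_norm = tf.Variable(tf.random_normal([n_in, n_out], stddev=1.0))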

Following the article by Xavier Glorot and Yoshua Bengio (2010), one popular method for initializing weights is the so-called normalized initialization:

\[
W \sim U\!\left(-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right) \qquad (39)
\]

where n_j is the number of neurons in the jth layer. This initialization is derived from the requirement that the variances of the activations and of the back-propagated gradients be kept approximately constant across layers [35].
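A minimal numpy sketch of equation (39), where n_j and n_{j+1} are the fan-in and fan-out of a layer:

    import numpy as np

    def normalized_init(n_j, n_j1):
        # Uniform draw on (-sqrt(6)/sqrt(n_j + n_j1), sqrt(6)/sqrt(n_j + n_j1)), as in equation (39).
        limit = np.sqrt(6.0) / np.sqrt(n_j + n_j1)
        return np.random.uniform(-limit, limit, size=(n_j, n_j1))

    W = normalized_init(30, 20)    # e.g. a layer with 30 inputs and 20 outputs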

The validation accuracy obtained with the different weight initializations is shown below:

Figure 4.4 Results with different weight initializations, given learning rate = 0.005.

As seen from Figure 4.4, good initialization methods usually serve only to speed up learning, that is, the speed of convergence.

³ An open-source software library for Machine Intelligence: TensorFlow, 2015. https://www.tensorflow.org


6 Conclusion

In this thesis we discussed the theory of neural network structure and the back-propagation method for learning optimal parameters. We then introduced the machine learning software TensorFlow and its working mechanism. Next we applied neural networks in experiments on the credit card fraud data set and compared different models based on the evaluation metric recall. We chose logistic regression as the benchmark model; it yields 96.04% recall, a better result on the test set than the neural networks. We then tested neural networks with different numbers of hidden layers. The implementations in Scikit-learn and TensorFlow give consistent results. The experiments demonstrated that increasing the number of hidden layers does not improve the classification performance significantly: a network with one hidden layer of several neurons can obtain predictive results similar to a network with several hidden layers. Furthermore, the results show that re-sampling the imbalanced training set can increase the performance of the network on the test set.


A Scikit-learn code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing

data = pd.read_csv('creditcard.csv')

columns = data.columns
# The labels are in the last column ('Class'). Simply remove it to obtain the feature columns.
features_columns = columns.delete(len(columns) - 1)

features = data[features_columns]
labels = data['Class']

X_scaled = preprocessing.scale(features)
features_train, features_test, labels_train, labels_test = train_test_split(
    X_scaled, labels, test_size=0.2, random_state=0)

oversampler = SMOTE(random_state=0)
os_features, os_labels = oversampler.fit_sample(features_train, labels_train)

clf = MLPClassifier(solver='adam', alpha=1,
                    hidden_layer_sizes=(20, 20, 20, 20),
                    learning_rate_init=0.05,
                    verbose=10, max_iter=400, random_state=1)

# clf.fit(features_train, labels_train)
clf.fit(os_features, os_labels)

# print("Training set score: %f" % clf.score(features_train, labels_train))
print("Training set score: %f" % clf.score(os_features, os_labels))

y_pred = clf.predict(features_test)
c = confusion_matrix(labels_test, y_pred)
print(c)

print('Accuracy: ' + str(np.round(100 * float((c[0][0] + c[1][1])) /
      float((c[0][0] + c[1][1] + c[1][0] + c[0][1])), 2)) + '%')
print('Recall: ' + str(np.round(100 * float((c[1][1])) /
      float((c[1][0] + c[1][1])), 2)) + '%')

B TensorFlow code

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.cross_validation import train_test_split
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import seaborn as sns
import matplotlib.gridspec as gridspec
import math
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score

data = pd.read_csv("./creditcard.csv")

# Create a label to mark cases for normal (non-fraudulent) transactions.
data.loc[data.Class == 0, 'Normal'] = 1
data.loc[data.Class == 1, 'Normal'] = 0
# Rename 'Class' to 'Fraud'.
data = data.rename(columns={'Class': 'Fraud'})
# Create dataframes of only Fraud and Normal transactions.
Fraud = data[data.Fraud == 1]
Normal = data[data.Normal == 1]

# Set X_train equal to 80% of the fraudulent transactions.
X_train = Fraud.sample(frac=0.8)
count_Frauds = len(X_train)
# Add 80% of the normal transactions to X_train.
X_train = pd.concat([X_train, Normal.sample(frac=0.8)], axis=0)
# X_test contains all the transactions not in X_train.
X_test = data.loc[~data.index.isin(X_train.index)]

# Shuffle the dataframes so that the network is trained in a random order.
X_train = shuffle(X_train)
X_test = shuffle(X_test)

# Add our target values to y_train and y_test.
y_train = X_train.Fraud
y_train = pd.concat([y_train, X_train.Normal], axis=1)
y_test = X_test.Fraud
y_test = pd.concat([y_test, X_test.Normal], axis=1)

# Delete target values from X_train and X_test.
X_train = X_train.drop(['Fraud', 'Normal'], axis=1)
X_test = X_test.drop(['Fraud', 'Normal'], axis=1)

# The dataset needs to be balanced. The ratio is defined by dividing the number of
# transactions by the number of fraud transactions. Thus the number of fraud
# transactions multiplied with the ratio will equal the number of normal transactions.
ratio = len(X_train) / count_Frauds
y_train.Fraud *= ratio

# Names of all of the features in X_train.
features = X_train.columns.values

# Transform each feature in feature space so that it has a mean of 0 and a standard
# deviation of 1; this helps with training the neural network.
for feature in features:
    mean, std = data[feature].mean(), data[feature].std()
    X_train.loc[:, feature] = (X_train[feature] - mean) / std
    X_test.loc[:, feature] = (X_test[feature] - mean) / std

# Train the neural network. Split the testing data into validation and testing sets.
split = int(len(y_test) / 2)
inputX = X_train.as_matrix()
inputY = y_train.as_matrix()
inputX_valid = X_test.as_matrix()[:split]
inputY_valid = y_test.as_matrix()[:split]
inputX_test = X_test.as_matrix()[split:]
inputY_test = y_test.as_matrix()[split:]

# Number of input nodes.
n_input = X_train.shape[1]
# Number of nodes in each hidden layer.
n_hidden_1 = 20
n_hidden_2 = 20
n_hidden_3 = 20
n_hidden_4 = 20

# input
x = tf.placeholder(tf.float32, [None, n_input])

# layer 1
W1 = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.01))
b1 = tf.Variable(tf.zeros([n_hidden_1]))
y1 = tf.nn.sigmoid(tf.matmul(x, W1) + b1)

# layer 2
W2 = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.01))
b2 = tf.Variable(tf.zeros([n_hidden_2]))
y2 = tf.nn.sigmoid(tf.matmul(y1, W2) + b2)

# layer 3
W3 = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.01))
b3 = tf.Variable(tf.zeros([n_hidden_3]))
y3 = tf.nn.sigmoid(tf.matmul(y2, W3) + b3)

# layer 4
W4 = tf.Variable(tf.truncated_normal([n_hidden_3, n_hidden_4], stddev=0.01))
b4 = tf.Variable(tf.zeros([n_hidden_4]))
y4 = tf.nn.sigmoid(tf.matmul(y3, W4) + b4)

# output layer
W5 = tf.Variable(tf.truncated_normal([n_hidden_4, 2], stddev=0.01))
b5 = tf.Variable(tf.zeros([2]))
y5 = tf.nn.softmax(tf.matmul(y4, W5) + b5)

# output and target values
y = y5
target = tf.placeholder(tf.float32, [None, 2])

# Parameters of the model
training_epochs = 1000
display_step = 10
n_samples = y_train.shape[0]
batch_size = 2048
learning_rate = 0.005
total_batch = int(n_samples / batch_size)

# Compute the cross-entropy cost function
cost = tf.reduce_mean(-tf.reduce_sum(target * tf.log(y)))

# Update the weights of the model via AdamOptimizer
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

# Check if the prediction from the output layer matches the target label.
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(target, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

accuracy_summary = []        # Record accuracy values for plot
cost_summary = []            # Record cost values for plot
valid_accuracy_summary = []
valid_cost_summary = []

# Initialize the variables
init = tf.global_variables_initializer()

# Execute the operations in the graph
with tf.Session() as sess:
    sess.run(init)
    # Training the network
    for epoch in range(training_epochs):
        for i in range(total_batch):
            sess.run([optimizer], feed_dict={
                x: inputX[i * batch_size: (1 + i) * batch_size],
                target: inputY[i * batch_size: (1 + i) * batch_size]})
        # Display logs after every 10 epochs
        if epoch % display_step == 0:
            train_accuracy, newCost = sess.run(
                [accuracy, cost], feed_dict={x: inputX, target: inputY})
            valid_accuracy, valid_newCost = sess.run(
                [accuracy, cost], feed_dict={x: inputX_valid, target: inputY_valid})
            print(epoch,
                  "{:.5f}".format(train_accuracy),
                  "{:.5f}".format(newCost),
                  "{:.5f}".format(valid_accuracy),
                  "{:.5f}".format(valid_newCost))
            # Record the results of the model
            accuracy_summary.append(train_accuracy)
            cost_summary.append(newCost)
            valid_accuracy_summary.append(valid_accuracy)
            valid_cost_summary.append(valid_newCost)

    # Obtain accuracy and recall on the test set
    y_p = tf.argmax(y, 1)
    test_accuracy, y_pred = sess.run([accuracy, y_p],
                                     feed_dict={x: inputX_test, target: inputY_test})
    y_true = np.argmax(inputY_test, 1)

    c = confusion_matrix(y_true, y_pred)

    print('Test accuracy: ' + str(np.round(100 * float((c[0][0] + c[1][1])) /
          float((c[0][0] + c[1][1] + c[1][0] + c[0][1])), 2)) + '%')
    print('Recall: ' + str(np.round(100 * float((c[0][0])) /
          float((c[0][0] + c[0][1])), 2)) + '%')

f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 4))

ax1.plot(accuracy_summary)        # blue
ax1.plot(valid_accuracy_summary)  # green
ax1.set_title('Accuracy')

ax2.plot(cost_summary)
ax2.plot(valid_cost_summary)
ax2.set_title('Cost')

plt.xlabel('Epochs (x10)')
plt.show()


References

[1] C. M. Bishop, Neural networks for pattern recognition. Oxford University Press, 1995.

[2] A. Dongare, R. Kharde, and A. D. Kachare, "Introduction to artificial neural network,"

[3] J. Jiang, P. Trundle, and J. Ren, "Medical image analysis with artificial neural networks," Computerized Medical Imaging and Graphics, vol. 34, no. 8, pp. 617–631, 2010.

[4] S. S. Haykin, Neural networks and learning machines. Upper Saddle River, NJ: Pearson Education, third ed., 2009.

[5] J. Stiles and T. L. Jernigan, "The basics of brain development," Neuropsychology Review, vol. 20, no. 4, pp. 327–348, 2010.

[6] A. J. Izenman, Modern multivariate statistical techniques, vol. 1. Springer, 2008.

[7] H. Paugam-Moisy and S. Bohte, "Computing with spiking neuron networks," in Handbook of Natural Computing, pp. 335–376, Springer, 2012.

[8] Y. Chalich and J. Li, "The essential role of nonlinearity in neural networks," 2017.

[9] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (J. Fürnkranz and T. Joachims, eds.), pp. 807–814, Omnipress, 2010.

[10] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee, "Understanding deep neural networks with rectified linear units," CoRR, vol. abs/1611.01491, 2016.

[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[12] D. R. Musicant, J. M. Christensen, and J. F. Olson, "Supervised learning by training on aggregate outputs," in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pp. 252–261, IEEE, 2007.

[13] T. M. Mitchell, Machine Learning. New York, NY, USA: McGraw-Hill, Inc., 1 ed., 1997.

[14] S. Ruder, "An overview of gradient descent optimization algorithms," CoRR, vol. abs/1609.04747, 2016.

[15] R. B. Rao, G. Fung, and R. Rosales, "On the dangers of cross-validation. An experimental evaluation," in Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 588–596, SIAM, 2008.

[16] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[17] G. N. Karystinos and D. A. Pados, "On overfitting, generalization, and randomly expanded training sets," IEEE Transactions on Neural Networks, vol. 11, pp. 1050–1057, Sep 2000.

[18] E. Phaisangittisagul, "An analysis of the regularization between l2 and dropout in single hidden layer neural network," in Intelligent Systems, Modelling and Simulation (ISMS), 2016 7th International Conference on, pp. 174–179, IEEE, 2016.

[19] S. J. Nowlan and G. E. Hinton, "Simplifying neural networks by soft weight-sharing," Neural Computation, vol. 4, no. 4, pp. 473–493, 1992.

[20] P. Goldsborough, "A tour of TensorFlow," arXiv preprint arXiv:1610.01178, 2016.

[21] R. Collobert, S. Bengio, and J. Marithoz, "Torch: A modular machine learning software library," 2002.

[22] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU math compiler in Python," 2010.

[23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional Architecture for Fast Feature Embedding," ArXiv e-prints, June 2014.

[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[25] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2015.

[26] A. Ben-Hur and J. Weston, "A user's guide to support vector machines," Data Mining Techniques for the Life Sciences, pp. 223–239, 2010.

[27] K. W. Bowyer, N. V. Chawla, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," CoRR, vol. abs/1106.1813, 2011.

[28] T. Y. K. Beh, S. C. Tan, and H. T. Yeo, "Building classification models from imbalanced fraud detection data,"

[29] S. L. Phung, A. Bouzerdoum, and G. H. Nguyen, "Learning pattern classification tasks with imbalanced data sets," 2009.

[30] S. Dreiseitl and L. Ohno-Machado, "Logistic regression and artificial neural network classification models: a methodology review," Journal of Biomedical Informatics, vol. 35, no. 5, pp. 352–359, 2002.

[31] K. G. Sheela and S. N. Deepa, "Review on methods to fix number of hidden neurons in neural networks," Mathematical Problems in Engineering, vol. 2013, 2013.

[32] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," Journal of Machine Learning Research, vol. 10, no. Jan, pp. 1–40, 2009.

[33] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[34] Currie32, "TensorFlow code for credit card fraud detection, https://www.kaggle.com," 2017.

[35] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
