TRANSCRIPT
Artificial Neural Networks
[Figure: a neural network diagram with inputs x1, x2, x3.]
What are they?
Inspired by the human brain. The human brain has about 86 billion neurons and requires about 20% of your body's energy to function. These neurons are connected by between 100 trillion and 1 quadrillion synapses!
[Figure: a human and a machine, each mapping an Input to an Output.]
What are they?
1. Originally developed by Warren McCulloch and Walter Pitts (1944).
2. First trainable network, the perceptron, proposed by Frank Rosenblatt (1957).
3. Started off as an unsupervised learning tool.
   1. Had problems with computing time and could not compute XOR.
   2. Was abandoned in favor of other algorithms.
4. Werbos's (1975) backpropagation algorithm.
   1. Incorporated supervision and solved XOR.
   2. But was still too slow compared with other algorithms, e.g., Support Vector Machines.
5. Backpropagation was accelerated by GPUs in 2010 and shown to be more efficient and cost-effective.
GPUs
GPUs handle parallel operations much better (thousands of concurrent threads), even though their individual cores are not as fast as a CPU's. The matrix multiplications in ANNs can be run in parallel, resulting in considerable time and cost savings. The best CPUs handle about 50 GB/s of memory bandwidth, while the best GPUs handle about 750 GB/s.
                    CPU (Core i9 X-series)   GPU (GeForce GTX 1080)
Cores               18 (36 threads)          2560
Clock speed (GHz)   4.4                      1.6
Memory              shared                   8 GB
Price ($)           1799                     549
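The reason the GPU wins here is that a layer's forward pass is a single matrix multiplication, exactly the workload a GPU spreads across its threads. A minimal NumPy sketch (the shapes and sizes below are illustrative, not from the slides):

```python
import numpy as np

# A dense layer's forward pass is one matrix multiply plus a bias add:
# every output neuron's weighted sum is independent of the others, which
# is exactly the kind of work a GPU parallelizes across thousands of threads.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 1000))   # batch of 64 inputs, 1000 features each
W = rng.standard_normal((1000, 500))  # weights for a 500-neuron layer
b = np.zeros(500)                     # biases

Z = X @ W + b                         # all 64 x 500 weighted sums at once
print(Z.shape)                        # (64, 500)
```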
Neuron
[Figure: a single neuron with inputs x0 (the bias input), x1, x2 and weights w0 = −10, w1 = 20, w2 = 20.]
The neuron first computes a weighted sum of its inputs,
z = x0w0 + x1w1 + x2w2 = Σ_{i=0}^{2} wixi,
then applies an activation function g to produce its output:
y = g(z) = g(w0x0 + w1x1 + w2x2).
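A sketch of this computation in Python (the function name `neuron` is ours). With the slide's weights and x0 fixed at 1, the output happens to approximate a logical OR of x1 and x2:

```python
import math

def neuron(x, w):
    """Weighted sum z = sum(w_i * x_i), followed by a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Slide weights: w0 = -10 (bias weight), w1 = 20, w2 = 20.
w = [-10, 20, 20]
for x1 in (0, 1):
    for x2 in (0, 1):
        y = neuron([1, x1, x2], w)   # x0 = 1 is the bias input
        print(x1, x2, round(y, 4))   # ~0 only when both inputs are 0
```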
Activation function
Here the activation function g is the sigmoid:
g(z) = 1 / (1 + e^(−z)), with z = w0x0 + w1x1 + w2x2.
Activation Functions
■ Sigmoid: g(z) = 1 / (1 + e^(−z))
■ ReLU: r(z) = max(0, z)
■ Hyperbolic tangent: tanh(z) = 2 / (1 + e^(−2z)) − 1
■ Leaky ReLU: lr(z) = αz for z < 0; z for z ≥ 0
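The four definitions transcribe directly into Python; a minimal sketch (the leaky-ReLU slope α is a free parameter, and 0.01 below is just a common illustrative choice):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):                      # 2 / (1 + e^(-2z)) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * z)) - 1.0

def relu(z):                      # max(0, z)
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):    # alpha*z below zero, z at or above zero
    return alpha * z if z < 0 else z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```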
Neural Network
[Figure: a fully connected network mapping Input to Output through an input layer (x1, x2, x3), hidden layers, and an output layer; b0 and b1 are bias units.]
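In code, such a network is just layers applied in sequence; a minimal NumPy sketch with one hidden layer (the sizes and random weights are illustrative, and treating b0 and b1 as the two layers' bias vectors is our reading of the figure):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.standard_normal(3)        # input layer: x1, x2, x3

W1 = rng.standard_normal((4, 3))  # input -> hidden (4 hidden units)
b1 = rng.standard_normal(4)       # hidden-layer biases (the b0 unit)
W2 = rng.standard_normal((1, 4))  # hidden -> output
b2 = rng.standard_normal(1)       # output bias (the b1 unit)

h = sigmoid(W1 @ x + b1)          # hidden-layer activations
out = sigmoid(W2 @ h + b2)        # network output
print(out)
```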
XOR Gate

X  Y  |  out
0  0  |  0
0  1  |  1
1  0  |  1
1  1  |  0

[Figure: a 2-2-1 network for XOR. Inputs x and y feed the hidden units h0 = g(z0) and h1 = g(z1) through weights w0–w3; h0 and h1 feed the output unit h2 = g(z2) through w4 and w5. z denotes the predicted output and za the actual (target) output.]
XOR Gate
The same network with initial weights
w0 = 0.4, w1 = 0.6, w2 = 0.3, w3 = 0.8, w4 = 0.7, w5 = 0.1,
and output = h2 = z.
Feedforward propagation
For the input x = 1, y = 1, with activation function g(z) = 1 / (1 + e^(−z)):

z0 = w0x + w2y = 0.4 × 1 + 0.3 × 1 = 0.7
z1 = w1x + w3y = 0.6 × 1 + 0.8 × 1 = 1.4
h0 = g(z0) = 1 / (1 + e^(−0.7)) = 0.67
h1 = g(z1) = 1 / (1 + e^(−1.4)) = 0.80
z2 = w4h0 + w5h1 = 0.7 × 0.67 + 0.1 × 0.80 = 0.55
h2 = g(z2) = 1 / (1 + e^(−0.55)) = 0.63

The network's prediction is h2 = z = 0.63.
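A short sketch reproducing this forward pass (variable names follow the slide):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# The slide's weights, applied to the input (x, y) = (1, 1)
w0, w1, w2, w3, w4, w5 = 0.4, 0.6, 0.3, 0.8, 0.7, 0.1
x, y = 1, 1

z0 = w0 * x + w2 * y      # 0.7
z1 = w1 * x + w3 * y      # 1.4
h0 = g(z0)                # 0.67
h1 = g(z1)                # 0.80
z2 = w4 * h0 + w5 * h1    # 0.55
h2 = g(z2)                # 0.63 -- the network's prediction
print(round(z0, 2), round(z1, 2), round(h0, 2),
      round(h1, 2), round(z2, 2), round(h2, 2))
```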
Feedforward propagation
For the input (x, y) = (1, 1) the XOR target is za = 0, so the error of the prediction z = 0.63 is
Mean Squared Error, E = ½(z − za)² = ½(h2 − za)² = ½(0.63 − 0)² = 0.20
Backpropagation
The gradient of the error with respect to each weight follows from the chain rule. For the output-layer weight w4:

∂E/∂w4 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂w4 = 0.63 × 0.233 × 0.67 = 0.098

with the three factors

∂E/∂h2 = ∂/∂h2 [½(h2 − za)²] = h2 − za = 0.63
∂h2/∂z2 = ∂/∂z2 g(z2) = g(z2)(1 − g(z2)) = h2(1 − h2) = 0.63 × 0.37 = 0.233
∂z2/∂w4 = ∂/∂w4 [w4h0 + w5h1] = h0 = 0.67

using the sigmoid derivative g′(z) = ∂/∂z [1 / (1 + e^(−z))] = g(z)[1 − g(z)].
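One way to sanity-check the chain-rule value is a finite-difference estimate: nudge w4 slightly and watch how E changes. A sketch (the `error` helper is ours, not from the slides):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.4, 0.6, 0.3, 0.8, 0.7, 0.1]   # [w0, w1, w2, w3, w4, w5]

def error(w, x=1.0, y=1.0, za=0.0):
    h0 = g(w[0] * x + w[2] * y)
    h1 = g(w[1] * x + w[3] * y)
    h2 = g(w[4] * h0 + w[5] * h1)
    return 0.5 * (h2 - za) ** 2

# Chain rule: dE/dw4 = (h2 - za) * h2*(1 - h2) * h0, with za = 0
h0 = g(w[0] + w[2])
h1 = g(w[1] + w[3])
h2 = g(w[4] * h0 + w[5] * h1)
analytic = h2 * h2 * (1 - h2) * h0
print(round(analytic, 3))                        # 0.098

# Finite-difference estimate of the same gradient
eps = 1e-6
wp = list(w); wp[4] += eps
print(round((error(wp) - error(w)) / eps, 3))    # 0.098
```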
Backpropagation
For w5 only the last factor changes:

∂E/∂w5 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂w5 = 0.63 × 0.233 × 0.80 = 0.117

∂z2/∂w5 = ∂/∂w5 [w4h0 + w5h1] = h1 = 0.80
Backpropagation
For a hidden-layer weight such as w0, the chain runs one layer deeper: w0 affects h0, and h0 affects the error only through h2:

∂E/∂w0 = ∂E/∂h0 · ∂h0/∂z0 · ∂z0/∂w0 = 0.103 × 0.221 × 1 = 0.023

∂E/∂h0 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂h0 = (h2 − za) × h2(1 − h2) × w4 = 0.63 × 0.233 × 0.7 = 0.103
∂h0/∂z0 = ∂/∂z0 g(z0) = g(z0)(1 − g(z0)) = h0(1 − h0) = 0.67 × 0.33 = 0.221
∂z0/∂w0 = ∂/∂w0 [w0x + w2y] = x = 1
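The same finite-difference check verifies this deeper chain; a self-contained sketch for ∂E/∂w0 (again, `error` is our helper):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.4, 0.6, 0.3, 0.8, 0.7, 0.1]   # [w0, w1, w2, w3, w4, w5]

def error(w, x=1.0, y=1.0, za=0.0):
    h0 = g(w[0] * x + w[2] * y)
    h1 = g(w[1] * x + w[3] * y)
    h2 = g(w[4] * h0 + w[5] * h1)
    return 0.5 * (h2 - za) ** 2

# Chain through the hidden unit: dE/dw0 = dE/dh0 * dh0/dz0 * dz0/dw0
h0 = g(w[0] + w[2])
h1 = g(w[1] + w[3])
h2 = g(w[4] * h0 + w[5] * h1)
dE_dh0 = h2 * h2 * (1 - h2) * w[4]       # (h2 - za) * h2(1-h2) * w4, za = 0
analytic = dE_dh0 * h0 * (1 - h0) * 1.0  # dh0/dz0 = h0(1-h0); dz0/dw0 = x = 1
print(round(analytic, 3))                        # 0.023

eps = 1e-6
wp = list(w); wp[0] += eps
print(round((error(wp) - error(w)) / eps, 3))    # 0.023
```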
Backpropagation
For w2 the chain is the same except for the last factor:

∂E/∂w2 = ∂E/∂h0 · ∂h0/∂z0 · ∂z0/∂w2 = 0.103 × 0.221 × 1 = 0.023

∂z0/∂w2 = ∂/∂w2 [w0x + w2y] = y = 1
Backpropagation
For w1 the chain runs through the other hidden unit, h1:

∂E/∂w1 = ∂E/∂h1 · ∂h1/∂z1 · ∂z1/∂w1 = 0.015 × 0.16 × 1 = 0.0024

∂E/∂h1 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂h1 = (h2 − za) × h2(1 − h2) × w5 = 0.63 × 0.233 × 0.1 = 0.015
∂h1/∂z1 = ∂/∂z1 g(z1) = g(z1)(1 − g(z1)) = h1(1 − h1) = 0.80 × 0.20 = 0.16
∂z1/∂w1 = ∂/∂w1 [w1x + w3y] = x = 1
Backpropagation
And for w3:

∂E/∂w3 = ∂E/∂h1 · ∂h1/∂z1 · ∂z1/∂w3 = 0.015 × 0.16 × 1 = 0.0024

∂z1/∂w3 = ∂/∂w3 [w1x + w3y] = y = 1
Gradient descent weight updates
Collecting the six gradients from the forward pass and backpropagation above:

∂E/∂w4 = 0.098;  ∂E/∂w5 = 0.117;
∂E/∂w0 = ∂E/∂w2 = 0.023;  ∂E/∂w1 = ∂E/∂w3 = 0.0024

With learning rate α = 0.01, every weight takes a small step against its gradient:

w4 = w4 − α ∂E/∂w4;  w5 = w5 − α ∂E/∂w5;  w0 = w0 − α ∂E/∂w0;
w1 = w1 − α ∂E/∂w1;  w2 = w2 − α ∂E/∂w2;  w3 = w3 − α ∂E/∂w3

For example, w4 = 0.7 − 0.01 × 0.098 ≈ 0.699.
[Figure: the training cycle: feedforward → backprop → gradient descent weight update.]
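Putting the three stages together, a sketch of the full training loop over all four XOR rows, using the slide's initial weights and α = 0.01. Note that this minimal network has no bias terms, so it cannot fully represent XOR; the sketch demonstrates the mechanics of the cycle rather than a finished solution:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR training data: (x, y) -> target za
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w = [0.4, 0.6, 0.3, 0.8, 0.7, 0.1]   # [w0..w5], the slide's initial weights
alpha = 0.01                          # learning rate

for step in range(10000):
    total_error = 0.0
    for (x, y), za in data:
        # Feedforward
        h0 = g(w[0] * x + w[2] * y)
        h1 = g(w[1] * x + w[3] * y)
        h2 = g(w[4] * h0 + w[5] * h1)
        total_error += 0.5 * (h2 - za) ** 2

        # Backpropagation (the chain rule, exactly as on the slides)
        delta2 = (h2 - za) * h2 * (1 - h2)        # dE/dz2
        delta0 = delta2 * w[4] * h0 * (1 - h0)    # dE/dz0
        delta1 = delta2 * w[5] * h1 * (1 - h1)    # dE/dz1
        grads = [delta0 * x, delta1 * x,          # dE/dw0, dE/dw1
                 delta0 * y, delta1 * y,          # dE/dw2, dE/dw3
                 delta2 * h0, delta2 * h1]        # dE/dw4, dE/dw5

        # Gradient descent weight update: w_i <- w_i - alpha * dE/dw_i
        w = [wi - alpha * gi for wi, gi in zip(w, grads)]

    if step % 2000 == 0:
        print(step, round(total_error, 4))        # error shrinks, then plateaus
```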