artificial neural networks 1 morten nielsen department of systems biology, dtu · 2014. 6. 17. ·...

Artificial Neural Networks 1

Morten Nielsen Department of Systems Biology,

DTU

Biological Neural network

Biological neuron structure

Artificial neural networks. Background

Higher order sequence correlations

•  Neural networks can learn higher order correlations!
   –  What does this mean?

Say that the peptide needs one and only one large amino acid in positions P3 and P4 to fill the binding cleft. How would you formulate this to test if a peptide can bind?

P3  P4      Binds
S   S   =>  0
L   S   =>  1
S   L   =>  1
L   L   =>  0

(S = small amino acid, L = large amino acid)

=> XOR function
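The binding rule above can be written as a small Python sketch. The boolean flags are an assumption made for illustration; the slides do not say which amino acids count as large.

```python
def can_bind(p3_is_large: bool, p4_is_large: bool) -> int:
    """Return 1 if exactly one of P3 and P4 holds a large amino acid, else 0."""
    return int(p3_is_large != p4_is_large)


# Reproduce the four cases of the table (S = small, L = large).
for p3, p4 in [(False, False), (True, False), (False, True), (True, True)]:
    label = ("L" if p3 else "S") + " " + ("L" if p4 else "S")
    print(label, "=>", can_bind(p3, p4))
```

The inequality test is exactly the XOR of the two flags.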

Neural networks

•  Neural networks can learn higher order correlations

XOR function:

0 0 => 0
1 0 => 1
0 1 => 1
1 1 => 0

[Figure: the four input points (0,0), (1,0), (0,1) and (1,1); panels labelled OR, AND and XOR.]

No linear function can separate the points

Error estimates

Input   XOR   Predict   Error
0 0     0     0         0
1 0     1     1         0
0 1     1     1         0
1 1     0     1         1

[Figure: the four input points (0,0), (1,0), (0,1) and (1,1).]

Mean error: 1/4
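The error estimate can be reproduced with a short Python sketch. The threshold used to turn the linear output into a 0/1 call, and the grid of candidate weights, are assumptions added for illustration.

```python
from itertools import product

XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}


def predict(x, v1, v2, threshold):
    """Linear function x1*v1 + x2*v2, turned into a 0/1 call by a threshold."""
    return int(x[0] * v1 + x[1] * v2 > threshold)


def mean_error(v1, v2, threshold):
    """Fraction of the four XOR points that are predicted wrongly."""
    wrong = sum(predict(x, v1, v2, threshold) != t for x, t in XOR.items())
    return wrong / len(XOR)


# An OR-like linear rule predicts 0 1 1 1 and misses only the point (1, 1).
or_like = mean_error(1.0, 1.0, 0.5)

# Brute-force search over a coarse grid of weights and thresholds.
grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
best = min(mean_error(v1, v2, th) for v1, v2, th in product(grid, repeat=3))

print(or_like, best)
```

On this grid no linear rule does better than one error out of four, in line with the mean error of 1/4 above.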

Linear methods and the XOR function

[Figure: two inputs connected to one output through the weights v1 and v2.]

Linear function

$$y = x_1 \cdot v_1 + x_2 \cdot v_2$$
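A short argument shows why this linear function cannot reproduce XOR. The threshold $\theta$ is an assumption introduced here: the output is called 1 when $y > \theta$ and 0 otherwise. The four XOR cases then require:

```latex
\begin{aligned}
(0,0) \to 0 &: \quad 0 \le \theta \\
(1,0) \to 1 &: \quad v_1 > \theta \\
(0,1) \to 1 &: \quad v_2 > \theta \\
(1,1) \to 0 &: \quad v_1 + v_2 \le \theta
\end{aligned}
```

Adding the second and third conditions gives $v_1 + v_2 > 2\theta \ge \theta$ (because $\theta \ge 0$ by the first condition), which contradicts the fourth. No choice of $v_1$, $v_2$ and $\theta$ satisfies all four.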

Neural networks with a hidden layer

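[Figure: network with two inputs plus a constant input of 1 (bias). Input-to-hidden weights w11, w12, w21, w22 with bias weights wt1 and wt2; hidden-to-output weights v1 and v2 with bias weight vt.]

$$O = \frac{1}{1 + \exp(-o)}$$

$$o = \sum_{i=1}^{N} x_i \cdot w_i + t = \sum_{i=1}^{N+1} x_i \cdot w_i, \qquad x_{N+1} = 1$$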
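As a minimal sketch, the two equivalent ways of writing the neuron (explicit threshold t, or an extra input fixed to 1) look like this in Python. The function names and the numbers are illustrative only.

```python
import math


def neuron_output(x, w, t):
    """Sigmoid unit: O = 1 / (1 + exp(-o)) with o = sum_i x_i * w_i + t."""
    o = sum(xi * wi for xi, wi in zip(x, w)) + t
    return 1.0 / (1.0 + math.exp(-o))


def neuron_output_bias_input(x, w_with_bias):
    """Same unit, with t folded in as the weight on an extra input fixed to 1."""
    x_ext = list(x) + [1.0]
    o = sum(xi * wi for xi, wi in zip(x_ext, w_with_bias))
    return 1.0 / (1.0 + math.exp(-o))


# Arbitrary example values: both forms give the same output.
a = neuron_output([1.0, 0.0], [4.0, 4.0], -6.0)
b = neuron_output_bias_input([1.0, 0.0], [4.0, 4.0, -6.0])
print(a, b)
```

Folding the bias into the weight vector means the training rule does not need a special case for t.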

How does it work? Ex. Input is (0 0)

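[Figure: the network evaluated for the input (0 0), with a constant input of 1 (bias) to the hidden layer and to the output. Weight labels in the figure: 6, -9, 4, 6, 9, -2, -6, 4, -4.5.]

o1 = -6, O1 = 0
o2 = -2, O2 = 0
y1 = -4.5, Y1 = 0

$$o = \sum_i x_i \cdot w_i$$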
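The worked example can be checked for all four inputs with a small Python sketch. Which weight belongs to which connection is an assumption: the assignment below is one reading of the figure labels that reproduces o1 = -6, o2 = -2 and y1 = -4.5 for the input (0 0).

```python
import math


def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumed weight assignment (last entry of each tuple is the bias weight).
HIDDEN_1 = (4.0, 4.0, -6.0)
HIDDEN_2 = (6.0, 6.0, -2.0)
OUTPUT = (-9.0, 9.0, -4.5)


def forward(x1, x2):
    """Return the hidden-layer sums o1, o2 and the network output Y."""
    o1 = x1 * HIDDEN_1[0] + x2 * HIDDEN_1[1] + HIDDEN_1[2]
    o2 = x1 * HIDDEN_2[0] + x2 * HIDDEN_2[1] + HIDDEN_2[2]
    y = g(o1) * OUTPUT[0] + g(o2) * OUTPUT[1] + OUTPUT[2]
    return o1, o2, g(y)


for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    o1, o2, Y = forward(x1, x2)
    print((x1, x2), "o1 =", o1, "o2 =", o2, "Y rounds to", round(Y))
```

The slide rounds O1 and O2 to 0 before computing y1; the sketch keeps the unrounded sigmoid values, which changes y1 slightly but not the rounded output.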

Gradient descent (from Wikipedia)

Gradient descent is based on the observation that if the real-valued function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of F at a. It follows that if

$$b = a - \varepsilon \cdot \nabla F(a)$$

for a small enough number ε > 0, then F(b) ≤ F(a).
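A minimal Python sketch of this update rule, applied to an illustrative one-dimensional function that is not taken from the slides:

```python
def gradient_descent(grad, a, eps=0.1, steps=100):
    """Repeat the update b = a - eps * grad(a) a fixed number of times."""
    for _ in range(steps):
        a = a - eps * grad(a)
    return a


def F(x):
    """Illustrative function with its minimum at x = 3."""
    return (x - 3.0) ** 2


def grad_F(x):
    """Gradient of F."""
    return 2.0 * (x - 3.0)


a = 0.0
b = a - 0.1 * grad_F(a)          # a single step
x_min = gradient_descent(grad_F, a)

print(F(a), F(b), x_min)
```

A single step already lowers F, and repeated steps approach the minimum. With a step size ε that is too large, the decrease is no longer guaranteed.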

Gradient descent. Example

Weights are changed in the opposite direction of the gradient of the error

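$$w_i' = w_i + \Delta w_i$$

$$E = \frac{1}{2} \cdot (O - t)^2$$

$$O = \sum_i w_i \cdot I_i$$

$$\Delta w_i = -\varepsilon \cdot \frac{\partial E}{\partial w_i} = -\varepsilon \cdot \frac{\partial E}{\partial O} \cdot \frac{\partial O}{\partial w_i} = -\varepsilon \cdot (O - t) \cdot I_i$$

[Figure: inputs I1 and I2 connected to the output O through the weights w1 and w2.]

Linear function

$$O = I_1 \cdot w_1 + I_2 \cdot w_2$$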
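The update rule for the linear function can be sketched in Python as follows. The training samples are hypothetical: the targets were generated from t = 2*I1 - 1*I2, so the weights should approach (2, -1).

```python
def train_linear(samples, eps=0.1, epochs=200):
    """Online gradient descent for O = I1*w1 + I2*w2 with E = 1/2 * (O - t)^2."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for inputs, t in samples:
            O = sum(wi * Ii for wi, Ii in zip(w, inputs))
            # Delta w_i = -eps * (O - t) * I_i
            w = [wi - eps * (O - t) * Ii for wi, Ii in zip(w, inputs)]
    return w


# Hypothetical training data: (inputs, target).
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = train_linear(samples)
print(w)
```

Each weight moves in proportion to its own input and to the output error (O - t), so weights on inactive inputs are left unchanged.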

ANNs - Hidden to output layer

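$$\frac{\partial E}{\partial w_j} = \frac{\partial E(O(o(w_j)))}{\partial w_j} = \frac{\partial E}{\partial O} \cdot \frac{\partial O}{\partial o} \cdot \frac{\partial o}{\partial w_j}$$

$$E = \frac{1}{2} \cdot (O - t)^2$$

$$O = g(o)$$

$$o = \sum_j w_j \cdot H_j$$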

Hidden to output layer

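$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial O} \cdot \frac{\partial O}{\partial o} \cdot \frac{\partial o}{\partial w_j} = (O - t) \cdot g'(o) \cdot H_j$$

$$\frac{\partial E}{\partial O} = (O - t)$$

$$\frac{\partial O}{\partial o} = \frac{\partial g}{\partial o} = g'(o)$$

$$\frac{\partial o}{\partial w_j} = \frac{\partial}{\partial w_j} \sum_l w_l \cdot H_l = H_j$$

$$o = \sum_j w_j \cdot H_j$$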

What about the hidden layer?

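$$\Delta v_{jk} = -\varepsilon \cdot \frac{\partial E}{\partial v_{jk}}$$

$$o = \sum_j w_j \cdot H_j$$

$$h_j = \sum_k v_{jk} \cdot I_k$$

$$E = \frac{1}{2} \cdot (O - t)^2$$

$$O = g(o), \qquad H = g(h)$$

$$g(x) = \frac{1}{1 + e^{-x}}$$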
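The derivative g'(x) appears in every gradient that follows. For the sigmoid it has a closed form, a standard identity that is added here and not stated on the slides:

```latex
g'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
      = \frac{1}{1 + e^{-x}} \cdot \left(1 - \frac{1}{1 + e^{-x}}\right)
      = g(x) \cdot \bigl(1 - g(x)\bigr)
```

So $g'(o) = O \cdot (1 - O)$ and $g'(h_j) = H_j \cdot (1 - H_j)$: the derivatives can be computed from the neuron outputs alone, with no extra evaluation of the exponential.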

Input to hidden layer

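$$\frac{\partial E}{\partial v_{jk}} = \frac{\partial E(O(o(H_j(h_j(v_{jk})))))}{\partial v_{jk}} = \frac{\partial E}{\partial O} \cdot \frac{\partial O}{\partial o} \cdot \frac{\partial o}{\partial H_j} \cdot \frac{\partial H_j}{\partial h_j} \cdot \frac{\partial h_j}{\partial v_{jk}}$$

$$= (O - t) \cdot g'(o) \cdot w_j \cdot g'(h_j) \cdot I_k$$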

Summary

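$$\frac{\partial E}{\partial w_j} = (O - t) \cdot g'(o) \cdot H_j$$

$$\frac{\partial E}{\partial v_{jk}} = (O - t) \cdot g'(o) \cdot w_j \cdot g'(h_j) \cdot I_k$$

Or

$$\frac{\partial E}{\partial w_j} = (O - t) \cdot g'(o) \cdot H_j = \delta \cdot H_j$$

$$\frac{\partial E}{\partial v_{jk}} = (O - t) \cdot g'(o) \cdot w_j \cdot g'(h_j) \cdot I_k = \delta \cdot w_j \cdot g'(h_j) \cdot I_k$$

$$\delta = (O - t) \cdot g'(o)$$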
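The two gradient formulas can be checked numerically. The sketch below is plain Python with arbitrary illustrative weights and a single training example; it compares the analytic gradients with central finite differences and uses the sigmoid identity g'(x) = g(x)(1 - g(x)).

```python
import math


def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(I, v, w):
    """I[k]: inputs, v[j][k]: input-to-hidden weights, w[j]: hidden-to-output weights."""
    H = [g(sum(v[j][k] * I[k] for k in range(len(I)))) for j in range(len(v))]
    O = g(sum(w[j] * H[j] for j in range(len(w))))
    return H, O


def error(I, v, w, t):
    """E = 1/2 * (O - t)^2."""
    return 0.5 * (forward(I, v, w)[1] - t) ** 2


def gradients(I, v, w, t):
    """Analytic gradients dE/dw_j and dE/dv_jk from the summary formulas."""
    H, O = forward(I, v, w)
    delta = (O - t) * O * (1.0 - O)                      # (O - t) * g'(o)
    dE_dw = [delta * H[j] for j in range(len(w))]        # delta * H_j
    dE_dv = [[delta * w[j] * H[j] * (1.0 - H[j]) * I[k]  # delta * w_j * g'(h_j) * I_k
              for k in range(len(I))] for j in range(len(v))]
    return dE_dw, dE_dv


# Arbitrary illustrative numbers; the last input plays the role of the bias input.
I, t = [1.0, 0.0, 1.0], 1.0
v = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.3]]
w = [0.7, -0.6]
dE_dw, dE_dv = gradients(I, v, w, t)

# Central finite differences for one weight in each layer.
step = 1e-6
numeric_dw0 = (error(I, v, [w[0] + step, w[1]], t)
               - error(I, v, [w[0] - step, w[1]], t)) / (2 * step)
v_plus = [[v[0][0] + step, v[0][1], v[0][2]], v[1]]
v_minus = [[v[0][0] - step, v[0][1], v[0][2]], v[1]]
numeric_dv00 = (error(I, v_plus, w, t) - error(I, v_minus, w, t)) / (2 * step)

print(dE_dw[0], numeric_dw0)
print(dE_dv[0][0], numeric_dv00)
```

A finite-difference check of this kind is a common way to catch index or sign mistakes in a hand-written back-propagation routine.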

Neural networks and the XOR function

Deep(er) Network architecture

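$$E = \frac{1}{2} \cdot (O - t)^2$$

$$O = g(o), \qquad H = g(h)$$

$$g(x) = \frac{1}{1 + e^{-x}}$$

$$o = \sum_j w_j \cdot H_j^{(2)}$$

$$h_j^{(2)} = \sum_k v_{jk} \cdot H_k^{(1)}$$

$$h_k^{(1)} = \sum_l u_{kl} \cdot I_l$$

$$\Delta w_i = -\varepsilon \cdot \frac{\partial E}{\partial w_i}$$

(Superscripts in parentheses are layer indices, not powers.)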

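Deeper Network architecture

[Figure: network with input layer I_l (index l), 1st hidden layer (index k; sums h1k, outputs H1k), 2nd hidden layer (index j; sums h2j, outputs H2j) and output layer (sum h3, output H3). The layers are connected by the weights ukl, vjk and wj.]

$$\frac{\partial E}{\partial w_j} = \frac{\partial E(H^{(3)}(h^{(3)}(w_j)))}{\partial w_j} = \frac{\partial E}{\partial H^{(3)}} \cdot \frac{\partial H^{(3)}}{\partial h^{(3)}} \cdot \frac{\partial h^{(3)}}{\partial w_j} = (H^{(3)} - t) \cdot g'(h^{(3)}) \cdot H_j^{(2)}$$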

Network architecture (hidden to hidden)

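$$\frac{\partial E}{\partial v_{jk}} = \frac{\partial E}{\partial H^{(3)}} \cdot \frac{\partial H^{(3)}}{\partial h^{(3)}} \cdot \frac{\partial h^{(3)}}{\partial H_j^{(2)}} \cdot \frac{\partial H_j^{(2)}}{\partial h_j^{(2)}} \cdot \frac{\partial h_j^{(2)}}{\partial v_{jk}} = (H^{(3)} - t) \cdot g'(h^{(3)}) \cdot w_j \cdot g'(h_j^{(2)}) \cdot H_k^{(1)}$$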

Network architecture (input to hidden)

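$$\frac{\partial E}{\partial u_{kl}} = \frac{\partial E}{\partial H^{(3)}} \cdot \frac{\partial H^{(3)}}{\partial h^{(3)}} \cdot \sum_j \frac{\partial h^{(3)}}{\partial H_j^{(2)}} \cdot \frac{\partial H_j^{(2)}}{\partial h_j^{(2)}} \cdot \frac{\partial h_j^{(2)}}{\partial H_k^{(1)}} \cdot \frac{\partial H_k^{(1)}}{\partial h_k^{(1)}} \cdot \frac{\partial h_k^{(1)}}{\partial u_{kl}}$$

$$= (H^{(3)} - t) \cdot g'(h^{(3)}) \cdot \sum_j w_j \cdot g'(h_j^{(2)}) \cdot v_{jk} \cdot g'(h_k^{(1)}) \cdot I_l$$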

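Use deltas

$$h_j^{(q)} = \sum_i w_{ji}^{(q)} \cdot H_i^{(q-1)}$$

$$H_i^{(q-1)} = g(h_i^{(q-1)})$$

[Figure: three layers with indices j, k and l, connected by the weights vjk and ukl, with the deltas δj and δk attached to layers j and k.]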

Bishop, Christopher (1995). Neural networks for pattern recognition. Oxford: Clarendon Press. ISBN 0-19-853864-2.

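Use deltas

$$\frac{\partial E}{\partial w_{ji}^{(q)}} = \frac{\partial E}{\partial h_j^{(q)}} \cdot \frac{\partial h_j^{(q)}}{\partial w_{ji}^{(q)}} = \delta_j^{(q)} \cdot H_i^{(q-1)}, \qquad \delta_j^{(q)} = \frac{\partial E}{\partial h_j^{(q)}}$$

$$\delta^{(3)} = \frac{\partial E}{\partial h^{(3)}} = \frac{\partial E}{\partial H^{(3)}} \cdot \frac{\partial H^{(3)}}{\partial h^{(3)}} = (H^{(3)} - t) \cdot g'(h^{(3)})$$

$$\delta_j^{(2)} = \frac{\partial E}{\partial h_j^{(2)}} = \frac{\partial E}{\partial h^{(3)}} \cdot \frac{\partial h^{(3)}}{\partial h_j^{(2)}} = \frac{\partial E}{\partial h^{(3)}} \cdot \frac{\partial h^{(3)}}{\partial H_j^{(2)}} \cdot \frac{\partial H_j^{(2)}}{\partial h_j^{(2)}} = g'(h_j^{(2)}) \cdot \delta^{(3)} \cdot w_j$$

$$\delta_k^{(1)} = \frac{\partial E}{\partial h_k^{(1)}} = \sum_j \frac{\partial E}{\partial h_j^{(2)}} \cdot \frac{\partial h_j^{(2)}}{\partial h_k^{(1)}} = \sum_j \frac{\partial E}{\partial h_j^{(2)}} \cdot \frac{\partial h_j^{(2)}}{\partial H_k^{(1)}} \cdot \frac{\partial H_k^{(1)}}{\partial h_k^{(1)}} = g'(h_k^{(1)}) \cdot \sum_j \delta_j^{(2)} \cdot v_{jk}$$

$$h_j = \sum_i w_{ji} \cdot H_i, \qquad H_i = g(h_i)$$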
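The delta recursion extends to any number of layers. The sketch below is a plain-Python illustration under stated assumptions: a single output unit, no separate bias terms, and arbitrary example weights. It is not the implementation behind the slides. The result is checked against a central finite difference.

```python
import copy
import math


def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, weights):
    """weights[q][j][i] connects unit i of layer q to unit j of layer q + 1.
    Returns the outputs H of every layer, starting with the input layer."""
    H = [list(inputs)]
    for W in weights:
        prev = H[-1]
        H.append([g(sum(wji * Hi for wji, Hi in zip(row, prev))) for row in W])
    return H


def backprop(inputs, weights, t):
    """Gradients dE/dw for E = 1/2 * (O - t)^2, computed with deltas."""
    H = forward(inputs, weights)
    O = H[-1][0]
    delta = [(O - t) * O * (1.0 - O)]                  # output delta: (O - t) * g'(h)
    grads = [None] * len(weights)
    for q in reversed(range(len(weights))):
        prev = H[q]
        grads[q] = [[dj * Hi for Hi in prev] for dj in delta]   # delta_j * H_i
        if q > 0:
            # delta_i = g'(h_i) * sum_j delta_j * w_ji, with g'(h_i) = H_i * (1 - H_i)
            delta = [prev[i] * (1.0 - prev[i])
                     * sum(delta[j] * weights[q][j][i] for j in range(len(delta)))
                     for i in range(len(prev))]
    return grads


# Arbitrary example: 2 inputs, two hidden layers of 2 units, 1 output.
weights = [
    [[0.2, -0.4], [0.7, 0.1]],     # u_kl: input -> 1st hidden layer
    [[-0.3, 0.5], [0.6, -0.1]],    # v_jk: 1st -> 2nd hidden layer
    [[0.8, -0.9]],                 # w_j:  2nd hidden layer -> output
]
inputs, t = [1.0, 0.5], 1.0
grads = backprop(inputs, weights, t)


def error(ws):
    """E = 1/2 * (O - t)^2 for a given set of weights."""
    return 0.5 * (forward(inputs, ws)[-1][0] - t) ** 2


# Central finite difference for one weight in the first layer.
step = 1e-6
plus, minus = copy.deepcopy(weights), copy.deepcopy(weights)
plus[0][1][0] += step
minus[0][1][0] -= step
numeric = (error(plus) - error(minus)) / (2 * step)
print(grads[0][1][0], numeric)
```

Each weight is visited once in the forward pass and once in the backward pass, which is why the cost of one gradient evaluation grows linearly with the number of weights.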

Deep learning

http://www.slideshare.net/hammawan/deep-neural-networks

Deep learning – time is not an issue

[Figure: CPU time (u) plotted against the number of weights (17000 to 20000), and the number of weights plotted against the number of layers (N layer, 0 to 6).]

Auto encoder

Pan-specific prediction methods

NetMHC NetMHCpan

Example

Peptide    Amino acids of HLA pockets  HLA    Aff
VVLQQHSIA  YFAVLTWYGEKVHTHVDTLVRYHY    A0201  0.131751
SQVSFQQPL  YFAVLTWYGEKVHTHVDTLVRYHY    A0201  0.487500
SQCQAIHNV  YFAVLTWYGEKVHTHVDTLVRYHY    A0201  0.364186
LQQSTYQLV  YFAVLTWYGEKVHTHVDTLVRYHY    A0201  0.582749
LQPFLQPQL  YFAVLTWYGEKVHTHVDTLVRYHY    A0201  0.206700
VLAGLLGNV  YFAVLTWYGEKVHTHVDTLVRYHY    A0201  0.727865
VLAGLLGNV  YFAVWTWYGEKVHTHVDTLLRYHY    A0202  0.706274
VLAGLLGNV  YFAEWTWYGEKVHTHVDTLVRYHY    A0203  1.000000
VLAGLLGNV  YYAVLTWYGEKVHTHVDTLVRYHY    A0206  0.682619
VLAGLLGNV  YYAVWTWYRNNVQTDVDTLIRYHY    A6802  0.407855
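To illustrate how one row of this table could become a network input, the sketch below concatenates the peptide and the pocket residues and encodes each residue as a 20-dimensional indicator vector. This sparse (one-hot) encoding is an assumption chosen for illustration; the slides do not state which encoding is used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def one_hot(sequence):
    """Encode each residue as 20 numbers: 1.0 at its own position, 0.0 elsewhere."""
    vec = []
    for aa in sequence:
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec


def encode_example(peptide, pocket_residues):
    """Network input for one (peptide, HLA) pair: peptide followed by pocket residues."""
    return one_hot(peptide) + one_hot(pocket_residues)


# One row of the table; the target value for this input would be Aff = 0.727865.
x = encode_example("VLAGLLGNV", "YFAVLTWYGEKVHTHVDTLVRYHY")
print(len(x), sum(x))
```

Because the HLA pocket residues are part of the input, the same peptide paired with different HLA molecules gives different input vectors, which is what lets a single network serve several HLA molecules.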

Going Deep – One hidden layer

[Figure: three panels plotted against # iterations (0 to 500): MSE (Train), MSE (Test) and PCC (Test). Legend: 20.]

Going Deep – 3 hidden layers

[Figure: three panels plotted against # iterations (0 to 500): MSE (Train), MSE (Test) and PCC (Test). Legend: 20, 20+20, 20+20+20.]

Going Deep – more than 3 hidden layers

[Figure: three panels plotted against # iterations (0 to 500): MSE (Train), MSE (Test) and PCC (Test). Legend: 20, 20+20, 20+20+20, 20+20+20+20, 20+20+20+20+20.]

Going Deep – Using Auto-encoders

[Figure: three panels plotted against # iterations (0 to 500): MSE (Train), MSE (Test) and PCC (Test). Legend: 20, 20+20, 20+20+20, 20+20+20+20, 20+20+20+20+20, 20+20+20+20+Auto.]

Conclusions

•  Implementing deep networks using deltas [1] makes CPU time scale linearly with the number of weights
   –  So going deep is not more CPU intensive than going wide
•  Back-propagation is an efficient method for NN training for shallow networks with up to 3 hidden layers
•  For deeper networks, pre-training is required, using for instance auto-encoders

[1] Bishop, Christopher (1995). Neural networks for pattern recognition. Oxford: Clarendon Press. ISBN 0-19-853864-2.
