Optimization methods Morten Nielsen Department of Systems Biology, DTU

Upload: jemima-johnston

Post on 18-Dec-2015

Optimization methods

Morten Nielsen

Department of Systems Biology, DTU

Outline

• Optimization procedures
  – Gradient descent
  – Monte Carlo
• Overfitting
  – Cross-validation
• Method evaluation

Linear methods. Error estimate

[Figure: linear function with inputs I1, I2, weights w1, w2 and output o]
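As a minimal sketch of the linear function on this slide: the output o is the weighted sum of the inputs. The error formula did not survive extraction, so the squared error E = 1/2 (o - t)^2 used here is an assumption, as are the function names.

```python
def linear_output(weights, inputs):
    """Output o of the linear function: the weighted sum of the inputs."""
    return sum(w * i for w, i in zip(weights, inputs))


def squared_error(o, t):
    """Assumed error estimate: E = 1/2 * (o - t)^2 for target t."""
    return 0.5 * (o - t) ** 2


o = linear_output([0.1, 0.1], [1.0, 0.0])
print(o, squared_error(o, t=1.0))
```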

Gradient descent (from Wikipedia)

Gradient descent is based on the observation that if the real-valued function F(x) is defined and differentiable in a neighborhood of a point a, then F(x) decreases fastest if one goes from a in the direction of the negative gradient of F at a. It follows that, if

$b = a - \gamma \nabla F(a)$

for $\gamma > 0$ a small enough number, then F(b) < F(a)
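The statement above can be checked on a toy function. This is a sketch only: the function F(x) = (x - 3)^2, the step size of 0.1 and the names are assumptions chosen for illustration.

```python
def F(x):
    """Example function (an assumption for illustration): minimum at x = 3."""
    return (x - 3.0) ** 2


def grad_F(x):
    """Analytical gradient of F."""
    return 2.0 * (x - 3.0)


def gradient_descent(a, step, n_steps):
    """Repeat b = a - step * grad_F(a), returning every point visited."""
    path = [a]
    for _ in range(n_steps):
        a = a - step * grad_F(a)
        path.append(a)
    return path


path = gradient_descent(a=0.0, step=0.1, n_steps=25)
for a, b in zip(path, path[1:]):
    assert F(b) < F(a)  # each step lowers F when the step is small enough
print(path[-1])
```

With a step size that is too large (for this F, anything above 1.0) the iterates overshoot and F(b) < F(a) no longer holds, which is why the step must be "small enough".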

Gradient descent (example)

Gradient descent

Weights are changed in the opposite direction of the gradient of the error

Gradient descent (Linear function)

Weights are changed in the opposite direction of the gradient of the error

[Figure: linear function with inputs I1, I2, weights w1, w2 and output o]
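The update equations on this slide were lost in extraction. As a hedged sketch, assume the error is E = 1/2 (o - t)^2 with o = sum_i w_i I_i. Then dE/dw_i = (o - t) I_i, and moving against the gradient gives w_i <- w_i - step * (o - t) * I_i. The function name and the example numbers are hypothetical.

```python
def gradient_step(weights, inputs, target, step):
    """One forward/backward pass for a linear function.

    Assumes E = 1/2 * (o - t)^2, so dE/dw_i = (o - t) * I_i.
    """
    o = sum(w * i for w, i in zip(weights, inputs))  # forward: prediction
    delta = o - target
    # backward: move each weight against its error gradient
    return [w - step * delta * i for w, i in zip(weights, inputs)]


new_weights = gradient_step([0.5, -0.2], [1.0, 2.0], target=1.0, step=0.1)
print(new_weights)
```

A weight whose input is zero has a zero gradient and is left unchanged by this update.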

Gradient descent. Example

Weights are changed in the opposite direction of the gradient of the error

[Figure: linear function with inputs I1, I2, weights w1, w2 and output o]

Gradient descent. Doing it yourself

Weights are changed in the opposite direction of the gradient of the error

[Figure: linear function with inputs I1 = 1, I2 = 0, weights W1 = 0.1, W2 = 0.1 and output o]

What are the weights after 2 forward (calculate predictions) and backward (update weights) iterations with the given input, and has the error decreased (use a step size of 0.1 and t = 1)?

Fill out the table

itr   W1    W2    O
0     0.1   0.1
1
2

What are the weights after 2 forward/backward iterations with the given input, and has the error decreased (use a step size of 0.1 and t = 1)?

[Figure: linear function with inputs I1 = 1, I2 = 0, weights W1 = 0.1, W2 = 0.1 and output o]
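One way to check a hand-filled table is to run the iterations in code. This sketch assumes the error E = 1/2 (o - t)^2 (the slide's own formula was not recovered), a step size of 0.1 and t = 1; a different error definition would give different numbers.

```python
inputs = [1.0, 0.0]
weights = [0.1, 0.1]
target, step = 1.0, 0.1

rows = []
for itr in range(3):
    o = sum(w * i for w, i in zip(weights, inputs))   # forward pass
    error = 0.5 * (o - target) ** 2                   # assumed error function
    rows.append((itr, weights[0], weights[1], o, error))
    # backward pass: w_i <- w_i - step * (o - t) * I_i
    weights = [w - step * (o - target) * i for w, i in zip(weights, inputs)]

print("itr  W1     W2     O      E")
for itr, w1, w2, o, error in rows:
    print(f"{itr:<4d} {w1:<6.3f} {w2:<6.3f} {o:<6.3f} {error:.4f}")
```

Under these assumptions W1 moves from 0.1 to 0.19 to 0.271, W2 stays at 0.1 because its input is 0, and the error falls at every iteration.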

Monte Carlo

Because of their reliance on repeated computation of random or pseudo-random numbers, Monte Carlo methods are most suited to calculation by a computer. Monte Carlo methods tend to be used when it is unfeasible or impossible to compute an exact result with a deterministic algorithm.

Or when you are too stupid to do the math yourself?

Monte Carlo (Minimization)

[Figure labels: dE < 0, dE > 0]

Gibbs sampler. Monte Carlo simulations

RFFGGDRGAPKRGYLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK GFKGEQGPKGEPDVFKELKVHHANENI SRYWAIRTRSGGITYSTNEIDLQLSQEDGQTIE

RFFGGDRGAPKRGYLDPLIRGLLARPAKLQVKPGQPPRLLIYDASNRATGIPAGSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK GFKGEQGPKGEPDVFKELKVHHANENI SRYWAIRTRSGGITYSTNEIDLQLSQEDGQTIE

E1 = 5.4, E2 = 5.7: dE > 0; Paccept = 1

E1 = 5.4, E2 = 5.2: dE < 0; 0 < Paccept < 1

Note the sign. Maximization

Monte Carlo Temperature

• What is the Monte Carlo temperature?

• Say dE=-0.2, T=1

• T=0.001
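The two cases on this slide can be checked numerically. The acceptance formula itself did not survive extraction, so this sketch assumes the standard Metropolis criterion, written for maximization to match the sign convention of the Gibbs sampler slide: Paccept = 1 if dE > 0, otherwise exp(dE / T). The function name is hypothetical.

```python
import math


def p_accept(dE, T):
    """Metropolis acceptance probability for maximization (assumed form).

    Moves that increase E (dE > 0) are always accepted; moves that
    decrease E are accepted with probability exp(dE / T).
    """
    return 1.0 if dE > 0 else math.exp(dE / T)


for T in (1.0, 0.001):
    print(f"dE=-0.2, T={T}: P_accept = {p_accept(-0.2, T):.3g}")
```

Under this assumption, dE = -0.2 is accepted about 82% of the time at T = 1 and essentially never at T = 0.001: a high temperature lets the search take unfavorable moves, a low temperature makes it greedy.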

MC minimization
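A minimal sketch of Monte Carlo minimization, assuming the Metropolis acceptance rule with the sign flipped for minimization: downhill moves (dE < 0) are always accepted, uphill moves with probability exp(-dE / T). The toy energy function, the move size, the seed and all names are assumptions made for illustration.

```python
import math
import random


def mc_minimize(energy, x0, T, n_steps, step_size=0.5, seed=0):
    """Metropolis Monte Carlo minimization of a 1-D energy function.

    Returns the lowest-energy point seen and its energy.
    """
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for _ in range(n_steps):
        x_new = x + rng.uniform(-step_size, step_size)  # random move
        e_new = energy(x_new)
        dE = e_new - e
        # accept downhill moves always, uphill moves with prob. exp(-dE / T)
        if dE < 0 or rng.random() < math.exp(-dE / T):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def energy(x):
    """Toy landscape with two minima (an assumption for illustration)."""
    return x ** 4 - 3 * x ** 2 + x


best_x, best_e = mc_minimize(energy, x0=2.0, T=0.5, n_steps=5000)
print(best_x, best_e)
```

Because uphill moves are sometimes accepted, the walk can climb out of the shallower minimum; with T close to zero the same code reduces to a greedy search that stays in whichever minimum it reaches first.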

Monte Carlo - Examples

• Why a temperature?

Local minima

Data driven method training

• A prediction method contains a very large set of parameters
  – A matrix for predicting binding for 9-mer peptides has 9 x 20 = 180 weights
• Overfitting is a problem

[Figure axes: years, Temperature]

Evaluation of predictive performance

• Train PSSM on raw data
  – No pseudo counts, no sequence weighting
  – Fit 9*20 parameters to 9*10 data points
• Evaluate on training data
  – PCC = 0.97
  – AUC = 1.0
• Close to a perfect prediction method

Binders:
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Non-binders:
MRSGRVHAV
VRFNIDETP
ANYIGQDGL
AELCGDPGD
QTRAVADGK
GRPVPAAHP
MTAQWWLDA
FARGVVHVI
LQRELTRLQ
AVAEEMTKS

Evaluation of predictive performance

• Train PSSM on permuted (random) data
  – No pseudo counts, no sequence weighting
  – Fit 9*20 parameters to 9*10 data points
• Evaluate on training data
  – PCC = 0.97
  – AUC = 1.0
• Close to a perfect prediction method AND
• Same performance as on the original data

Binders:
AAAMAAKLA
AAKNLAAAA
AKALAAAAR
AAAAKLATA
ALAKAVAAA
IPELMRTNG
FIMGVFTGL
NVTKVVAWL
LEPLNLVLK
VAVIVSVPF

Non-binders:
MRSGRVHAV
VRFNIDETP
ANYIGQDGL
AELCGDPGD
QTRAVADGK
GRPVPAAHP
MTAQWWLDA
FARGVVHVI
LQRELTRLQ
AVAEEMTKS

Repeat on large training data (229 ligands)

Cross validation

Train on 4/5 of data
Test/evaluate on 1/5
=> Produce 5 different methods each with a different prediction focus
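The 4/5 versus 1/5 split can be sketched as follows. The interleaved partitioning and the function name are assumptions; any partitioning in which each data point is held out exactly once would serve.

```python
def five_fold_splits(data, n_folds=5):
    """Yield (train, test) pairs; each data point is tested exactly once."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, test


data = list(range(20))  # stand-in for peptide/target pairs
for n, (train, test) in enumerate(five_fold_splits(data)):
    print(n, len(train), len(test))
```

Each of the five (train, test) pairs yields its own trained method, which is why five different methods come out of one data set.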

Model over-fitting

2000 MHC:peptide binding data
PCC = 0.99

Evaluate on 600 MHC:peptide binding data
PCC = 0.80

Model over-fitting (early stopping)

Evaluate on 600 MHC:peptide binding data
PCC = 0.89

Stop training
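Early stopping can be sketched as: train by gradient descent, monitor the error on data held out from training, and keep the weights from the iteration where that error was lowest. Everything concrete here is an assumption for illustration: the toy data, the linear model, the squared error, the step size of 0.05 and the 500 epochs.

```python
import random

rng = random.Random(1)


def make_data(n):
    """Toy data (an assumption): target = first input plus noise."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(10)]
        data.append((x, x[0] + rng.gauss(0, 0.5)))
    return data


def mean_error(w, data):
    """Mean of 1/2 * (o - t)^2 over a data set."""
    return sum(0.5 * (sum(wi * xi for wi, xi in zip(w, x)) - t) ** 2
               for x, t in data) / len(data)


train, test = make_data(15), make_data(200)
w = [0.0] * 10
best_w, best_err, best_epoch = list(w), mean_error(w, test), 0
for epoch in range(1, 501):
    for x, t in train:  # one gradient descent pass over the training set
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi - 0.05 * (o - t) * xi for wi, xi in zip(w, x)]
    err = mean_error(w, test)
    if err < best_err:  # remember the weights with the best test error
        best_w, best_err, best_epoch = list(w), err, epoch

print(best_epoch, best_err, mean_error(w, test))
```

The method kept is `best_w`, not the final `w`. Since the held-out set is used to choose the stopping point, its performance is no longer an unbiased estimate for the selected method.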

What is going on?

[Figure axes: years, Temperature]

5 fold training

Which method to choose?

Method evaluation

• Use cross validation
• Evaluate on concatenated data and not as an average over each cross-validated performance
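A small illustration of why the two evaluations can differ. The numbers are made up for this sketch: each fold ranks its own test data perfectly, but the two methods predict on different scales.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (targets, predictions) per fold; made-up numbers for illustration only
folds = [
    ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]),
    ([0.1, 0.2, 0.3], [0.6, 0.7, 0.8]),  # same ranking, shifted scale
]

average_pcc = sum(pearson(t, p) for t, p in folds) / len(folds)
all_t = [t for ts, _ in folds for t in ts]
all_p = [p for _, ps in folds for p in ps]
concatenated_pcc = pearson(all_t, all_p)
print(average_pcc, concatenated_pcc)
```

Here the per-fold average is essentially 1.0 while the concatenated PCC is about 0.31. The concatenated value is the one that reflects how the combined predictions behave on data pooled across folds.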

Method evaluation

Which prediction to use?

SMM - Stabilization matrix method

[Figure: linear function with inputs I1, I2, weights w1, w2 and output o]

[Equations: per-target error and global error, annotated with "sum over weights" and "sum over data points"]
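The per-target and global error formulas were not recovered from this slide. The sketch below assumes a common form: a squared error plus a penalty lam * sum_i w_i^2 (the sum over weights), with the global error summing the squared error over all data points. The exact form, the factor 1/2 and the name `lam` are assumptions, not the slide's definitions.

```python
def output(weights, inputs):
    """Linear function: weighted sum of the inputs."""
    return sum(w * i for w, i in zip(weights, inputs))


def per_target_error(weights, inputs, target, lam):
    """Assumed per-target error: squared error plus lam * sum of w_i^2."""
    penalty = lam * sum(w * w for w in weights)          # sum over weights
    return 0.5 * (output(weights, inputs) - target) ** 2 + penalty


def global_error(weights, data, lam):
    """Assumed global error: squared error summed over all data points,
    with the weight penalty counted once."""
    fit = sum(0.5 * (output(weights, x) - t) ** 2 for x, t in data)
    return fit + lam * sum(w * w for w in weights)


data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
print(per_target_error([0.1, 0.1], *data[0], lam=0.1))
print(global_error([0.1, 0.1], data, lam=0.1))
```

The penalty pulls all weights toward zero, so a weight only grows when the data support it; this is the in-method route to limiting over-fitting.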

SMM - Stabilization matrix method

[Figure: linear function with inputs I1, I2, weights w1, w2 and output o]

λ per target
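A hedged sketch of a per-target gradient descent update with the weight penalty included. It assumes the per-target error E = 1/2 (o - t)^2 + lam * sum_i w_i^2, which gives dE/dw_i = (o - t) I_i + 2 lam w_i; the slide's own equation was lost, and the function and parameter names are hypothetical.

```python
def smm_gradient_step(weights, inputs, target, step, lam):
    """One per-target update with the weight penalty included.

    Assumes E = 1/2 * (o - t)^2 + lam * sum(w_i^2), so
    dE/dw_i = (o - t) * I_i + 2 * lam * w_i.
    """
    o = sum(w * i for w, i in zip(weights, inputs))
    return [w - step * ((o - target) * i + 2.0 * lam * w)
            for w, i in zip(weights, inputs)]


print(smm_gradient_step([0.1, 0.1], [1.0, 0.0], target=1.0, step=0.1, lam=0.1))
```

Unlike the unpenalized update, a weight with zero input now shrinks toward zero. If this step is applied once per data point, a per-target `lam` equal to the global value divided by the number of data points keeps the total penalty comparable; that scaling is an assumption, not something the recovered slide text confirms.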

SMM training

Evaluate on 600 MHC:peptide binding data
L = 0: PCC = 0.70
L = 0.1: PCC = 0.78

SMM - Stabilization matrix method. Monte Carlo

[Figure: linear function with inputs I1, I2, weights w1, w2 and output o]

[Equation: global error]

• Make random change to weights

• Calculate change in “global” error

• Update weights if MC move is accepted

Note difference between MC and GD in the use of “global” versus “per target” error
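The three steps above can be sketched as a Monte Carlo training loop on the global error. The error form (summed squared error plus lam * sum_i w_i^2), the Metropolis acceptance rule, the move size, the temperature, the toy data and all names are assumptions for illustration.

```python
import math
import random


def global_error(weights, data, lam):
    """Assumed global error: summed squared error plus lam * sum(w_i^2)."""
    fit = sum(0.5 * (sum(w * i for w, i in zip(weights, x)) - t) ** 2
              for x, t in data)
    return fit + lam * sum(w * w for w in weights)


def mc_train(data, n_weights, lam, T=0.01, n_steps=5000, seed=0):
    """Fit weights by Monte Carlo moves judged on the global error."""
    rng = random.Random(seed)
    weights = [0.0] * n_weights
    error = global_error(weights, data, lam)
    for _ in range(n_steps):
        trial = list(weights)
        trial[rng.randrange(n_weights)] += rng.uniform(-0.1, 0.1)  # random change
        trial_error = global_error(trial, data, lam)
        dE = trial_error - error                      # change in global error
        if dE < 0 or rng.random() < math.exp(-dE / T):  # MC move accepted?
            weights, error = trial, trial_error
    return weights, error


data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
weights, error = mc_train(data, n_weights=2, lam=0.01)
print(weights, error)
```

Every trial move is scored against all data points at once, whereas the gradient descent update is computed from one target at a time.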

Training/evaluation procedure

• Define method
• Select data
• Deal with data redundancy
  – In method (sequence weighting)
  – In data (Hobohm)
• Deal with over-fitting either
  – in method (SMM regularization term) or
  – in training (stop fitting on test set performance)
• Evaluate method using cross-validation

A small doit script

/usr/opt/www/pub/CBS/courses/27623.algo/exercises/code/SMM/doit_ex

#! /bin/tcsh

# Loop over alleles (a), values of l, and the 5 cross-validation partitions (n)
foreach a ( `cat allelefile` )
    mkdir -p $a
    cd $a
    foreach l ( 0 1 2.5 5 10 20 30 )
        mkdir -p l.$l
        cd l.$l
        foreach n ( 0 1 2 3 4 )
            # train on partition n, then score the matching evaluation set
            smm -nc 500 -l $l train.$n > mat.$n
            pep2score -mat mat.$n eval.$n > eval.$n.pred
        end
        # evaluate on the concatenated predictions of all 5 partitions
        echo $a $l `cat eval.?.pred | grep -v "#" | gawk '{print $2,$3}' | xycorr`
        cd ..
    end
    cd ..
end