Bi04b_1

© Copyright W. Schreiner 2005

Unit 04b: Multiple sequence alignment by Genetic Algorithms

Bi04b_2

9 Biological ideas used in Genetic Algorithms (GA) and Genetic Programming (GP)

• the Darwinian principle of survival of the fittest
• asexual mutation operation
• sexual recombination (crossover) operation
• inversion operation
• gene regulation
• gene duplication
• gene deletion
• embryos
• development of embryo into organism

from Koza (1993)

„... What Biology can do for Computer Science ...“

Bi04b_3

Genetic Algorithm (GA)

Definition

The GA is a very general computational approach that can be tailored to solve optimization or search tasks for very different problems and settings.

The genetic algorithm is a mathematical algorithm that transforms a set (population) of mathematical objects (typically fixed-length binary character strings), each with an associated fitness value, into a new set (new generation of the population) of offspring objects, using operations patterned after naturally occurring genetic operations and the Darwinian principle of reproduction and survival of the fittest.

from Koza (1993)

Bi04b_4

Genetic Algorithm, Concepts

[Figure: the individuals of the parent population (fitness scores 3, 1, 4, 5, 6, 2) are transformed by GENETIC OPERATIONS into the offspring population (fitness scores 4, 2, 2, 4, 7, 2); the cycle is repeated (loop).]

Bi04b_5

Types of individuals for GA

• binary strings (e.g. 011011)
• character strings (e.g. AZKLMN)
• tree structures
• technical constructions, differing in detail
• computer programs (e.g. begin loop add end; begin add met end); GAs on programs: „genetic programming“

Figure label: representations directly usable by GA

Bi04b_6

Types of individuals for GA, ctd.

• a particular (special) fold of a protein
• a particular journey for the travelling salesman (figure: cities A to F)
• a particular alignment of N sequences („msa via GAs“)

Bi04b_7

Types of individuals for GA, ctd.

• a vascular tree supplying N sites of tissue

Preparatory Step 1 to implement a GA:

Recast the representation of individuals to strings or trees

Bi04b_8

Fitness scores for types of individuals

[Figure: the individuals of the parent population (fitness scores 3, 1, 4, 5, 6, 2) are transformed by GENETIC OPERATIONS into the offspring population (fitness scores 4, 2, 2, 4, 7, 2); the cycle is repeated (loop).]

Bi04b_9

Fitness scores for types of individuals, ctd.

Problem to optimize                        Fitness score, example
technical construction                     maximum load - weight of material
computer program (begin loop add end)      rate of successful perceptions (TP, FP, TN, FN)
fold of a protein                          - energy of protein

Bi04b_10

Fitness scores for types of individuals, ctd.

Problem to optimize                        Fitness score, example
alignment of N sequences                   alignment score
journey of the travelling salesman         - length of journey
vascular tree                              - blood volume

Preparatory Step 2 for implementing a GA:

Define algorithm to compute fitness score

Bi04b_11

Genetic operations for GA

[Figure: the individuals of the parent population (fitness scores 3, 1, 4, 5, 6, 2) are transformed by GENETIC OPERATIONS into the offspring population (fitness scores 4, 2, 2, 4, 7, 2); the cycle is repeated (loop).]

Bi04b_12

Genetic operations

• Darwinian reproduction (copy operation)
• sexual crossover
• asexual mutation

Darwinian reproduction (copy operation):

• individuals with higher fitness (F) are more likely to be chosen stochastically, e.g. via p ≅ 1 - e^(-kF)
• the best individuals are not necessarily chosen
• the worst individual is not necessarily excluded
• a certain fraction of the population undergoes reproduction (either exact or randomly selected)

[Figure: ACKBDF  ACKBDF]

Bi04b_13

Genetic operations, ctd.

• Darwinian reproduction (asexual copy operation)
• sexual crossover
• asexual mutation

[Figure from Brown (1999)]

Bi04b_14

Genetic operations, ctd.

• Darwinian reproduction (asexual copy operation)
• sexual crossover
• asexual mutation

[Figure]

Bi04b_15

Genetic operations, ctd.

• Darwinian reproduction (asexual copy operation)
• sexual crossover
• asexual mutation

Sexual crossover:

• the predominant operation with GAs
• a certain fraction of individuals goes into the „mating pool“ based on fitness; or tournament selection: mate best bull with best cow
• two parental individuals (strings, trees) are chosen based on fitness
• pick a point in the genome (the same for both parents) to become the recombinant joint

Bi04b_16

Genetic operations, ctd.

• Darwinian reproduction (asexual copy operation)
• sexual crossover
• asexual mutation

Branch migration shifts recombinant joint

[Figure]

Bi04b_17

Sexual crossover, details

for string representation:

• pick the position of the recombinant joint stochastically between 1 and L-1 (L = length of genome representation string)
• join recombinants

[Figure: the parental „string chromosomes“ (father and mother, positions 1, 2, ..., L-1, L) are cut at the recombinant joint and rejoined into two offspring „string chromosomes“.]
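The string crossover described on this slide can be sketched as follows. This is only an illustration: the function name `one_point_crossover` and its parameters are made up for this example, and both parents are assumed to have the same length L.

```python
import random


def one_point_crossover(father, mother, joint=None, rng=random):
    """Cut both parents at the same joint and swap the tails.

    `joint` counts the characters kept from the first parent;
    if omitted it is drawn uniformly from 1 .. L-1.
    """
    if len(father) != len(mother):
        raise ValueError("parents must have the same length")
    length = len(father)
    if joint is None:
        joint = rng.randint(1, length - 1)  # randint includes both ends
    child_a = father[:joint] + mother[joint:]
    child_b = mother[:joint] + father[joint:]
    return child_a, child_b


# Crossing 011 with 110 after position 2 gives 010 and 111.
print(one_point_crossover("011", "110", joint=2))
```

Because the joint lies between 1 and L-1, each child always contains material from both parents.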

Bi04b_18

Sexual crossover with strings

[Figure: two parental strings of six genes (gene 1 to gene 6) and the two offspring strings produced by one crossover. Parents: red color, long hair, 3 legs; blue color, short hair, 4 legs. Offspring: blue color, short hair, 3 legs; red color, long hair, 4 legs.]

• color, hair length in this example: inherited together (linked features)
• color, # of legs in this example: inherited separately (independent)

Bi04b_19

Sexual crossover with strings, ctd.

[Figure: two parental strings of six genes (gene 1 to gene 6) and the two offspring strings produced by one crossover, labelled with the features color, hair length and number of legs.]

• features on genes close to each other: likely to be transmitted linked to each other
• features on genes lying far apart: more likely to be disrupted and transmitted independently from each other

Bi04b_20

Sexual crossover with strings, ctd.

From the model we can see:

• features on genes close to each other: likely to be transmitted linked to each other
• features on genes lying far apart: more likely to be disrupted and transmitted independently from each other

For „real“ genetics:

• reverse the argument to define gene distance
• observe the frequency of linked vs. independent transmission of features

→ derive a measure of gene distance (cM: centimorgan) within genome maps (genetic linkage maps)
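The linkage argument can be made quantitative for the string model. If a single recombinant joint is placed uniformly between positions 1 and L-1, two positions are separated exactly when the joint falls between them. The sketch below assumes this single-crossover model and treats each gene as one position; the function name is hypothetical.

```python
from fractions import Fraction


def separation_probability(pos_a, pos_b, length):
    """Probability that one uniformly placed joint separates two positions.

    Positions are 1-based. Of the length-1 possible joints,
    |pos_b - pos_a| lie between the two positions.
    """
    if length < 2:
        raise ValueError("need at least two positions")
    if not (1 <= pos_a <= length and 1 <= pos_b <= length):
        raise ValueError("positions are 1-based and must lie within the string")
    return Fraction(abs(pos_b - pos_a), length - 1)


# Six genes: the farther a gene is from gene 1, the more often it is separated.
for other in range(2, 7):
    print(f"gene 1 and gene {other}: {separation_probability(1, other, 6)}")
```

In this toy model the separation frequency grows linearly with distance, which is the relationship that is inverted when gene distances are derived from observed recombination frequencies.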

Bi04b_21

Gene Distance and Gene Maps

[Figures from Brown (1999)]

Two genes that lie relatively close together on a chromosome are less likely to be uncoupled by a crossover than genes that lie farther apart. White eyes (w) and yellow body (y) therefore recombine less often than white eyes and miniature wings (m).

Sturtevant's map of five genes on the X chromosome of Drosophila. Abbreviations: y, yellow body; w, white eyes; v, vermilion eyes; m, miniature wings; r, rudimentary wings.

Bi04b_22

GAs & Co-adapted Sets of genes

• Genes close to each other on the chromosome are less likely to be separated by a crossover. Therefore place things adjacent if they are a good combination (e.g. long legs and long neck).

→ The ability of the GA to solve the problem depends on this kind of choice.

• In nature: if cooperative beneficial features come to lie close together (due to crossover), they are from then on inherited together more effectively (called „inversion“).

• General idea of how GAs work: they generate co-adapted pairs that tend to get promoted in the population.

from Koza (1993)

Bi04b_23

Caveats regarding analogy to natural crossover

• Real genes usually stand for a feature (e.g. color) which is expressed or not expressed (they don't toggle between features, e.g. colors). GA genomes normally toggle.

• In addition to crossover, which is modeled after nature, many other artificial genetic operators may be freely designed in GAs to fit the representation and perform specifically suitable jobs in optimization (will be shown in examples).

Bi04b_24

Mutation Operation

• VERY occasional: maybe 1 bit/character per generation
• choose one parental string (asexual) based on fitness
• pick a point from 1 to L (using a uniform random distribution)
• mutation is a localized search, changing one factor only!
• similar to a Monte Carlo move in Gibbs sampling mode!

Example: point 3 chosen and mutated

parent      A L L M A A K
offspring   A L K M A A K

from Koza (1993)
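A minimal sketch of the point mutation, reproducing the example in which point 3 of the parent string is mutated. The function name, the `alphabet` argument and the rule of never re-drawing the old symbol are assumptions made for this illustration.

```python
import random


def point_mutation(parent, alphabet, point=None, new_symbol=None, rng=random):
    """Replace the symbol at one 1-based position of the parent string."""
    if point is None:
        point = rng.randint(1, len(parent))  # uniform over 1 .. L
    old = parent[point - 1]
    if new_symbol is None:
        # Assumption: the replacement always differs from the old symbol.
        new_symbol = rng.choice([s for s in alphabet if s != old])
    return parent[:point - 1] + new_symbol + parent[point:]


print(point_mutation("ALLMAAK", "AKLM", point=3, new_symbol="K"))  # ALKMAAK
```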

Bi04b_25

GA example run (4 individuals, 3-dimensional optimization)

individual   Generation 0             Mating pool for sexual crossover  Generation 1
             genome  fitness  prob    genome  fitness                   genome  fitness  prob
1            011     3        .25     011     3                         111     7        0.39
2            001     1        .08     110     6                         010     2        0.11
3            110     6        .50     110     6                         110     6        0.33
4            010     2        .17     010     2                         011     3        0.17
Total                12                       17                        18
Worst                1                        2                         2
Average              3                        4.25                      4.5
Best                 6                        6                         7

Select individuals of the population for the mating pool by chance, according to fitness.

Figure annotations: selected (individuals 1 and 4); NOT selected (individual 2); selected twice (individual 3)

from Koza (1993)
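The "prob" column of Generation 0 is each genome's fitness divided by the total fitness (3/12, 1/12, 6/12, 2/12). The sketch below recomputes it and draws a mating pool with fitness-proportional probabilities. Reading the fitness as the integer value of the binary genome is an assumption that happens to match every row of the table; the function names are illustrative.

```python
import random


def fitness(genome):
    """Fitness of a binary genome, read as an unsigned integer (assumption)."""
    return int(genome, 2)


def selection_probabilities(population):
    """Fitness-proportional selection probability of each individual."""
    total = sum(fitness(g) for g in population)
    return [fitness(g) / total for g in population]


def mating_pool(population, size, rng=random):
    """Draw `size` individuals with replacement, weighted by fitness."""
    weights = [fitness(g) for g in population]
    return rng.choices(population, weights=weights, k=size)


generation_0 = ["011", "001", "110", "010"]
print([round(p, 2) for p in selection_probabilities(generation_0)])
# [0.25, 0.08, 0.5, 0.17]
print(mating_pool(generation_0, 4, random.Random(0)))
```

Since the pool is drawn at random, a fit individual may appear twice and a weak one may be absent, as in the table, but a different seed can give a different pool.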

Bi04b_26

GA example run (4 individuals, 3-dimensional optimization)

[Table: same example run as on the previous slide (Generation 0, mating pool for sexual crossover, Generation 1).]

Perform genetic operations on members of the mating pool by chance, according to fitness.

Figure annotations: copy; mutation; crossover

from Koza (1993)

Bi04b_27

Genetic Algorithms are probabilistic

• creation of the initial random population (generation 0) (uniform distribution)
• probabilistic selection of participant(s) for the genetic operation (unequal probabilities, based on fitness)
• probabilistic selection of the type of operation (unequal probabilities)
• probabilistic selection of crossover or mutation point (equal or unequal probabilities)
• (often) probabilistic selection of fitness cases (uniform distribution)
• in each run you get a different answer: a GA is a multi-run algorithm

from Koza (1993)

Bi04b_28

Promising GA Application Areas

• Problem areas involving many variables that are interrelated in highly non-linear ways
• Problem areas involving many variables whose inter-relationship is not well understood
• Problem areas where a good approximate solution is satisfactory (and no one is expecting a perfect solution)
   - design
   - control
   - classification, pattern recognition, image processing
   - forecasting
   - model building and data mining

from Koza (1993)

Bi04b_29

Promising GA/GP Areas, ctd.

• Problem areas where discovery of the size and shape of the solution is a major part of the problem
• Problem areas where large computerized databases are accumulating and computerized techniques are needed to analyze the data
   - genome and protein sequences
   - satellite data
   - astronomy
   - petroleum
   - financial databases
   - marketing databases
   - World Wide Web

from Koza (1993)

Bi04b_30

Promising GA/GP Application Areas, ctd.

• Problem areas for which humans find it very difficult to write good programs ...

from Koza (1993)

Bi04b_31

Multiple sequence alignment by GA Program SAGA
(after Notredame C, Higgins DG: SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res. 1996;24:1515-1524)

Initialisation

1. create G0

Evaluation

2. evaluate the population of generation n (Gn)
3. if the population is stabilised then End
4. select the individuals to replace
5. evaluate the expected number of offspring (EO) for each individual (fitness based)

Breeding

6. select the parent(s) from Gn
7. select the operator
8. generate the new child
9. keep or discard the new child in Gn+1
10. goto 6 until all the children have been successfully put into Gn+1
11. n = n+1
12. goto Evaluation

End

13. end
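The thirteen steps can be condensed into a short loop. The sketch below is not SAGA itself: it is a generic stand-in that keeps the best half of each generation, fills the other half with new children, rejects duplicates and stops after a number of generations without improvement. All names, the default parameters, the use of raw fitness in place of EO, and the toy problem (maximise the number of 1s in a bit string) are assumptions for illustration.

```python
import random


def run_ga(fitness, random_individual, operators, pop_size=20,
           patience=20, max_generations=500, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]  # 1
    best, stale = max(population, key=fitness), 0
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)      # 2
        if fitness(ranked[0]) > fitness(best):
            best, stale = ranked[0], 0
        else:
            stale += 1
        if stale >= patience:                                       # 3
            break
        next_gen = list(ranked[: pop_size // 2])                    # 4
        # 5: raw (non-negative) fitness stands in for EO here.
        weights = [fitness(ind) + 1e-9 for ind in ranked]
        attempts = 0
        while len(next_gen) < pop_size:                             # 6-10
            attempts += 1
            first, second = rng.choices(ranked, weights=weights, k=2)
            operator = rng.choice(operators)                        # 7
            child = operator(first, second, rng)                    # 8
            # 9: discard duplicates, unless we keep failing to find new ones.
            if child not in next_gen or attempts > 50 * pop_size:
                next_gen.append(child)
        population = next_gen                                       # 11-12
    return best                                                     # 13


def crossover(a, b, rng):
    joint = rng.randint(1, len(a) - 1)
    return a[:joint] + b[joint:]


def mutate(a, _b, rng):
    i = rng.randrange(len(a))
    return a[:i] + ("1" if a[i] == "0" else "0") + a[i + 1:]


def random_bits(rng, length=16):
    return "".join(rng.choice("01") for _ in range(length))


best = run_ga(lambda g: g.count("1"), random_bits, [crossover, mutate])
print(best, best.count("1"))
```

The numbers in the comments refer to the steps of the list above. Because survivors are copied unchanged, the best fitness never decreases from one generation to the next.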

Bi04b_32

1. Initialisation in SAGA

Initialisation:

• generate a population of 100 random alignments
• each individual (alignment) contains only terminal gaps
• choose a random offset (e.g. between 0 and 50) for each sequence
• pad with leading and trailing gaps to make all sequences equally long

Bi04b_33

Initialisation, example

• find length Lmax of the longest sequence
• choose one common length LA > Lmax for all alignments
• for each sequence with length Li in the alignment: draw an offset with equal probabilities from 1 ≤ offset ≤ (LA-Li)+1

possible realization (here LA = 18):

A1
---WGKVNVDEVGGEAL-
--WDKVNEEEVGGEAL--
-WGKVGAHAGEYGAEAL-
--WSKVGGHAGEYGAEAL

A2
--WGKVNVDEVGGEAL--
WDKVNEEEVGGEAL----
-WGKVGAHAGEYGAEAL-
--WSKVGGHAGEYGAEAL

Am (m ≈ 100)
WGKVNVDEVGGEAL----
---WDKVNEEEVGGEAL-
WGKVGAHAGEYGAEAL--
-WSKVGGHAGEYGAEAL-

Examples of offsets: offset = 4 (first sequence of A1), offset = 1 (second sequence of A2), offset = 2 (last sequence of Am)
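The three initialisation rules translate directly into code. In this sketch the function name and the `extra` parameter (how much longer LA is than Lmax) are assumptions; the offset is 1-based as on the slide, so offset 1 means no leading gap.

```python
import random


def random_initial_alignment(sequences, extra=2, rng=random):
    """Build one random alignment that contains only terminal gaps."""
    if extra < 1:
        raise ValueError("LA must be larger than Lmax")
    l_max = max(len(s) for s in sequences)
    l_a = l_max + extra  # common alignment length LA > Lmax
    rows = []
    for seq in sequences:
        # 1 <= offset <= (LA - Li) + 1, all values equally likely
        offset = rng.randint(1, l_a - len(seq) + 1)
        leading = offset - 1
        trailing = l_a - len(seq) - leading
        rows.append("-" * leading + seq + "-" * trailing)
    return rows


sequences = ["WGKVNVDEVGGEAL", "WDKVNEEEVGGEAL",
             "WGKVGAHAGEYGAEAL", "WSKVGGHAGEYGAEAL"]
population = [random_initial_alignment(sequences, rng=random.Random(i))
              for i in range(100)]
print("\n".join(population[0]))
```

With `extra=2` the alignment length is 18, as in the realizations shown above.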

Bi04b_34

2.-5. Evaluation in SAGA

Given a multiple alignment A of N sequences:

$$\text{Fitness} = \text{Alignment cost}(A) = \sum_{i=2}^{N} \sum_{j=1}^{i-1} W_{ij} \cdot \text{cost}(A_{ij})$$

W_ij: weight
cost(A_ij): cost of pairwise alignment

cost(A_ij) is computed from substitution matrices (PAMs, BLOSUMs) with affine gap penalties (variants exist for vis-à-vis gaps)
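The double sum runs over every pair of sequences once. The sketch below implements that weighted sum of pairs with a deliberately simplified pairwise cost: fixed match, mismatch and per-gap values instead of a PAM or BLOSUM matrix with affine gap penalties. The numeric values, the sign convention (higher is better), the choice to ignore gap-versus-gap columns and the symmetric weight matrix are all assumptions of this example.

```python
from itertools import combinations


def pair_cost(row_a, row_b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Simplified cost of the pairwise alignment induced by two rows."""
    cost = 0.0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # gap opposite gap: ignored in this sketch
        if a == "-" or b == "-":
            cost += gap
        elif a == b:
            cost += match
        else:
            cost += mismatch
    return cost


def alignment_cost(rows, weights=None):
    """Weighted sum over all sequence pairs; weights[i][j] is W_ij."""
    total = 0.0
    for i, j in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[i][j]
        total += w * pair_cost(rows[i], rows[j])
    return total


rows = ["AC-", "ACG", "A-G"]
print(alignment_cost(rows))  # -3.0
```

The three pairs contribute 0, -3 and 0, which gives the total of -3.0.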

Bi04b_35

2.-5. Evaluation in SAGA, ctd.

4. choose the best 50% of alignments of Gn for direct copy operation, i.e.
   • they will make up half of the next generation
   • this is a method of overlapping generations
   • individuals not selected for copy will be replaced

5. evaluate for each alignment in Gn the expected number of offspring (EO) based on fitness. Usually 0 ≤ EO ≤ 2

Bi04b_36

6.-9. Breeding in SAGA (genetic operations other than copy)

6. stochastically (proportional to EO, unequal probabilities) select parents from Gn for the mating pool
7. stochastically select an operator (operators are specifically designed for a chosen representation scheme, see below)
8. apply the operator to (1 or 2) parent(s) and generate children
9. check children for duplicates within generation Gn+1: if duplicates occur, discard parents and children and repeat from item 6

repeat until enough children (e.g. 50% of population) are generated

Bi04b_37

3. Termination condition in SAGA

• there is no theoretically proven criterion for convergence
• heuristic stop if no improvement is found over the last 100 generations

Bi04b_38

Closeups on Operators in SAGA

[traditional operators: sexual crossover, asexual mutation]

GA operators specifically designed for multiple sequence alignments:

• crossover (2 different types)
• gap insertion
• block shuffling
• block searching
• local optimal or sub-optimal rearrangement

2 modes of usage for each operator:

• stochastic
• semi hill-climbing

Bi04b_39

One point crossover in SAGA

• acts on 2 parent alignments (sexual)
• 1st parent is cut straight at a random position
• 2nd parent is tailored to let the pieces fit together
• 2 different children may be produced
• spaces at the junction are filled with gaps
• keep child
   - at random (stochastic mode)
   - with better score (semi hill climbing mode)
• (this) operator is both: crossover + mutation
• operator may disrupt coherent parts of a sequence

Bi04b_40

One point crossover in SAGA, ctd.

Parent Alignment 1
WGKVN---VDEVGGEAL-
WDKVNEEE---VGGEAL-
WGKVG--AHAGEYGAEAL
WSKVGGHA--GEYGAEAL

Parent Alignment 2
--WGKVNVDEVG-GEAL
WD--KVNEEEVG-GEAL
WGKVGA-HAGEYGAEAL
WSKVGGHAGE-YGAEAL

Child Alignment 1
--WGKVN---VDEVGGEAL-
WD--KVNEEE---VGGEAL-
WGKV--G--AHAGEYGAEAL
WSKV--GGHA--GEYGAEAL

Child Alignment 2
WGKV--NVDEVG-GEAL
WDKV--NEEEVG-GEAL
WGKVGA-HAGEYGAEAL
WSKVGGHAGE-YGAEAL

Chosen Child Alignment
WGKV--NVDEVG-GEAL
WDKV--NEEEVG-GEAL
WGKVGA-HAGEYGAEAL
WSKVGGHAGE-YGAEAL

from Notredame (1996)

Bi04b_41

Uniform crossover in SAGA

• acts on 2 parent alignments (sexual)
• designed after natural crossover
• (multiple) exchanges between parents are promoted between zones of homology
• implemented modes: stochastic or semi-hill-climbing

Bi04b_42

Uniform crossover in SAGA, ctd.

* Position consistent between the two parents

Parent Alignment 1
WGKVNVDEV--GGEAL
WDKVNEEEV--GGEAL
WGKVGAHAGEYGAEAL
WSKVGGHAGEYGAEAL
  *        *

Parent Alignment 2
WG-KV--NVDEVGGE-AL
W-DKV--NEEEVGG-EAL
W-GKVGAHAGEYGAEA-L
-WSKVGGHAGEYGAEAL-
   *        *

Child Alignment 1
WGKV--NVDEVGGEAL
WDKV--NEEEVGGEAL
WGKVGAHAGEYGAEAL
WSKVGGHAGEYGAEAL
  *        *

Child Alignment 2
WG-KVNVDEV--GGE-AL
W-DKVNEEEV--GG-EAL
W-GKVGAHAGEYGAEA-L
-WSKVGGHAGEYGAEAL-
   *        *

Chosen Child Alignment
WGKV--NVDEVG-GEAL
WDKV--NEEEVG-GEAL
WGKVGA-HAGEYGAEAL
WSKVGGHAGE-YGAEAL

from Notredame (1996)

Bi04b_43

Gap insertion operator in SAGA

• acts on 1 parent alignment (asexual)
• split sequences into 2 groups (G1, G2) (ideally derived from the phylogenetic tree of the sequences)
• choose insertion point P1 randomly
• insert a random number of gaps at P1 into all sequences ∈ G1
• choose insertion point P2 randomly
• insert the same number of gaps at P2 into all sequences ∈ G2

⇒ all sequences increase in length by the chosen number of gaps

The stochastic mode above can be made semi hill climbing by selecting everything randomly as above, except for P1: try all possible P1 and take the best.

Bi04b_44

Gap insertion operator in SAGA, ctd.

Insertion of the gaps in the parent alignment (stochastic mode)

Parent alignment:
seq1  WGKVNVDEVGGEA-GL    G1
seq2  WDKVNEEEVGGEA-GL    G1
seq3  WGKVGAHAGEYGAEAL    G2
seq4  WSKVGGHAGEYGAEAL    G2
seq5  WAKVEADVAGHGQDIL    G2

Child alignment (gaps inserted at P1 in G1 and at P2 in G2):
seq1  WGKV--NVDEVGGEA-GL
seq2  WDKV--NEEEVGGEA-GL
seq3  WGKVGAHAGEYGAEAL--
seq4  WSKVGGHAGEYGAEAL--
seq5  WAKVEADVAGHGQDIL--

from Notredame (1996)
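The stochastic gap insertion can be sketched in a few lines. The function name and its arguments are hypothetical; groups are given as sets of row indices, and insertion points count the columns to the left of the inserted gaps. The call at the end reproduces the five-sequence example from Notredame (1996) with P1 = 4, P2 = 16 and two gaps.

```python
def insert_gaps(rows, group_1, group_2, p1, p2, n_gaps):
    """Insert n_gaps gaps at column p1 in G1 rows and at p2 in G2 rows."""
    gaps = "-" * n_gaps
    child = []
    for index, row in enumerate(rows):
        if index in group_1:
            point = p1
        elif index in group_2:
            point = p2
        else:
            raise ValueError("every sequence must belong to G1 or G2")
        child.append(row[:point] + gaps + row[point:])
    return child


parent = [
    "WGKVNVDEVGGEA-GL",
    "WDKVNEEEVGGEA-GL",
    "WGKVGAHAGEYGAEAL",
    "WSKVGGHAGEYGAEAL",
    "WAKVEADVAGHGQDIL",
]
child = insert_gaps(parent, {0, 1}, {2, 3, 4}, p1=4, p2=16, n_gaps=2)
print("\n".join(child))
```

In stochastic mode `p1`, `p2` and `n_gaps` would be drawn at random. For the semi hill climbing mode one could call the function once for every possible `p1` and keep the child with the best score.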

Bi04b_45

Gap insertion operator in SAGA, ctd.

Insertion of the gaps in the parent alignment (semi hill climbing mode)

Parent alignment:
seq1  WGKVNVDEVGGEA-GL    G1
seq2  WDKVNEEEVGGEA-GL    G1
seq3  WGKVGAHAGEYGAEAL    G2
seq4  WSKVGGHAGEYGAEAL    G2
seq5  WAKVEADVAGHGQDIL    G2

„sliding P1“: for each possible position of P1 generate 1 child!

--WGKVNVDEVGGEA-GL
--WDKVNEEEVGGEA-GL
WGKVGAHAGEYGAEAL--
WSKVGGHAGEYGAEAL--
WAKVEADVAGHGQDIL--

W--GKVNVDEVGGEA-GL
W--DKVNEEEVGGEA-GL
WGKVGAHAGEYGAEAL--
WSKVGGHAGEYGAEAL--
WAKVEADVAGHGQDIL--

...

WGKVNVDEVGGEA-GL--
WDKVNEEEVGGEA-GL--
WGKVGAHAGEYGAEAL--
WSKVGGHAGEYGAEAL--
WAKVEADVAGHGQDIL--

select optimum child

from Notredame (1996)

Bi04b_46

Block shuffling in SAGA

block definition, modified for shuffling operators: block = set of overlapping stretches of residues, each being delimited by a gap or by an end of sequence

WGKVN--VDEVGGEAL
WGKVGAHAGEYGAEAL
WDKV--NEEEVGGEAL
WSKVGGHAGEYGAEAL
WAKVEADVAGHGQDIL

[Figure: this alignment, shown twice]

from Notredame (1996)

Bi04b_47

Block shuffling in SAGA, ctd.

Shuffling type 1: Move a full block of gaps (or a full block of residues).

Parent:
WGKVN--VDEVGGEAL
WGKVGAHAGEYGAEAL
WDKV--NEEEVGGEAL
WSKVGGHAGEYGAEAL
WAKVEADVAGHGQDIL

Child:
WGKV--NVDEVGGEAL
WGKVGAHAGEYGAEAL
WDK--VNEEEVGGEAL
WSKVGGHAGEYGAEAL
WAKVEADVAGHGQDIL

from Notredame (1996)

Example shown: Move block of gaps to left by 1.
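Moving a block of gaps to the left by one column means that, in every row of the block, the residue directly left of the gap run hops to its right side. The helper below is a sketch with hypothetical names; it uses 0-based column indices and handles one row at a time, so the full block is moved by applying it to each affected row.

```python
def shift_gap_run_left(row, start, steps=1):
    """Move the gap run that begins at 0-based index `start` to the left."""
    if row[start] != "-" or (start > 0 and row[start - 1] == "-"):
        raise ValueError("start must be the first gap of a run")
    end = start
    while end < len(row) and row[end] == "-":
        end += 1
    if steps > start or "-" in row[start - steps:start]:
        raise ValueError("not enough residues to the left of the gap run")
    moved = row[start - steps:start]  # residues that hop over the gaps
    return row[:start - steps] + row[start:end] + moved + row[end:]


parent = [
    "WGKVN--VDEVGGEAL",
    "WGKVGAHAGEYGAEAL",
    "WDKV--NEEEVGGEAL",
    "WSKVGGHAGEYGAEAL",
    "WAKVEADVAGHGQDIL",
]
block = {0: 5, 2: 4}  # row index -> start column of its gap run
child = [shift_gap_run_left(row, block[i]) if i in block else row
         for i, row in enumerate(parent)]
print("\n".join(child))
```

The residue order within each sequence is unchanged; only the position of the gaps differs between parent and child.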

Bi04b_48

Block shuffling in SAGA, ctd.

Shuffling type 2: Split the block horizontally and move one of the sub-blocks to the left or to the right. The subdivision of a block is made according to the tree (cf. gap insertion operator).

Parent:
seq1  WGKVN--VDEVGGEAL
seq2  WGKVGAHAGEYGAEAL
seq3  WDKV--NEEEVGGEAL
seq4  WSKVGGHAGEYGAEAL
seq5  WAKVEADVAGHGQDIL

Child:
seq1  WGKV--NVDEVGGEAL
seq2  WGKVGAHAGEYGAEAL
seq3  WDKV--NEEEVGGEAL
seq4  WSKVGGHAGEYGAEAL
seq5  WAKVEADVAGHGQDIL

Example: shift left by 1 for the block in group G1 (the gaps in seq1); shift nothing for the block in group G2 (the gaps in seq3)

from Notredame (1996)

Bi04b_49

Block shuffling in SAGA, ctd.

Shuffling type 3: Split the block vertically and move one half to the left or to the right. The move can be made stochastically or in a semi-hill climbing way, looking for the best position.

Parent:
WGKVN--VDEVGGEAL
WGKVGAHAGEYGAEAL
WDKV--NEEEVGGEAL
WSKVGGHAGEYGAEAL
WAKVEADVAGHGQDIL

Child:
WGKVNV--DEVGGEAL
WGKVGAHAGEYGAEAL
WDKV-N-EEEVGGEAL
WSKVGGHAGEYGAEAL
WAKVEADVAGHGQDIL

from Notredame (1996)

Bi04b_50

Block searching & moving operator

[refer to conventional definition of block]

• select a substring (of random length) in 1 sequence
• search all other sequences for the best match to the substring
• add the substring found to the old one to generate a profile
• in each of the other sequences find the string best matching the profile and add it to the profile; the search extends only over a window around the profile
• move strings inside sequences to reconstruct the block

„... Therefore we designed a crude method that ...“

Bi04b_51

Block searching & moving operator, ctd.

„... Therefore we designed a crude method that ...“

This block searching mutation generates more dramatic changes than any of the other operators.

No further description given in original paper

„... Inspecting the above derivation one can easily see that ...“

Bi04b_52

Local optimal & suboptimal rearrangement

Authors suggest additional and very heuristic manipulations, such as:

• optimize gaps inside a given block
• exhaustive examination (all possibilities)
• local alignment via genetic algorithm (LAGA)

Bi04b_53

Dynamic scheduling of operators in SAGA

• 22 operators in total
• initially each operator has probability 1/22
• computes running averages of efficiency for each operator based on improvement achieved
• include last operator, second last, etc. with decreasing weights
• operator usage probability: p = (total improvement) / (number of children created)
• p = max(p, pmin), i.e. apply each operator at least with minimum usage probability
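A rough sketch of how such usage probabilities might be computed from per-operator statistics. It covers only the ratio of improvement to children and the floor pmin; the running average with decreasing weights for earlier operators is not modelled. The normalisation to a sum of 1, the renormalisation after applying the floor (which can leave a floored value slightly below pmin) and all names are assumptions of this example, and improvements are assumed to be non-negative.

```python
def operator_probabilities(improvement, children, p_min=0.01):
    """Usage probability per operator from improvement per child created."""
    efficiency = [imp / n if n else 0.0
                  for imp, n in zip(improvement, children)]
    total = sum(efficiency)
    if total == 0:
        # No information yet: every operator is equally likely.
        raw = [1.0 / len(efficiency)] * len(efficiency)
    else:
        raw = [e / total for e in efficiency]
    floored = [max(p, p_min) for p in raw]  # p = max(p, pmin)
    norm = sum(floored)
    return [p / norm for p in floored]


# Three operators: total improvement achieved and children created so far.
print(operator_probabilities([6.0, 2.0, 0.0], [2, 2, 4], p_min=0.05))
```

The floor keeps an operator that has not helped so far in play, so it can still be selected if it becomes useful later in the run.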

Bi04b_54

Self tuning of operator selection in SAGA

[Figure from Notredame (1996): control chart to monitor usage]

Bi04b_55

Performance of SAGA, Summary

• very satisfactory on test cases compared to other programs (e.g. CLUSTALW)
• satisfying and even superior to others when checked with alignments based on 3D structure (gold standard)
• the more sophisticated operators are essential: SAGA does not work with simple mutation & crossover!