gapped alignment theory - web.math.ku.dkweb.math.ku.dk/~richard/courses/binf_project/bjarni.pdf ·...

Gapped alignment theory

Mott R, Tribe R. (1999) Approximate statistics of gapped alignments. J Comput Biol. Spring;6(1):91-112.

and

Mott R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol. Jul 14;300(3):649-59.

Bjarni Vilhjá[email protected]

Bioinformatics centre

14 December 2004

Gapped alignment theory – p. 1/32

Overview

• Repetition of statistical theory for ungapped alignment.
• Statistical theory for gapped alignment
  - Maximum Likelihood method (Tommy’s paper)
  - Greedy Extension Model (GEM)
  - Statistical theory.
  - The θ and κ functions.
• Application of the model to real data.
• Comparison to profile methods.
• Concluding remarks


Ungapped alignment theory

The score Z of an ungapped alignment is approximately extreme value distributed,

\[ \Pr(Z > t) \sim 1 - \exp\left(-K m n e^{-\lambda t}\right) \tag{1} \]

when n, m, t → ∞ and log m / log n → 1. That is, the negative log p-value is approximately linear in t for large t:

\[ -\log \Pr(Z > t) \sim -\log(K m n) + \lambda t. \tag{2} \]

• Note that the score can only take discrete values. This can be corrected to some extent by choosing the value of the constant K carefully.
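As a rough illustration of equations (1) and (2), the following Python sketch evaluates the extreme value approximation and its large-t linearisation. The function names and the numerical values of K, λ, m and n are assumptions chosen only for illustration; they do not come from the slides.

```python
import math

def ungapped_pvalue(t, m, n, K, lam):
    """Equation (1): Pr(Z > t) ~ 1 - exp(-K m n exp(-lam t))."""
    expected_hits = K * m * n * math.exp(-lam * t)
    # -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x
    return -math.expm1(-expected_hits)

def neg_log_pvalue_linear(t, m, n, K, lam):
    """Equation (2): large-t linear approximation of -log Pr(Z > t)."""
    return -math.log(K * m * n) + lam * t

# Hypothetical parameter values, for illustration only
K, lam, m, n = 0.1, 0.3, 300, 300
for t in (40, 60, 80):
    exact = -math.log(ungapped_pvalue(t, m, n, K, lam))
    approx = neg_log_pvalue_linear(t, m, n, K, lam)
    print(t, round(exact, 4), round(approx, 4))
```

The two columns should agree more and more closely as t grows, which is the sense in which (2) holds "for large t".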


Ungapped alignment theory

• A bit more formal theory is needed before we continue.

Consider two independent finite sequences U_i and V_j, where p_a is the probability of letter a in U and q_a that of a in V, together with a score S_{ab} for matching the letters a and b, such that the expected score is negative, i.e.

\[ \sum_{a,b} p_a q_b S_{ab} < 0. \tag{3} \]

Furthermore, let us define a dot matrix

\[ M_{ij} \equiv S_{U_i, V_j}. \tag{4} \]
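The setup in (3) and (4) can be sketched in a few lines of Python. The toy DNA alphabet, the uniform letter frequencies and the +1/-1 scoring scheme are assumptions made for the example, as are the function names.

```python
def expected_score(p, q, S):
    """Left-hand side of (3): sum over a, b of p_a q_b S_ab."""
    return sum(p[a] * q[b] * S[a][b] for a in p for b in q)

def dot_matrix(U, V, S):
    """Equation (4): M[i][j] = S[U[i]][V[j]]."""
    return [[S[u][v] for v in V] for u in U]

# Toy example (assumed): uniform DNA letters, match +1, mismatch -1
alphabet = "ACGT"
p = {a: 0.25 for a in alphabet}
q = {a: 0.25 for a in alphabet}
S = {a: {b: (1 if a == b else -1) for b in alphabet} for a in alphabet}

assert expected_score(p, q, S) < 0   # condition (3) holds for this scheme
M = dot_matrix("ACGT", "AGGT", S)
for row in M:
    print(row)
```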


Ungapped alignment theory

Now let us define sums of consecutive scores along a diagonal as

\[ Y_{ijk} = \begin{cases} 0 & k = 0 \\ \sum_{n=0}^{k-1} M_{i+n,\,j+n} & k > 0 \end{cases} \tag{5} \]

The constrained maximum sum Z_{ij} is the maximum attained by Y_{ijk} over k ≤ T_{ij}, where T_{ij} is the first value of k for which Y_{ijk} is negative, i.e.

\[ Z_{ij} = \max_{k \le T_{ij}} Y_{ijk}. \tag{6} \]

Define C_{ij} to be the coordinate where Y_{ijk} attains its maximum in the constrained sum.
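A minimal Python sketch of the constrained maximum, assuming that Y_ijk is the sum of the first k entries on the diagonal starting at (i, j). The function name and the small test matrix are illustrative assumptions.

```python
def constrained_max(M, i, j):
    """Return (Z_ij, k) for the diagonal of M starting at (i, j).

    Y_ijk is the sum of the first k diagonal entries. The scan stops at
    T_ij, the first k for which Y_ijk is negative, or at the matrix edge.
    k is the number of cells in the best segment, so for k > 0 the segment
    ends at C_ij = (i + k - 1, j + k - 1).
    """
    y, best, best_k, k = 0, 0, 0, 0
    while i + k < len(M) and j + k < len(M[0]):
        y += M[i + k][j + k]
        k += 1
        if y < 0:            # reached T_ij: stop scanning this diagonal
            break
        if y > best:
            best, best_k = y, k
    return best, best_k

# Hypothetical 3x3 matrix whose main diagonal is 1, -2, 5
M = [[1, -1, -1],
     [-1, -2, -1],
     [-1, -1, 5]]
print(constrained_max(M, 0, 0))   # the walk goes negative before reaching the 5
```

Note how the constraint matters: the unconstrained maximum on this diagonal would include the final 5, but the constrained one stops as soon as the running sum is negative.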


Ungapped alignment theory

Under the assumptions mentioned previously, and assuming that the diagonals in the dot matrix are independent, the global maximum of the constrained sum

\[ Z_{\max} = \max_{i,j} Z_{ij} \tag{7} \]

is approximately extreme value distributed, i.e.

\[ \Pr(Z_{\max} > t) \sim 1 - \exp\left(-K m n e^{-\lambda t}\right). \tag{8} \]


Ungapped alignment theory
The λ and K constants

As we have seen in earlier lectures, the constants K and λ can be calculated. The value of λ is found by solving

\[ \sum_{x} h(x) e^{\lambda x} = \sum_{i,j} p_i q_j e^{\lambda S_{ij}} = 1 \tag{9} \]

for the positive root, and it can be interpreted as a scaling constant for the score matrix used. Here h(x) is the probability density of the score values M, i.e.

\[ \Pr(M = x) \equiv h(x) = \sum_{i,j:\, S_{ij} = x} p_i q_j. \tag{10} \]

The K constant has a much more complex analytical formula, which I will not give here.
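Equation (9) has no closed-form solution in general, but the positive root is easy to find numerically. The sketch below uses plain bisection; the function names, the bracketing strategy and the toy +1/-1 scoring scheme are assumptions for the example, not the method used in the papers.

```python
import math
from collections import defaultdict

def score_distribution(p, q, S):
    """Equation (10): h(x) = sum of p_i q_j over pairs with S_ij = x."""
    h = defaultdict(float)
    for a, pa in p.items():
        for b, qb in q.items():
            h[S[a][b]] += pa * qb
    return dict(h)

def solve_lambda(h, tol=1e-12):
    """Positive root of sum_x h(x) exp(lam x) = 1, found by bisection.

    Assumes a negative expected score and at least one positive score, so
    the left-hand side dips below 1 just right of 0 and then grows again.
    """
    def f(lam):
        return sum(hx * math.exp(lam * x) for x, hx in h.items()) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) < 0:              # expand until the root is bracketed
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy scheme (assumed): uniform DNA, match +1, mismatch -1
alphabet = "ACGT"
p = {a: 0.25 for a in alphabet}
S = {a: {b: (1 if a == b else -1) for b in alphabet} for a in alphabet}
h = score_distribution(p, p, S)
lam = solve_lambda(h)
print(h, lam)
```

For this toy scheme the equation reduces to 0.25 e^λ + 0.75 e^(-λ) = 1, whose positive root is λ = log 3, which gives a convenient check on the solver.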


Ungapped alignment theory
Summing up

• This theory works well for ungapped alignments, provided that n, m and the threshold t are all large (and that further assumptions hold).
• In modern applications, however, gapped alignments are much more common.
• This has partly been solved by estimating the parameters of an extreme value distribution for different scoring schemes and gap penalties.
• Small errors in λ and K can result in large errors in the p-value.
• A better statistical theory for gapped alignments has long been missing.


Gapped alignments

Gaps are normally modelled in scoring schemes by an affine gap penalty, i.e.

\[ g(n) = A + Bn, \tag{11} \]

where A and B are the gap-opening and gap-extension parameters.


Gapped alignments
Maximum Likelihood estimation

As shown by Mott (1992), it is possible to find parameters for a general extreme value distribution

\[ \Pr(Z_{\max} > t) \sim 1 - \exp\left(-e^{-(t - T)/S}\right), \tag{12} \]

such that it can be used to model gaps to a certain extent. Here S is a scale parameter and T is a translation parameter. The best results were found by choosing

\[ S = s_1 \lambda \quad \text{and} \quad T = t_0 + \frac{t_1 + t_2 \log(mn)}{\lambda}, \tag{13} \]

where MLE is done on the parameter set (a_0, a_1, a_2, b_1).

Gapped alignments
Greedy Extension Model (GEM)

• In order to understand the ideas behind the formula for gapped alignments given in Mott (2000), it is important to understand GEM.

Using the same notation as previously, let us define the unconstrained maximum score as

\[ Z'_{ij} = \max_{k \ge 0} Y_{ijk}. \tag{14} \]

Now, with the affine gap function g(n) as defined earlier, GEM works in 5 steps as follows.



1. Consider all the HSPs in the dot matrix M, i.e. those coordinates for which Z_{ij} > t, where t is some appropriate threshold. Let Z_{ij} be the maximum sum for one such HSP, starting at (i, j) and ending at C_{ij} = (i + k − 1, j + k − 1).

2. (Greedy step) From (i + k, j + k), scan neighbouring diagonals to find the best place to extend the alignment. This place is found by maximizing U_1, where

\[ U_1 = \max_{l > 0} \left( Z'_{i+k,\,j+k+l} - g(l),\; Z'_{i+k+l,\,j+k} - g(l) \right). \tag{15} \]

3. Repeat the greedy step to obtain U_1, U_2, . . . and generate the U-walk U_1, U_1 + U_2, . . .

4. Let T(Z) denote the first time the U-walk drops below −Z, i.e. the first time the running total Z + U_1 + · · · becomes negative. Now define a W score

\[ W(Z) = Z + \max\left(0,\; U_1,\; U_1 + U_2,\; \ldots,\; U_1 + \cdots + U_{T(Z)}\right). \tag{16} \]

5. Finally, define W_GEM as the maximum of all the W scores, i.e.

\[ W_{GEM} = \max_{i,j} W(Z_{ij}). \tag{17} \]
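Steps 4 and 5 can be sketched as follows, taking the greedy increments U_1, U_2, . . . of each HSP as given. This sketch assumes that the stopping time T(Z) is the first step at which the running total Z + U_1 + · · · becomes negative; the function names and the example numbers are hypothetical.

```python
from itertools import accumulate

def w_score(Z, U):
    """Equation (16): W(Z) = Z + max(0, U_1, U_1 + U_2, ..., up to T(Z)).

    U holds the greedy increments U_1, U_2, ... for one HSP. The walk is
    followed until its partial sum first drops below -Z (assumed stopping
    rule), and the best partial sum seen so far is added to Z.
    """
    best = 0
    for partial in accumulate(U):
        best = max(best, partial)
        if partial < -Z:      # time T(Z): stop following the walk
            break
    return Z + best

def w_gem(hsps):
    """Equation (17): maximum W score over all HSPs, given (Z, U) pairs."""
    return max(w_score(Z, U) for Z, U in hsps)

# Hypothetical HSP scores and greedy increments
hsps = [(10, [3, -5, 4, -20, 50]), (12, [-1, -2]), (8, [2, 2, -1])]
print([w_score(Z, U) for Z, U in hsps], w_gem(hsps))
```

In the first HSP the walk falls below −Z at the fourth step, so the large fifth increment is never reached; this is what distinguishes W(Z) from an unconstrained maximum over the walk.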



• The GEM algorithm often yields an optimal local alignment, with score W_GEM.
• The algorithm is, however, only a heuristic and cannot be guaranteed to find the optimal local alignment.
• Large gap penalties result in good alignments.
• GEM is interesting because there exist probabilistic asymptotic results for it (i.e. for large n, m, t and a gap cost function g(n) we can estimate Pr(W_GEM > t)).


Gapped alignments
Statistical theory

• Assuming that the GEM alignment is close to the optimal local gapped alignment (e.g. the alignment found by the Smith-Waterman algorithm), we can estimate the probability of an optimal local gapped alignment.
• The probability distribution of the GEM score can be captured by changing only the values of the constants K and λ.
• These new K_g and λ_g values are related to the old K_u and λ_u values. (From now on g refers to the gapped model and u to the ungapped model.)



In order to write out formulas for K_g and λ_g, let us define a parameter α,

\[ \alpha = 2s \sum_{k > 0} e^{-\lambda_u g(k)}, \tag{18} \]

where s and the entropy H are

\[ s = \sqrt{\frac{\delta K_u}{H \left(e^{\lambda_u \delta} - 1\right)}}, \qquad H = \sum_{x} x\, h(x)\, e^{\lambda_u x}. \]

Here δ is the smallest span of score values.
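For an affine gap penalty g(k) = A + Bk the sum in (18) is a geometric series, so α can be evaluated in closed form. The sketch below assumes this affine form; the function names and the numerical inputs (including K_u = 0.1 and δ = 1) are hypothetical values chosen for illustration.

```python
import math

def entropy(h, lam_u):
    """H = sum_x x h(x) exp(lam_u x), with h given as {score: probability}."""
    return sum(x * hx * math.exp(lam_u * x) for x, hx in h.items())

def gap_sum_affine(lam_u, A, B):
    """Closed form of sum_{k>0} exp(-lam_u (A + B k)), a geometric series."""
    r = math.exp(-lam_u * B)
    return math.exp(-lam_u * A) * r / (1.0 - r)

def alpha(h, lam_u, K_u, A, B, delta=1.0):
    """Equation (18), with s = sqrt(delta K_u / (H (exp(lam_u delta) - 1)))."""
    H = entropy(h, lam_u)
    s = math.sqrt(delta * K_u / (H * math.expm1(lam_u * delta)))
    return 2.0 * s * gap_sum_affine(lam_u, A, B)

# Toy inputs (assumed): match +1 / mismatch -1 on uniform DNA, lam_u = log 3
h = {1: 0.25, -1: 0.75}
lam_u = math.log(3)
for A in (2, 5, 10):
    print(A, alpha(h, lam_u, K_u=0.1, A=A, B=1))
```

The printed values shrink as the gap-opening penalty A grows, matching the statement that α tends to 0 when gaps become prohibitively expensive.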


Now K_g can be approximated under the GEM model as

\[ K_g = K_u\, \kappa(\alpha) \tag{19} \]

and λ_g as

\[ \lambda_g = \lambda_u\, \theta(\alpha), \tag{20} \]

where α depends on the gap function: it is 0 only if no gaps are allowed, and it grows as the gap penalties are lowered.

• In practice the critical value of α lies in 0.3 < α_crit < 0.4.
• Under normal conditions in BLAST and FASTA, α < 0.2.


Gapped alignments
The θ and κ functions

• In the Mott (2000) paper the functions κ and θ are approximated as linear functions of α, whereas in Mott and Tribe (1999) more analytically complicated upper and lower bounds are given for κ and θ.
• When θ̂ is estimated by the ML method, θ̂ is approximately linear as a function of f(n, m), where

\[ f(n, m) = \log(mn) \left( \frac{1}{n} + \frac{1}{m} \right). \tag{21} \]



The scoring schemes in A and B in the previous figure were chosen to represent extreme behaviour of θ. From this we can expect a straight line fit, i.e.

\[ \theta = \theta_{\inf} + \beta f(n, m). \tag{22} \]

Furthermore, β̂ was fitted to be approximately linear in terms of α and 1/H, and θ̂_inf to be approximately linear in terms of α. A similar analysis can be applied to κ as well.



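Overall, the best-fitting equation for θ gives

\[ \lambda_g = \lambda_u \left( 1.013 - 2.61\alpha + f(n, m) \left( -0.76 + 9.34\alpha + \frac{1.12}{H} \right) \right) \tag{23} \]

and similarly, by analysing log κ, we get

\[ K_g = K_u \exp\left( 0.26 - 18.92\alpha + f(n, m) \left( -1.76 + 32.69\alpha + 192.52\alpha^2 + \frac{3.24}{H} \right) \right). \tag{24} \]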
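Equations (21), (23) and (24) translate directly into code. The sketch below also plugs the resulting K_g and λ_g into the extreme value form of equation (8), on the assumption (stated earlier in the slides) that the gapped score distribution differs from the ungapped one only in these two constants. Function names and the example inputs are hypothetical.

```python
import math

def f(n, m):
    """Equation (21): f(n, m) = log(mn) (1/n + 1/m)."""
    return math.log(m * n) * (1.0 / n + 1.0 / m)

def gapped_lambda(lam_u, alpha, H, n, m):
    """Equation (23): lambda_g from lambda_u, alpha, H and the lengths."""
    return lam_u * (1.013 - 2.61 * alpha
                    + f(n, m) * (-0.76 + 9.34 * alpha + 1.12 / H))

def gapped_K(K_u, alpha, H, n, m):
    """Equation (24): K_g from K_u, alpha, H and the lengths."""
    return K_u * math.exp(0.26 - 18.92 * alpha
                          + f(n, m) * (-1.76 + 32.69 * alpha
                                       + 192.52 * alpha ** 2 + 3.24 / H))

def gapped_pvalue(t, n, m, lam_u, K_u, alpha, H):
    """Extreme value p-value (form of equation (8)) with K_g and lambda_g."""
    lam_g = gapped_lambda(lam_u, alpha, H, n, m)
    K_g = gapped_K(K_u, alpha, H, n, m)
    return -math.expm1(-K_g * m * n * math.exp(-lam_g * t))

# Hypothetical inputs, for illustration only
print(gapped_lambda(0.3, 0.1, 0.5, 300, 350))
print(gapped_K(0.1, 0.1, 0.5, 300, 350))
print(gapped_pvalue(60, 300, 350, lam_u=0.3, K_u=0.1, alpha=0.1, H=0.5))
```

Because both formulas are closed-form in α, H, n and m, the constants can be recomputed per alignment without any simulation.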


Application to real data

• To test the formula, its predictions were compared with those of the SSEARCH program.
• SSEARCH is a sensitive method which fits an extreme value distribution to the scores.
• The PDB40D_J dataset, which comprises 935 sequences of known structure from the SCOP database, was used.
• A two-step procedure was used: the K_g and λ_g values were first used to filter out the worst-scoring alignments (p_0 > 0.001). Afterwards a more sensitive method, which took sequence composition into account, was used to calculate a new p-value p_1.
• In general p_1 > p_0.

Application to real data

• The formula outperformed SSEARCH, as well asFASTA ktup = 1 and WU-BLAST.


Comparison to profile methods

• A profile is a position-dependent score matrix S_i(a), together with a position-independent gap function.
• The density function for a score x is

\[ h(x) = \sum_{i,a:\, S_i(a) = x} \frac{p_a}{L}, \tag{25} \]

where p_a is the probability of a occurring at a given position and L is the length of the profile. Using this formula, one can derive a λ_u by plugging it into equation (9).

• Since L is normally small we can approximate p-values using equations (23) and (24).
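A small Python sketch of equation (25), assuming that L is the number of profile positions and that the profile is stored as one score table per position. The data layout, the function name and the two-column toy profile are assumptions for the example.

```python
from collections import defaultdict

def profile_score_distribution(profile, p):
    """Equation (25): h(x) = sum over (i, a) with S_i(a) = x of p_a / L.

    profile is a list of L dicts with profile[i][a] = S_i(a);
    p[a] is the background probability of letter a.
    """
    L = len(profile)
    h = defaultdict(float)
    for column in profile:
        for a, score in column.items():
            h[score] += p[a] / L
    return dict(h)

# Toy profile (assumed): two positions over a uniform DNA alphabet
p = {a: 0.25 for a in "ACGT"}
profile = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 1, "G": -1, "T": -1},
]
h = profile_score_distribution(profile, p)
print(h)
```

The resulting h can be passed to any solver for equation (9) to obtain λ_u for the profile.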


Comparison to profile methods

• To test whether the formula could be used to assess statistical significance of profile methods, 1364 profiles were constructed by running PSI-BLAST using PFAM-A domains as seeds.
• For each profile, 10000 comparisons were made between a shuffled version of the profile and a randomly generated sequence (350 aa).
• An extreme value distribution was fit to the data, and compared with the values given by the formula.
• Triangles denote short profiles (L < 75).


Comparison to profile methods


Concluding remarks
The good stuff

• Gapped alignment behaviour is well characterized by a single parameter α, given ungapped parameters such as λ_u, K_u and H.
• The formula is accurate for practical applications, provided α < 0.2.
• Results from applying the formula to the PDB40D_J database indicated that the formula works well.
• Improvements in specificity (a good statistical model) can ultimately lead to improvements in sensitivity via better profile generation.


Concluding remarks
The not so good stuff

• Extremes in composition were not captured completely by the model.
• The computational cost of evaluating λ_g and K_g is considerable when new values are calculated for each alignment.
• Real protein sequences have complex dependencies which are not captured by a random sequence model.


The end

Thanks

and

Merry Christmas

