gapped alignment theory - web.math.ku.dkweb.math.ku.dk/~richard/courses/binf_project/bjarni.pdf ·...
TRANSCRIPT
-
Gapped alignment theory
Mott R, Tribe R. (1999) Approximate statistics of gapped alignments. J Comput Biol. Spring;6(1):91-112.
and
Mott R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol. Jul 14;300(3):649-59.
Bjarni Vilhjá[email protected]
Bioinformatics centre
14 December 2004
Gapped alignment theory – p. 1/32
-
Overview
• Repetition of statistical theory for ungapped alignment.
• Statistical theory for gapped alignment
  - Maximum Likelihood method (Tommy’s paper)
  - Greedy Extension Model (GEM)
  - Statistical theory.
  - The θ and κ functions.
• Application of the model to real data.
• Comparison to profile methods.
• Concluding remarks
-
Ungapped alignment theory
The score Z for an ungapped alignment is approximately extreme value distributed
$$ \Pr(Z > t) \sim 1 - \exp\left(-Kmn\,e^{-\lambda t}\right) \qquad (1) $$
when n, m, t → ∞ and log m / log n → 1. That is, the log p-value is approximately linear for large t
$$ -\log\left(\Pr(Z > t)\right) \sim -\log(Kmn) + \lambda t. \qquad (2) $$
• Note that the score can only take discrete values. This can be fixed to some extent by carefully choosing the value of the K constant.
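As a rough illustration of equations (1) and (2), the tail probability can be evaluated numerically as below. This is only a minimal Python sketch: the function names are hypothetical and the parameter values are left to the caller, since the slides do not give any.

```python
import math

def evd_pvalue(t, m, n, K, lam):
    """Pr(Z > t) ~ 1 - exp(-K*m*n*exp(-lam*t)), as in equation (1)."""
    x = K * m * n * math.exp(-lam * t)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-x)

def neg_log_pvalue_linear(t, m, n, K, lam):
    """Large-t approximation of -log Pr(Z > t), as in equation (2)."""
    return -math.log(K * m * n) + lam * t
```

For large t the two functions should agree closely, because 1 − exp(−x) ≈ x when x is small.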
-
Ungapped alignment theory
• A bit more formal theory is needed before we continue.
Consider two independent finite sequences U_i and V_j, where p_a is the probability of letter a in U and q_a the probability of a in V, and a score S_ab for matching letters a and b such that the expected score is negative, i.e.
$$ \sum_{a,b} p_a q_b S_{ab} < 0. \qquad (3) $$
Furthermore, let us define a dot matrix
$$ M_{ij} \equiv S_{U_i, V_j}. \qquad (4) $$
-
Ungapped alignment theory
Now let us define sums of consecutive scores along a diagonal as
$$ Y_{ijk} = \begin{cases} 0 & k = 0 \\ \sum_{n=0}^{k-1} M_{i+n,\,j+n} & k > 0 \end{cases} \qquad (5) $$
The constrained maximum sum Z_ij is the maximum attained by Y_ijk for k ≤ T_ij, where T_ij is the first value of k for which Y_ijk is negative, i.e.
$$ Z_{ij} = \max_{k \le T_{ij}} Y_{ijk}. \qquad (6) $$
Define C_ij to be the coordinate where Y_ijk takes its maximum in the constrained sum.
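The constrained maximum sum can be sketched in a few lines of Python. This is an illustration only, under the assumption that Y_ijk is the sum of the first k dot-matrix entries along the diagonal starting at (i, j). The function name and the list-of-lists matrix layout are hypothetical.

```python
def constrained_max(M, i, j):
    """Return (Z_ij, k) for the diagonal of dot matrix M starting at (i, j).

    k is the number of cells in the best segment, so for k > 0 the segment
    ends at C_ij = (i + k - 1, j + k - 1).
    """
    total = 0    # running sum Y_ijk
    best = 0     # Y_ij0 = 0 is always a candidate
    best_k = 0
    k = 0
    while i + k < len(M) and j + k < len(M[0]):
        total += M[i + k][j + k]
        k += 1
        if total < 0:        # k has reached T_ij: stop scanning
            break
        if total > best:
            best, best_k = total, k
    return best, best_k
```

Stopping at the first negative partial sum is what distinguishes the constrained sum from a plain maximum over the whole diagonal: a high-scoring stretch further down the diagonal is not reached once the running sum has dropped below zero.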
-
Ungapped alignment theory
Under the assumptions mentioned previously, and assuming that the diagonals in the dot matrix are independent, the global maximum of the constrained sum
$$ Z_{\max} = \max_{ij} Z_{ij} \qquad (7) $$
is approximately extreme value distributed, i.e.
$$ \Pr(Z_{\max} > t) \sim 1 - \exp\left(-Kmn\,e^{-\lambda t}\right). \qquad (8) $$
-
Ungapped alignment theory
The λ and K constants
As we have seen in earlier lectures, the constants K and λ can be calculated. The value of λ is the positive solution of
$$ \sum_{x} h(x)\,e^{\lambda x} = 1 \qquad (9) $$
and it can be interpreted as a scaling constant for the score matrix used. Here h(x) is the probability of the score value x in the dot matrix M, i.e.
$$ \Pr(M = x) \equiv h(x) = \sum_{i,j:\,S_{ij} = x} p_i q_j. \qquad (10) $$
The K constant has a much more complex analytical formula, which I will not give here.
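To make the definition of λ concrete, the sketch below builds the score distribution h(x) from letter frequencies and a score matrix, then solves ∑ h(x) exp(λx) = 1 for its positive root by bisection. This is a minimal Python illustration, not the method used in the papers. The function names and the dictionary-based inputs are assumptions.

```python
import math

def score_distribution(p, q, S):
    """h(x): total probability p[a]*q[b] of all letter pairs with S[a][b] == x."""
    h = {}
    for a, pa in p.items():
        for b, qb in q.items():
            x = S[a][b]
            h[x] = h.get(x, 0.0) + pa * qb
    return h

def solve_lambda(h, tol=1e-12):
    """Positive root of sum_x h(x)*exp(lam*x) = 1, found by bisection."""
    mean = sum(x * px for x, px in h.items())
    if mean >= 0 or not any(x > 0 and px > 0 for x, px in h.items()):
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(px * math.exp(lam * x) for x, px in h.items()) - 1.0

    # f(0) = 0 and f is convex with negative slope at 0, so f is negative
    # just above 0 and positive beyond the root: double hi until f(hi) > 0.
    lo, hi = 0.0, 1.0
    while f(hi) <= 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For example, with four equally likely letters, a match score of +1 and a mismatch score of −1, h is {+1: 0.25, −1: 0.75}, and the equation reduces to a quadratic in exp(λ) whose non-trivial root is exp(λ) = 3.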
-
Ungapped alignment theory
Summing up
• This theory works well for ungapped alignments, under the assumption that n, m and the threshold t are all large (among other assumptions).
• In modern applications, however, gapped alignments are much more common.
• This has partly been solved by estimating the parameters of an extreme value distribution for different scoring schemes and gap penalties.
• Small errors in λ and K can result in large errors in the p-value.
• A better statistical theory for gapped alignments has long been missing.
-
Gapped alignments
Gaps are normally modelled in scoring schemes by an affine gap penalty, i.e.
$$ g(n) = A + Bn, \qquad (11) $$
where A and B are the gap-opening and gap-extension parameters.
-
Gapped alignments
Maximum Likelihood estimation
As shown by Mott (1992), it is possible to find parameters for a general extreme value distribution
$$ \Pr(Z_{\max} > t) \sim 1 - \exp\left(-e^{-(t - T)/S}\right), \qquad (12) $$
such that it can be used to model gaps to a certain extent. Here S is a scale parameter and T is a translation parameter. The best results were found by choosing
$$ S = s_1 \lambda \quad \text{and} \quad T = t_0 + \frac{t_1 + t_2 \log(mn)}{\lambda} \qquad (13) $$
where MLE is done on the parameter set (a0, a1, a2, b1).
-
Gapped alignments
Greedy Extension Model (GEM)
• In order to understand the ideas behind the formula for gapped alignments given in Mott (2000), it is important to understand GEM.
Using the same notation as previously, let us define the unconstrained maximum score as
$$ Z'_{ij} = \max_{k \ge 0} Y_{ijk}. \qquad (14) $$
Now, with the affine gap function g(n) as defined earlier, GEM works in 5 steps as follows.
-
Gapped alignments
Greedy Extension Model (GEM)
1. Consider all the HSPs in the dot matrix M, i.e. those coordinates for which Z_ij > t, where t is some appropriate threshold. Let Z_ij be the maximum sum for one such HSP, starting at (i, j) and ending at C_ij = (i + k − 1, j + k − 1).
2. (Greedy step) From (i + k, j + k), scan neighbouring diagonals to find the best place to extend the alignment. The best extension has score U_1, where
$$ U_1 = \max_{l > 0} \left( Z'_{i+k,\,j+k+l} - g(l),\; Z'_{i+k+l,\,j+k} - g(l) \right). \qquad (15) $$
-
Gapped alignments
Greedy Extension Model (GEM)
3. Repeat the greedy step to obtain U_1, U_2, ... and generate the U-walk U_1, U_1 + U_2, ...
4. Let T(Z) denote the first time the U-walk drops below −Z, i.e. the first time the total score Z + U_1 + ... becomes negative. Now define a W score
$$ W(Z) = Z + \max(0,\, U_1,\, U_1 + U_2,\, \ldots,\, U_1 + \cdots + U_{T(Z)}). \qquad (16) $$
5. Finally define W_GEM as the maximum of all the W scores, i.e.
$$ W_{GEM} = \max_{ij} W(Z_{ij}) \qquad (17) $$
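The greedy step and the W score can be sketched as follows. This is a hedged Python illustration rather than the authors' implementation, and all function names are hypothetical. It assumes an affine gap penalty g(l) = A + B·l and a dot matrix stored as a list of lists. It also assumes that diagonals falling outside the matrix are skipped, and that the U-walk is stopped once the running total Z + U_1 + ... first becomes negative, which is one reading of T(Z).

```python
def affine_gap(l, A, B):
    """Affine gap penalty g(l) = A + B*l."""
    return A + B * l

def unconstrained_max(M, r, c):
    """Z'_{rc}: largest partial sum (including the empty sum 0) down the diagonal from (r, c)."""
    best = total = 0
    while r < len(M) and c < len(M[0]):
        total += M[r][c]
        best = max(best, total)
        r += 1
        c += 1
    return best

def greedy_step(M, r, c, A, B):
    """One greedy step from cell (r, c), the cell just past the current segment.

    Returns the best value of Z' - g(l) over shifts l > 0 to the right or
    downwards, or None if no shifted diagonal lies inside the matrix.
    """
    rows, cols = len(M), len(M[0])
    best = None
    for l in range(1, max(rows - r, cols - c)):
        penalty = affine_gap(l, A, B)
        candidates = []
        if c + l < cols:
            candidates.append(unconstrained_max(M, r, c + l) - penalty)
        if r + l < rows:
            candidates.append(unconstrained_max(M, r + l, c) - penalty)
        for u in candidates:
            if best is None or u > best:
                best = u
    return best

def w_score(z, u_steps):
    """W(Z) = Z + max(0, U_1, U_1 + U_2, ...), stopping when Z + walk < 0."""
    walk = 0
    best = 0
    for u in u_steps:
        walk += u
        best = max(best, walk)
        if z + walk < 0:     # stopping time T(Z) reached
            break
    return z + best
```

W_GEM would then be the maximum of `w_score` over all HSPs, with the increments for each HSP produced by repeated calls to `greedy_step`.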
-
Gapped alignments
Greedy Extension Model (GEM)
• The GEM algorithm often yields an optimal local alignment, with score W_GEM.
• The algorithm is, however, only a heuristic and cannot be guaranteed to find the optimal local alignment.
• Large gap penalties result in good alignments.
• GEM is interesting because there exist probabilistic asymptotic results for it (i.e. for large n, m, t and a gap cost function g(n) we can estimate Pr(W_GEM > t)).
-
Gapped alignments
Statistical theory
• Assuming that the GEM alignment is close to the optimal local gapped alignment (e.g. the alignment found by the Smith-Waterman algorithm), we can estimate the probability of an optimal local gapped alignment.
• The probability distribution of the GEM score can be captured by changing only the values of the constants K and λ.
• These new K_g and λ_g values are related to the old K_u and λ_u values. (From now on g refers to the gapped model and u to the ungapped model.)
-
Gapped alignments
Statistical theory
In order to write out formulas for K_g and λ_g, let us define a parameter α
$$ \alpha = 2s \sum_{k>0} e^{-\lambda_u g(k)} \qquad (18) $$
where s and the entropy H are
$$ s = \sqrt{\frac{\delta K_u}{H\,(e^{\lambda_u \delta} - 1)}}, \qquad H = \sum_{x} x\,h(x)\,e^{\lambda_u x}. $$
Here δ is the smallest span of score values.
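The parameter α can be computed directly from these definitions. The Python sketch below is illustrative only and uses hypothetical function names. It reads s as sqrt(δ·K_u / (H·(exp(λ_u·δ) − 1))), and it assumes an affine gap penalty g(k) = A + B·k with B > 0, for which the sum over k is a geometric series with the closed form exp(−λ_u·A) / (exp(λ_u·B) − 1).

```python
import math

def entropy_H(h, lam_u):
    """H = sum_x x * h(x) * exp(lam_u * x), with h given as {score: probability}."""
    return sum(x * px * math.exp(lam_u * x) for x, px in h.items())

def s_param(K_u, lam_u, H, delta):
    """s = sqrt(delta * K_u / (H * (exp(lam_u * delta) - 1)))."""
    return math.sqrt(delta * K_u / (H * math.expm1(lam_u * delta)))

def gap_sum_affine(lam_u, A, B):
    """sum_{k>=1} exp(-lam_u * (A + B*k)), summed as a geometric series."""
    return math.exp(-lam_u * A) / math.expm1(lam_u * B)

def alpha_affine(K_u, lam_u, H, A, B, delta):
    """alpha = 2 * s * sum_{k>0} exp(-lam_u * g(k)) for g(k) = A + B*k."""
    return 2.0 * s_param(K_u, lam_u, H, delta) * gap_sum_affine(lam_u, A, B)
```

Because the gap penalty enters only through exp(−λ_u·g(k)), raising either A or B shrinks α towards 0, the ungapped limit.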
-
Gapped alignments
Statistical theory
Now K_g can be approximated under the GEM model as
$$ K_g = K_u\,\kappa(\alpha) \qquad (19) $$
and λ_g as
$$ \lambda_g = \lambda_u\,\theta(\alpha). \qquad (20) $$
Here α depends on the gap function: it is 0 only if no gaps are allowed, and it grows as the gap penalties are lowered.
• In practice the critical value for α lies in 0.3 < α_crit < 0.4.
• Under normal conditions in BLAST and FASTA, α < 0.2.
-
Gapped alignments
The θ and κ functions
• In the Mott (2000) paper the functions κ and θ are approximated as linear functions of α, whereas in Mott and Tribe (1999) more analytically complicated upper and lower bounds are given for κ and θ.
• When θ is estimated by the ML method, the estimate θ̂ is approximately linear as a function of f(n, m), where
$$ f(n, m) = \log(mn) \left( \frac{1}{n} + \frac{1}{m} \right). \qquad (21) $$
-
Gapped alignments
The θ and κ functions
-
Gapped alignments
The θ and κ functions
The scoring schemes in panels A and B of the previous figure were chosen to represent extreme behaviour of θ. From this we can expect a straight line fit, i.e.
$$ \theta = \theta_{\inf} + \beta f(n, m). \qquad (22) $$
Furthermore, β̂ was fitted to be approximately linear in terms of α and 1/H, and θ̂_inf to be approximately linear in terms of α.
Similar analysis can be applied to κ as well.
-
Gapped alignments
The θ and κ functions
-
Gapped alignments
The θ and κ functions
-
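Gapped alignments
The θ and κ functions
Overall, the best-fitting equation for θ gives
$$ \lambda_g = \lambda_u \left( 1.013 - 2.61\alpha + f(n, m) \left( -0.76 + 9.34\alpha + \frac{1.12}{H} \right) \right) \qquad (23) $$
and similarly, by analysing log κ, we get
$$ K_g = K_u \exp\left( 0.26 - 18.92\alpha + f(n, m) \left( -1.76 + 32.69\alpha + 192.52\alpha^2 + \frac{3.24}{H} \right) \right) \qquad (24) $$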
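Equations (23) and (24) translate almost line by line into code. The Python sketch below is an illustration with hypothetical function names. The coefficients are copied from the two equations as they appear on this slide, and the final p-value assumes the gapped score follows the extreme value form of equation (8) with K_g and λ_g in place of K and λ.

```python
import math

def f_nm(n, m):
    """f(n, m) = log(mn) * (1/n + 1/m), as in equation (21)."""
    return math.log(m * n) * (1.0 / n + 1.0 / m)

def theta(alpha, f, H):
    """lambda_g / lambda_u from equation (23)."""
    return 1.013 - 2.61 * alpha + f * (-0.76 + 9.34 * alpha + 1.12 / H)

def kappa(alpha, f, H):
    """K_g / K_u from equation (24)."""
    return math.exp(0.26 - 18.92 * alpha
                    + f * (-1.76 + 32.69 * alpha
                           + 192.52 * alpha ** 2 + 3.24 / H))

def gapped_pvalue(t, n, m, K_u, lam_u, alpha, H):
    """1 - exp(-K_g*m*n*exp(-lam_g*t)) with K_g and lam_g from the fits."""
    f = f_nm(n, m)
    lam_g = lam_u * theta(alpha, f, H)
    K_g = K_u * kappa(alpha, f, H)
    return -math.expm1(-K_g * m * n * math.exp(-lam_g * t))
```

Since f(n, m) tends to 0 for long sequences, θ and κ then depend on α alone, which matches the linear approximation in α described for Mott (2000).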
-
Gapped alignments
The θ and κ functions
-
Application to real data
• In order to test the formula, its predictions were compared with those of the SSEARCH program.
• SSEARCH is a sensitive method which fits an extreme value distribution to the scores.
• The PDB40D_J dataset, which comprises 935 sequences of known structure from the SCOP database, was used.
• A two-step procedure was used, where the K_g and λ_g values were first used to filter out the worst scoring alignments (p0 > 0.001). Afterwards a more sensitive method, which took sequence composition into account, was used to calculate a new p-value p1.
• In general p1 > p0.
-
Application to real data
• The formula outperformed SSEARCH, as well as FASTA ktup = 1 and WU-BLAST.
-
Comparison to profile methods
• A profile is a position-dependent score matrix S_i(a) together with a position-independent gap function.
• The density function for a score x is
$$ h(x) = \sum_{i,a:\,S_i(a) = x} \frac{p_a}{L}, \qquad (25) $$
where p_a is the probability of a occurring at a given position and L is the length of the profile. Using this formula, one can derive λ_u by plugging it into equation (9).
• Since L is normally small we can approximate p-values using equations (23) and (24).
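The profile score density of equation (25) can be tabulated as in the sketch below. This is a minimal Python illustration with a hypothetical function name and input layout: the profile is assumed to be a list of L columns, each a dictionary mapping a letter a to its score S_i(a), and p maps each letter to its background probability.

```python
def profile_score_distribution(profile, p):
    """h(x) = sum over columns i and letters a with S_i(a) == x of p[a] / L."""
    L = len(profile)
    h = {}
    for column in profile:
        for a, score in column.items():
            h[score] = h.get(score, 0.0) + p[a] / L
    return h
```

The resulting dictionary has the same shape as the score distribution of a fixed score matrix, so λ_u can be obtained from it by solving equation (9) in the same way.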
-
Comparison to profile methods
• To test whether the formula could be used to assess the statistical significance of profile methods, 1364 profiles were constructed by running PSI-BLAST using PFAM-A domains as seeds.
• For each profile, 10000 comparisons were made between a shuffled version of the profile and a randomly generated sequence (350 aa).
• An extreme value distribution was fitted to the data and compared with the values given by the formula.
• Triangles denote short profiles (L < 75).
-
Comparison to profile methods
-
Concluding remarks
The good stuff
• Gapped alignment behaviour is well characterized by a single parameter α, given ungapped parameters such as λ_u, K_u and H.
• The formula is accurate for practical applications, provided α < 0.2.
• Results from applying the formula to the PDB40D_J database indicated that the formula works well.
• Improvements in specificity (a good statistical model) can ultimately lead to improvements in sensitivity via better profile generation.
-
Concluding remarks
The not so good stuff
• Extremes in composition were not captured completely by the model.
• The computational cost of evaluating λ_g and K_g is considerable when new values are calculated for each alignment.
• Real protein sequences have complex dependencies which are not captured by a random sequence model.
-
The end
Thanks
and
Merry Christmas