Gapped alignment theory

Mott R, Tribe R. (1999) Approximate statistics of gapped alignments. J Comput Biol. Spring;6(1):91-112.
and
Mott R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol. Jul 14;300(3):649-59.

Bjarni Vilhjálmsson
Bioinformatics centre
14 December 2004



  • Overview
    • Recap of the statistical theory for ungapped alignment.
    • Statistical theory for gapped alignment
      - Maximum Likelihood method (Tommy’s paper)
      - Greedy Extension Model (GEM)
      - Statistical theory.
      - The θ and κ functions.
    • Application of the model to real data.
    • Comparison to profile methods.
    • Concluding remarks

  • Ungapped alignment theory
    The score Z of an ungapped alignment of two sequences of lengths m and n is approximately extreme value distributed,
    \[ \Pr(Z > t) \sim 1 - \exp\left(-K m n e^{-\lambda t}\right) \tag{1} \]
    when n, m, t → ∞ and log m / log n → 1. That is, the negative log p-value is approximately linear in t for large t:
    \[ -\log \Pr(Z > t) \sim -\log(K m n) + \lambda t. \tag{2} \]
    • Note that the score can only take discrete values. This can be corrected to some extent by choosing the value of the constant K carefully.
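
As a rough illustration of equations (1) and (2), the sketch below evaluates the extreme value p-value and its large-t linear approximation. The function names and the parameter values (K, λ, m, n) are hypothetical and chosen only to show that the two expressions agree as t grows.

```python
import math

def ungapped_pvalue(t, m, n, K, lam):
    """P(Z > t) under equation (1): 1 - exp(-K*m*n*exp(-lam*t))."""
    expected_hits = K * m * n * math.exp(-lam * t)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-expected_hits)

def log_linear_approx(t, m, n, K, lam):
    """-log P(Z > t) under the large-t approximation of equation (2)."""
    return -math.log(K * m * n) + lam * t

# Hypothetical parameter values, for illustration only.
K, lam, m, n = 0.1, 0.3, 300, 300
for t in (30, 50, 70):
    exact = -math.log(ungapped_pvalue(t, m, n, K, lam))
    approx = log_linear_approx(t, m, n, K, lam)
    print(t, round(exact, 4), round(approx, 4))
```

The gap between the two columns shrinks as t increases, which is the sense in which the log p-value is "approximately linear".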

  • Ungapped alignment theory
    • A bit more formal theory is needed before we continue.
    Let U_i and V_j be two independent finite sequences, where p_a is the probability of letter a in U and q_a the probability of a in V. Let S_ab be the score for matching letters a and b, chosen such that the expected score is negative, i.e.
    \[ \sum_{a,b} p_a q_b S_{ab} < 0. \tag{3} \]
    Furthermore, let us define a dot matrix
    \[ M_{ij} \equiv S_{U_i, V_j}. \tag{4} \]

  • Ungapped alignment theory
    Now let us define sums of consecutive scores along a diagonal as
    \[ Y_{ijk} = \begin{cases} 0 & k = 0 \\ \sum_{n=0}^{k-1} M_{i+n,\,j+n} & k > 0 \end{cases} \tag{5} \]
    The constrained maximum sum Z_ij is the maximum attained by Y_ijk for k ≤ T_ij, where T_ij is the first value of k for which Y_ijk is negative, i.e.
    \[ Z_{ij} = \max_{k \le T_{ij}} Y_{ijk}. \tag{6} \]
    Define C_ij to be the coordinate at which Y_ijk attains this constrained maximum.
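
The definitions of Y_ijk, T_ij and Z_ij can be sketched in a few lines. This is a minimal illustration, assuming that Y_ijk sums the k dot-matrix cells along the diagonal starting at (i, j); the function names and the match/mismatch scores are hypothetical.

```python
def diagonal_scores(u, v, i, j, score):
    """Scores M[i+n][j+n] along the diagonal starting at (i, j)."""
    length = min(len(u) - i, len(v) - j)
    return [score(u[i + n], v[j + n]) for n in range(length)]

def constrained_max(u, v, i, j, score):
    """Return (Z_ij, k), where k is the number of cells giving the maximum.

    Y_ij0 = 0 and each step adds one diagonal cell; the scan stops at
    T_ij, the first k at which the running sum is negative.
    """
    best, best_k, running = 0, 0, 0
    for k, s in enumerate(diagonal_scores(u, v, i, j, score), start=1):
        running += s
        if running < 0:      # k == T_ij: stop scanning this diagonal
            break
        if running > best:
            best, best_k = running, k
    return best, best_k

def z_max(u, v, score):
    """Global maximum of the constrained sums over all start points."""
    return max(constrained_max(u, v, i, j, score)[0]
               for i in range(len(u)) for j in range(len(v)))

# Hypothetical match/mismatch scores with negative expected value.
def simple_score(a, b):
    return 2 if a == b else -3

print(constrained_max("ACGTAC", "TTACGT", 0, 2, simple_score))
print(z_max("ACGTAC", "TTACGT", simple_score))
```

Because the running sum at T_ij is negative and Y_ij0 = 0, stopping at the first negative value never discards the constrained maximum.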

  • Ungapped alignment theory
    Under the assumptions mentioned previously, and assuming that the diagonals of the dot matrix are independent, the global maximum of the constrained sums
    \[ Z_{\max} = \max_{i,j} Z_{ij} \tag{7} \]
    is approximately extreme value distributed, i.e.
    \[ \Pr(Z_{\max} > t) \sim 1 - \exp\left(-K m n e^{-\lambda t}\right). \tag{8} \]

  • Ungapped alignment theory: The λ and K constants
    As we have seen in earlier lectures, the constants K and λ can be calculated. The value of λ is the positive solution of
    \[ \sum_{x} h(x) e^{\lambda x} = 1, \quad \text{equivalently} \quad \sum_{i,j} p_i q_j e^{\lambda S_{ij}} = 1, \tag{9} \]
    and it can be interpreted as a scaling constant for the score matrix used. Here h(x) is the probability mass function of the score values M, i.e.
    \[ \Pr(M = x) \equiv h(x) = \sum_{i,j:\, S_{ij} = x} p_i q_j. \tag{10} \]
    The constant K has a much more complicated analytical formula, which I will not give here.
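
Since λ is defined only implicitly, it has to be found numerically. The sketch below builds h(x) from letter frequencies and a score matrix and then locates the positive root by bisection. Bisection is one possible choice of root finder, not necessarily what the cited papers use, and the function names are hypothetical.

```python
import math

def score_distribution(p, q, S):
    """h(x): total probability p_i*q_j over letter pairs with S[i][j] == x."""
    h = {}
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            h[S[i][j]] = h.get(S[i][j], 0.0) + pi * qj
    return h

def solve_lambda(h, tol=1e-12):
    """Positive root of sum_x h(x)*exp(lam*x) = 1, found by bisection.

    Assumes a negative expected score and at least one positive score,
    so that a unique positive root exists.
    """
    def phi(lam):
        return sum(prob * math.exp(lam * x) for x, prob in h.items()) - 1.0

    lo, hi = 0.0, 1.0
    while phi(hi) < 0:         # grow the bracket until phi changes sign
        lo, hi = hi, 2 * hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical 4-letter alphabet, uniform frequencies, match +1 / mismatch -1.
p = q = [0.25] * 4
S = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
print(solve_lambda(score_distribution(p, q, S)))
```

For this toy scheme the equation reduces to 0.25·e^λ + 0.75·e^(−λ) = 1, whose positive solution is λ = log 3.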

  • Ungapped alignment theory: Summing up
    • This theory works well for ungapped alignments, provided that n, m and the threshold t are all large (along with further assumptions).
    • In modern applications, however, gapped alignments are much more common.
    • This has partly been addressed by estimating the parameters of an extreme value distribution for different scoring schemes and gap penalties.
    • Small errors in λ and K can result in large errors in the p-value.
    • A better statistical theory for gapped alignments has long been needed.

  • Gapped alignments
    Gaps are normally modelled in scoring schemes by an affine gap penalty, i.e. a gap of length n costs
    \[ g(n) = A + Bn, \tag{11} \]
    where A and B are the gap-opening and gap-extension parameters.

  • Gapped alignments: Maximum Likelihood estimation
    As shown by Mott (1992), it is possible to find parameters for a general extreme value distribution
    \[ \Pr(Z_{\max} > t) \sim 1 - \exp\left(-e^{-(t - T)/S}\right), \tag{12} \]
    such that it can be used to model gapped alignments to a certain extent. Here S is a scale parameter and T is a translation parameter. The best results were found by choosing
    \[ S = s_1 \lambda \quad \text{and} \quad T = t_0 + \frac{t_1 + t_2 \log(mn)}{\lambda}, \tag{13} \]
    where maximum likelihood estimation is done on the parameter set (t_0, t_1, t_2, s_1).

  • Gapped alignments: Greedy Extension Model (GEM)
    • To understand the ideas behind the formula for gapped alignments given in Mott (2000), it is important to understand GEM.
    Using the same notation as before, let us define the unconstrained maximum score as
    \[ Z'_{ij} = \max_{k \ge 0} Y_{ijk}. \tag{14} \]
    With the affine gap function g(n) defined earlier, GEM works in 5 steps as follows.

  • Gapped alignments: Greedy Extension Model (GEM)
    1. Consider all the HSPs in the dot matrix M, i.e. those coordinates for which Z_ij > t, where t is some appropriate threshold. Let Z_ij be the maximum sum for one such HSP, starting at (i, j) and ending at C_ij = (i + k − 1, j + k − 1).
    2. (Greedy step) From (i + k, j + k), scan the neighbouring diagonals to find the best place to extend the alignment. This place is the one that attains the maximum U_1, where
       \[ U_1 = \max_{l > 0} \left( Z'_{i+k,\,j+k+l} - g(l),\; Z'_{i+k+l,\,j+k} - g(l) \right). \tag{15} \]

  • Gapped alignments: Greedy Extension Model (GEM)
    3. Repeat the greedy step to obtain U_1, U_2, . . . and generate the U-walk U_1, U_1 + U_2, . . .
    4. Let T(Z) denote the first time the U-walk drops below −Z, i.e. the first time the total score Z + U_1 + · · · becomes negative. Now define a W score
       \[ W(Z) = Z + \max\left(0,\, U_1,\, U_1 + U_2,\, \ldots,\, U_1 + \cdots + U_{T(Z)}\right). \tag{16} \]
    5. Finally, define W_GEM as the maximum of all the W scores, i.e.
       \[ W_{\mathrm{GEM}} = \max_{i,j} W(Z_{ij}). \tag{17} \]
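
Steps 4 and 5 can be illustrated once the greedy increments U_1, U_2, . . . are known. The sketch below takes those increments as given (computing them would require the scan over neighbouring diagonals in step 2) and assumes the walk is stopped as soon as the total score Z + U_1 + · · · turns negative. The function names and the input format are hypothetical.

```python
from itertools import accumulate

def w_score(z, u_steps):
    """W(Z) for an HSP score z and greedy increments U_1, U_2, ...

    The U-walk is followed until the total z + U_1 + ... + U_r first
    becomes negative (assumed stopping rule); the best partial sum seen,
    floored at 0, is added to z.
    """
    best = 0
    for partial in accumulate(u_steps):
        best = max(best, partial)
        if z + partial < 0:
            break
    return z + best

def w_gem(hsps):
    """W_GEM: the maximum W score over all HSPs.

    `hsps` is a hypothetical list of (z, u_steps) pairs.
    """
    return max(w_score(z, steps) for z, steps in hsps)

# The large final increment is never reached: the walk has already stopped.
print(w_score(10, [3, -5, 4, -20, 100]))
```

The floor at 0 means an extension is only accepted if it improves on the original HSP score, so W(Z) ≥ Z always holds.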

  • Gapped alignments: Greedy Extension Model (GEM)
    • The GEM algorithm often yields an optimal local alignment, with score W_GEM.
    • The algorithm is, however, only a heuristic and cannot be guaranteed to find the optimal local alignment.
    • Large gap penalties result in good alignments.
    • GEM is interesting because probabilistic asymptotic results exist for it, i.e. for large n, m, t and a gap cost function g(n) we can estimate Pr(W_GEM > t).

  • Gapped alignments: Statistical theory
    • Assuming that the GEM alignment is close to the optimal local gapped alignment (e.g. the alignment found by the Smith-Waterman algorithm), we can estimate the probability of an optimal local gapped alignment score.
    • The probability distribution of the GEM score can be captured by changing only the values of the constants K and λ.
    • These new K_g and λ_g values are related to the old K_u and λ_u values. (From now on g refers to the gapped model and u to the ungapped model.)

  • Gapped alignments: Statistical theory
    In order to write out formulas for K_g and λ_g, let us define a parameter α,
    \[ \alpha = 2s \sum_{k > 0} e^{-\lambda_u g(k)}, \tag{18} \]
    where s and the entropy H are
    \[ s = \frac{\delta K_u}{H \left(e^{\lambda_u \delta} - 1\right)}, \qquad H = \sum_{x} x\, h(x)\, e^{\lambda_u x}. \]
    Here δ is the smallest span of score values.
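
For the affine penalty g(k) = A + Bk, the sum in equation (18) is a geometric series, so α has a closed form. The sketch below uses that closed form and the definitions of s and H exactly as stated on this slide; the function names are hypothetical, and h is assumed to be given as a dictionary mapping each score to its probability.

```python
import math

def gap_sum(lam_u, A, B):
    """Closed form of sum_{k>0} exp(-lam_u * (A + B*k)).

    A geometric series with ratio exp(-lam_u * B); requires lam_u * B > 0.
    """
    r = math.exp(-lam_u * B)
    return math.exp(-lam_u * A) * r / (1.0 - r)

def entropy_H(h, lam_u):
    """H = sum_x x * h(x) * exp(lam_u * x), with h as {score: probability}."""
    return sum(x * prob * math.exp(lam_u * x) for x, prob in h.items())

def alpha(lam_u, K_u, H, delta, A, B):
    """alpha = 2*s*gap_sum, with s = delta*K_u / (H * (exp(lam_u*delta) - 1))."""
    s = delta * K_u / (H * math.expm1(lam_u * delta))
    return 2.0 * s * gap_sum(lam_u, A, B)
```

Raising either A or B shrinks the geometric sum and therefore α, in line with α tending to 0 as gaps become prohibitively expensive.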

  • Gapped alignments: Statistical theory
    Under the GEM model, K_g can be approximated as
    \[ K_g = K_u \kappa(\alpha) \tag{19} \]
    and λ_g as
    \[ \lambda_g = \lambda_u \theta(\alpha). \tag{20} \]
    Here α depends on the gap function; it is 0 only if no gaps are allowed, and it grows as the gap penalties are lowered.
    • In practice the critical value of α lies in 0.3 < α_crit < 0.4.
    • Under normal conditions in BLAST and FASTA, α < 0.2.

  • Gapped alignments: The θ and κ functions
    • In the Mott (2000) paper the functions κ and θ are approximated as linear functions of α, whereas in Mott and Tribe (1999) analytically more complicated upper and lower bounds are given for κ and θ.
    • When θ is estimated by the ML method, the estimate θ̂ is approximately linear as a function of f(n, m), where
      \[ f(n, m) = \log(mn) \left( \frac{1}{n} + \frac{1}{m} \right). \tag{21} \]

  • Gapped alignments: The θ and κ functions
    [Figure: panels A and B, one scoring scheme each]

  • Gapped alignments: The θ and κ functions
    The scoring schemes in panels A and B of the previous figure were chosen to represent extreme behaviour of θ. From this we can expect a straight-line fit, i.e.
    \[ \theta = \theta_{\mathrm{inf}} + \beta f(n, m), \tag{22} \]
    where θ_inf is the value of θ in the limit n, m → ∞, in which f(n, m) → 0. Furthermore, β̂ was found to be approximately linear in α and 1/H, and θ̂_inf to be approximately linear in α.
    A similar analysis can be applied to κ as well.

  • Gapped alignments: The θ and κ functions
    [Figure]

  • Gapped alignments: The θ and κ functions
    [Figure]

  • Gapped alignments: The θ and κ functions
    In all, the best-fitting equation for θ gave
    \[ \lambda_g = \lambda_u \left( 1.013 - 2.61\alpha + f(n, m) \left( -0.76 + 9.34\alpha + \frac{1.12}{H} \right) \right) \tag{23} \]
    and similarly, by analysing log κ, we get
    \[ K_g = K_u \exp\left( 0.26 - 18.92\alpha + f(n, m) \left( -1.76 + 32.69\alpha + 192.52\alpha^2 + \frac{3.24}{H} \right) \right). \tag{24} \]
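
Equations (21), (23) and (24) are straightforward to evaluate. The sketch below transcribes the fitted coefficients from this slide and plugs the resulting λ_g and K_g into the extreme value form used for ungapped alignments. The function names are hypothetical, and the values of λ_u, K_u, α and H must be supplied by the caller.

```python
import math

def f(n, m):
    """Finite-length correction term, equation (21)."""
    return math.log(m * n) * (1.0 / n + 1.0 / m)

def lambda_gapped(lam_u, alpha, H, n, m):
    """Equation (23): lambda_g = lambda_u * theta."""
    theta = 1.013 - 2.61 * alpha + f(n, m) * (-0.76 + 9.34 * alpha + 1.12 / H)
    return lam_u * theta

def K_gapped(K_u, alpha, H, n, m):
    """Equation (24): K_g = K_u * kappa, with log(kappa) fitted directly."""
    log_kappa = (0.26 - 18.92 * alpha
                 + f(n, m) * (-1.76 + 32.69 * alpha
                              + 192.52 * alpha ** 2 + 3.24 / H))
    return K_u * math.exp(log_kappa)

def gapped_pvalue(t, n, m, lam_u, K_u, alpha, H):
    """Extreme value p-value with K_g and lambda_g substituted for K and lambda."""
    lam_g = lambda_gapped(lam_u, alpha, H, n, m)
    K_g = K_gapped(K_u, alpha, H, n, m)
    return -math.expm1(-K_g * m * n * math.exp(-lam_g * t))

# Hypothetical inputs, for illustration only.
print(gapped_pvalue(t=60, n=300, m=300, lam_u=0.3, K_u=0.1, alpha=0.1, H=0.5))
```

Note that the constant and α terms of (23) play the role of θ_inf, and the bracket multiplying f(n, m) plays the role of β in the straight-line fit.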

  • Gapped alignments: The θ and κ functions
    [Figure]

  • Application to real data
    • To test the formula, its predictions were compared with those of the SSEARCH program.
    • SSEARCH is a sensitive method which fits an extreme value distribution to the scores.
    • The PDB40D_J dataset, which comprises 935 sequences of known structure from the SCOP database, was used.
    • A two-step procedure was used: the K_g and λ_g values were first used to filter out the worst-scoring alignments (p_0 > 0.001); afterwards a more sensitive method, which took sequence composition into account, was used to calculate a new p-value p_1.
    • In general p_1 > p_0.

  • Application to real data

    • The formula outperformed SSEARCH, as well asFASTA ktup = 1 and WU-BLAST.


  • Comparison to profile methods
    • A profile is a position-dependent score matrix S_i(a), together with a position-independent gap function.
    • The density function for a score x is
      \[ h(x) = \sum_{i,a:\, S_i(a) = x} \frac{p_a}{L}, \tag{25} \]
      where p_a is the probability of a occurring at a given position and L is the length of the profile. Using this formula, one can derive a λ_u by plugging it into equation (9).
    • Since L is normally small, we can approximate p-values using equations (23) and (24).
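
Equation (25) amounts to pooling the scores of all profile columns, with each column weighted by 1/L. A minimal sketch follows, assuming the profile is stored as a list of per-position dictionaries; this representation and the function name are hypothetical.

```python
def profile_score_distribution(profile, p):
    """h(x) for a profile: sum of p_a / L over all (i, a) with S_i(a) == x.

    `profile` is a list of L dicts mapping residue -> score S_i(a);
    `p` maps residue -> background probability p_a.
    """
    L = len(profile)
    h = {}
    for column in profile:
        for residue, score in column.items():
            h[score] = h.get(score, 0.0) + p[residue] / L
    return h

# Toy two-letter alphabet and a profile of length 2, for illustration only.
background = {"A": 0.5, "B": 0.5}
toy_profile = [{"A": 2, "B": -1}, {"A": -1, "B": 2}]
print(profile_score_distribution(toy_profile, background))
```

The resulting h(x) sums to 1 whenever the background probabilities do, so it can be used in equation (9) in the same way as the pairwise score distribution.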

  • Comparison to profile methods
    • To test whether the formula could be used to assess the statistical significance of profile methods, 1364 profiles were constructed by running PSI-BLAST using PFAM-A domains as seeds.
    • For each profile, 10000 comparisons were made between a shuffled version of the profile and a randomly generated sequence (350 aa).
    • An extreme value distribution was fitted to the data and compared with the values given by the formula.
    • Triangles denote short profiles (L < 75).

  • Comparison to profile methods
    [Figure]

  • Concluding remarks: The good stuff
    • Gapped alignment behaviour is well characterized by a single parameter α, given ungapped parameters such as λ_u, K_u and H.
    • The formula is accurate for practical applications, provided α < 0.2.
    • Results from applying the formula to the PDB40D_J database indicated that the formula works well.
    • Improvements in specificity (a good statistical model) can ultimately lead to improvements in sensitivity via better profile generation.

  • Concluding remarks: The not so good stuff
    • Extremes in composition were not captured completely by the model.
    • The computational cost of evaluating λ_g and K_g is considerable when new values are calculated for each alignment.
    • Real protein sequences have complex dependencies which are not captured by a random sequence model.

  • The end
    Thanks and Merry Christmas
