
Upload: keiji-takamoto

Post on 07-Aug-2015


Page 1: The derivation of ungapped global protein alignment score distributions - Part1

The derivation of ungapped global protein alignment score distributions

Abstract

For gapless global protein sequence alignments of N residues, the score distribution can be expressed as the sum of the probabilities of all multinomial outcomes that share the same score. The score distribution itself depends on sequence length, substitution matrix and amino acid frequencies.

Introduction

The distribution of sequence alignment scores is an important determinant of the e-values of sequence alignments. Given the importance of expected-value calculations for protein alignment scores in biomedical research, understanding the exact nature of score distributions has major consequences. Many studies have sought the function that best fits observed score distributions (refs). Although there is a weak consensus that an extreme value distribution performs best, no study has derived the score distribution from a theoretical examination of how alignment scores arise. In this study, a formula for the probability distribution of alignment scores is derived from a theoretical viewpoint, and the theoretically predicted score distributions are then compared with the observed score distributions of randomly generated sequences. The predictions are also compared with the score distributions of naturally occurring sequences, the best-fit function is presented, and finally an approach to generalization is discussed.

The methods

An examination of alignment score as multinomial event

Aligned protein sequences are paired sequences with matches or mismatches assigned between the two sequences. In terms of score calculation (not in a biological context), the amino acids within each sequence have no correlation with or dependency on neighboring residues, and are therefore completely independent in an ungapped sequence alignment. Based on this, the pair-wise alignment score of sequences with k matches in N residues was modeled as a multinomial event: k successes among N independent trials, each with multiple possible outcomes. A match with a particular score (the matching score of each residue type) was considered an outcome (success), and a mismatch was also taken as an outcome (failure) within the N trials. Assume there are m residue types for a pair of sequences composed of N residues. The sequences are aligned with x1 matches of residue type r1, which has occurrence probability p1 and score s1 per match, through xm matches of residue type rm, which has occurrence probability pm and score sm per match, with mismatches taking the residual probability (pf) of all matches, 1 - Σ{pi: i=1~m}. The probability P of this multinomial event (this combination of residue types) and its score Sc are expressed as

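$$P = \frac{N!}{x_1!\,x_2!\cdots x_m!\,(N-k)!}\;p_1^{x_1}\,p_2^{x_2}\cdots p_m^{x_m}\,p_f^{N-k} = \frac{N!}{(N-k)!\,\prod_{i}^{m} x_i!}\;p_f^{N-k}\prod_{i}^{m} p_i^{x_i},\qquad p_f = 1-\sum_{i}^{m} p_i^{2} \tag{1}$$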


$$S_c = x_1 s_1 + x_2 s_2 + \cdots + x_m s_m = \sum_{i}^{m} x_i s_i \tag{2}$$

For a particular score Si, the total probability is the sum of the probabilities of all combinations that are non-negative integer lattice points on the hyperplane Si = x1s1 + x2s2 + … + xmsm.
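Equations 1 and 2 can be checked numerically for a single combination of matches. The following is a minimal Python sketch, not the author's in-house code. It assumes that the match probability of residue type i is the square of its frequency, so that the mismatch probability is one minus the sum of the squared frequencies; the function and argument names are illustrative.

```python
from math import factorial, prod

def composition_probability(n, counts, freqs):
    """Probability of one multinomial outcome (equation 1).

    n      -- alignment length N
    counts -- x_i, the number of matches of each residue type
    freqs  -- residue frequencies; assumption: match probability p_i = freq_i ** 2
    """
    k = sum(counts)                       # total number of matches
    if k > n:
        raise ValueError("more matches than aligned positions")
    p_match = [f * f for f in freqs]
    p_fail = 1.0 - sum(p_match)           # residual mismatch probability p_f
    # Multinomial coefficient N! / (x_1! ... x_m! (N - k)!), always an integer.
    coeff = factorial(n) // (prod(factorial(x) for x in counts) * factorial(n - k))
    return coeff * prod(p ** x for p, x in zip(p_match, counts)) * p_fail ** (n - k)

def composition_score(counts, scores):
    """Score of the same outcome (equation 2): Sc = sum of x_i * s_i."""
    return sum(x * s for x, s in zip(counts, scores))

# Toy case: two residue types with frequency 0.5 each, two aligned positions,
# one match of the first type and one mismatch.
example_p = composition_probability(2, (1, 0), (0.5, 0.5))
example_score = composition_score((1, 0), (4, 6))
```

Summing `composition_probability` over every admissible tuple of counts for a fixed N should give 1, which is a convenient sanity check on the coefficient.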

The prediction of alignment score distributions

The matching probability was defined as the probability that a given amino acid is observed as a “match” in the alignment result. Each amino acid has its own scores for match/mismatch events in the substitution matrix (such as BLOSUM62; refs). Thus, in the self-match-only case (success/failure model), the matching probability was (1/20)^2 for synthetic sequences with even frequencies, and Fobs^2 for an amino acid observed with frequency Fobs (amino acid frequency) in naturally occurring sequences. As the substitution matrix defines unique matching/similarity scores or mismatch penalties for pairs of residues, it was necessary to calculate the probabilities of all score types (score or penalty; refs). The probability of occurrence F(i,j) of a pair of amino acids i, j was defined as F(i,j) = Fobs(i)·Fobs(j). The score types St were defined as the score values found in the substitution matrix (in this study, -4~9 and 11 from BLOSUM62), and Pt(k) for score type k was the sum of the probabilities of all matrix elements with score St(k). In this case, there is no success or failure. The calculation on the BLOSUM62 matrix gave St={-4,-3,-2,-1,0,1,2,3,4,5,6,7,8,9,11} and Pt(k)={0.0415, 0.1806, 0.2335, 0.2275, 0.1429, 0.0666, 0.0373, 0.0104, 0.0287, 0.0160, 0.0111, 0.0031, 0.0005, 0.0002, 0.0001}, based on amino acid frequencies from UniProt (2013). It should be noted that this occurrence probability distribution of score types depends on the amino acid frequencies and affects the outcome. The combinations of residue-type pairs and their occurrence probabilities can be replaced by these score types and their occurrence probabilities. Thus, rather than dealing with 210 combinations of residue types, only 15 (BLOSUM62) to 20 (PAM250) score types need to be considered.
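The collapse of residue pairs into score types can be illustrated with a short Python sketch. The three-letter alphabet, its frequencies and its matrix values below are hypothetical and serve only to keep the example small; they are not taken from BLOSUM62 or UniProt.

```python
from collections import defaultdict

def score_type_probabilities(matrix, freqs):
    """Collapse residue pairs into score types.

    matrix -- dict mapping (a, b) residue pairs to a substitution score
    freqs  -- dict mapping each residue to its frequency (should sum to 1)
    Returns {score: probability}, using F(i, j) = freqs[i] * freqs[j].
    """
    totals = defaultdict(float)
    for a, fa in freqs.items():
        for b, fb in freqs.items():
            totals[matrix[(a, b)]] += fa * fb
    return dict(sorted(totals.items()))

# Toy three-letter alphabet (hypothetical values, for illustration only).
toy_freqs = {"A": 0.5, "L": 0.3, "W": 0.2}
toy_matrix = {
    ("A", "A"): 4, ("A", "L"): -1, ("A", "W"): -3,
    ("L", "A"): -1, ("L", "L"): 4, ("L", "W"): -2,
    ("W", "A"): -3, ("W", "L"): -2, ("W", "W"): 11,
}
toy_pt = score_type_probabilities(toy_matrix, toy_freqs)
```

In this toy case nine ordered residue pairs reduce to five score types, in the same way that the full matrix reduces to the 15 score types listed in the text.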

For pairs of aligned sequences, the score distribution was given by the following equations with t score types. The occurrence probability p(N, xti, Pti: i=1~t) of an alignment of N residues that contains xti positions of score type Sti, each occurring with probability Pt(i), can be expressed as:

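$$p(N, x_{t_i}, P_{t_i} : i = 1 \sim t) = \frac{N!}{x_{t_1}!\,x_{t_2}!\cdots x_{t_t}!}\;p_{t_1}^{x_{t_1}}\,p_{t_2}^{x_{t_2}}\cdots p_{t_t}^{x_{t_t}},\qquad \sum_{i}^{t} x_{t_i} = N \tag{3}$$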

$$S_c = x_{t_1}S_{t_1} + x_{t_2}S_{t_2} + \cdots + x_{t_t}S_{t_t} \tag{4}$$

By enumerating all combinations with score Sc and summing their probabilities, the alignment score distribution for a particular parameter set (amino acid frequencies and substitution matrix) can be calculated as Pc = Σ{p(score is Sc)}.

The examination of prediction by derived probability formula and scores with sequence alignments of randomly generated synthetic sequences

Protein sequences were generated randomly with (1) even frequencies for all 20 amino acids or (2) the frequencies observed in real sequences (ref: UniProt 2013), in order to address the obvious effect of sequence length and the hypothesis that amino acid frequency is a potential factor in the shape of the distribution. Synthetic sequences were generated with varied lengths (9/18/36/54/72/90/120/150/180/210/240/270) in order to obtain better compliance of the amino acid frequency statistics with the UniProt statistics. Natural sequence fragments were generated from the CATH 3.4 domain database (ref: CATH), fragmented into the given lengths without overlap in order to avoid the effect of overlapping sequences. A total of 5000 sequences were randomly selected from the fragments and aligned pair-wise in all combinations. The sequence alignments were performed with the Needleman-Wunsch algorithm, with modified/unmodified BLOSUM62 or synthetic substitution matrices, on both natural and synthetic random sequences. A further hypothesis was that the substitution matrix is a determining factor for the score distributions. The score distributions for each substitution matrix were calculated by binning the alignment scores. The following section describes the details of the calculations.
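The sampling side of this procedure can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the in-house code: sequences are scored position by position, which is what a gapless global alignment of two equal-length sequences amounts to, independent random pairs are drawn instead of the all-against-all comparison of 5000 sequences, and a simple identity scoring (1 for a match, 0 otherwise) stands in for a substitution matrix. All names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_sequence(length, rng, weights=None):
    """Draw residues independently; weights=None means even frequencies."""
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))

def ungapped_score(seq_a, seq_b, score_fn):
    """Sum the position-by-position scores of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped global alignment needs equal lengths")
    return sum(score_fn(a, b) for a, b in zip(seq_a, seq_b))

def identity_score(a, b):
    """Stand-in scoring scheme: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0

def score_histogram(length, n_pairs, score_fn, seed=0, weights=None):
    """Observed score distribution, binned with an interval width of 1."""
    rng = random.Random(seed)
    counts = Counter(
        ungapped_score(random_sequence(length, rng, weights),
                       random_sequence(length, rng, weights), score_fn)
        for _ in range(n_pairs)
    )
    return {s: c / n_pairs for s, c in sorted(counts.items())}

hist_9 = score_histogram(9, 20000, identity_score, seed=1)
```

Passing a list of 20 frequencies as `weights` would give sequences with non-uniform composition, and replacing `identity_score` with a matrix lookup would give the scores of a real substitution matrix.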

The scoring schemes were (1) a score of 1 for every match; (2) evenly distributed match scores of 4~8 (4 residue types each) for the 20 commonly observed amino acids, without mismatch penalties, to mimic the BLOSUM62 substitution matrix in a simplified manner; (3) a matrix derived from the diagonal components of the BLOSUM62 substitution matrix, with neither mismatch penalties nor scores for the rest of the matrix (match scores of 4/5/6/7/8/9/11 with 5/6/4/2/1/1/1 occurrences, respectively); and (4) BLOSUM62. Sequence alignments were performed by in-house code with all matrices, using either default (-10, -2) or very high (9999 for both, to make the alignment gapless) gap opening and extension penalties. The scores were binned with an interval width of 1 and assembled. The predictions of the score distribution were performed by in-house code that enumerated all combinations of occurrences of the multinomial events (all score types) under the given scoring scheme, and summed the probabilities of the combinations that gave the same score. The predicted probability distributions were then compared with the distributions calculated from the pair-wise alignment scores of synthetic/natural sequences. The occurrence probabilities for the prediction were taken from the statistics of the UniProt 2013 dataset or, for naturally occurring sequences, adjusted to the observed amino acid frequencies, as these varied from set to set.

The fitting of alignment score distributions

MATLAB code using “lsqnonlin” was used to fit the score distributions with various statistical distribution functions (binomial/Poisson/gamma/extreme value distributions). These functions were modified to allow a shift along the X-axis (score) and an adjustment of the peak width. The amplitude was also adjustable. R-squared values were calculated to examine the fits, along with visual inspection for differences in skewness and/or systematic biases in the fit residuals. The functions actually used to fit the data are listed below. The binomial and Poisson functions were extended from discrete to continuous forms by replacing factorials with gamma functions, because of technical issues with discrete fitting procedures.

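Binomial distribution

$$f_B(x) = A\cdot\frac{\Gamma(N+1)}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)\cdot\Gamma\!\left(N-\frac{x-x_0}{w}+1\right)}\;p^{\frac{x-x_0}{w}}\,(1-p)^{N-\frac{x-x_0}{w}}$$

Poisson distribution

$$f_p(x) = A\cdot\frac{\lambda^{\frac{x-x_0}{w}}}{\Gamma\!\left(\frac{x-x_0}{w}+1\right)}\,e^{-\lambda}$$

Gamma distribution

$$f_g(x) = A\cdot\frac{(x-x_0)^{k-1}}{\Gamma(k)\cdot\theta^{k}}\,e^{-\frac{x-x_0}{\theta}}$$

Extreme value distribution

$$f_e(x) = A\cdot e^{\left(\frac{x-x_0}{w}-e^{\frac{x-x_0}{w}}\right)}$$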
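For readers who want to evaluate these four model functions outside MATLAB, they can be written with the Python standard library as sketched below. This is not the author's fitting code and it performs no fitting. Evaluation in log space through `math.lgamma` and the choice to return 0 outside the support of each function are assumptions made here for numerical convenience.

```python
import math

def binomial_fit(x, amp, n, p, x0, w):
    """Continuous binomial with shift x0 and width w (factorials as gamma functions)."""
    z = (x - x0) / w
    if z < 0 or z > n:
        return 0.0                        # assumption: zero outside [0, N]
    log_coeff = math.lgamma(n + 1) - math.lgamma(z + 1) - math.lgamma(n - z + 1)
    return amp * math.exp(log_coeff + z * math.log(p) + (n - z) * math.log(1 - p))

def poisson_fit(x, amp, lam, x0, w):
    """Continuous Poisson with shift x0 and width w."""
    z = (x - x0) / w
    if z < 0:
        return 0.0                        # assumption: zero below the shift
    return amp * math.exp(z * math.log(lam) - lam - math.lgamma(z + 1))

def gamma_fit(x, amp, k, theta, x0):
    """Gamma density with shape k, scale theta and shift x0."""
    z = x - x0
    if z <= 0:
        return 0.0                        # assumption: zero at or below the shift
    return amp * math.exp((k - 1) * math.log(z) - z / theta
                          - math.lgamma(k) - k * math.log(theta))

def extreme_value_fit(x, amp, x0, w):
    """Extreme value form exp(z - exp(z)) with z = (x - x0) / w."""
    z = (x - x0) / w
    if z > 700:
        return 0.0                        # exp(z) would overflow; the value is 0 to machine precision
    return amp * math.exp(z - math.exp(z))
```

With `x0 = 0`, `w = 1` and unit amplitude, `binomial_fit` and `poisson_fit` reduce to the ordinary probability mass functions at integer `x`, which gives a simple way to check the implementation.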


The examination of parameter trajectories by sequence length changes

MATLAB code for simple linear and exponential fits was written and used to obtain the length-dependent probability formula for the ungapped global alignment score distributions. Score distributions were predicted for varied sequence lengths in order to examine the qualitative and quantitative trajectories of the parameters as the sequence length changes. The best-fit model, the extended gamma distribution, has 3 or 4 parameters; each parameter was fitted with a linear or exponential function of sequence length, depending on the shape of its curve, in order to examine whether it is predictable as a function of sequence length.

Results

The derivation of score probability density formula

Viewing the alignment result as a simple multinomial event, with a particular set of residue types as outcomes (score/match), led to the well-known formula of the multinomial distribution. This distribution by itself does not give a score distribution, because it gives the probability of occurrence of a “combination” of residue types. To calculate the distribution of scores, all conceivable combinations need to be enumerated, and the probabilities of all combinations with the same score need to be summed.
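The enumerate-and-sum procedure of equations 3 and 4 can be sketched in Python as follows. This is a minimal illustration rather than the in-house implementation: the recursive generator and all names are assumptions, while the score types and probabilities are the values quoted in the methods section.

```python
from collections import defaultdict
from math import factorial, prod

def compositions(n, parts):
    """Yield every tuple of `parts` non-negative integers that sums to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest

def score_distribution(n, scores, probs):
    """Sum multinomial probabilities (equation 3) over equal scores (equation 4)."""
    n_fact = factorial(n)
    dist = defaultdict(float)
    for counts in compositions(n, len(scores)):
        coeff = n_fact // prod(factorial(x) for x in counts)
        p = coeff * prod(q ** x for q, x in zip(probs, counts))
        dist[sum(x * s for x, s in zip(counts, scores))] += p
    return dict(sorted(dist.items()))

# Score types and probabilities quoted in the text (BLOSUM62, UniProt frequencies).
ST = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11]
PT = [0.0415, 0.1806, 0.2335, 0.2275, 0.1429, 0.0666, 0.0373, 0.0104,
      0.0287, 0.0160, 0.0111, 0.0031, 0.0005, 0.0002, 0.0001]

dist_3 = score_distribution(3, ST, PT)    # 680 combinations for N = 3
```

A very short length is used so that the example runs instantly; the same code becomes slow for the lengths discussed in the text because the number of combinations grows quickly with N.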

Regardless of how the sequence alignment was performed, a pair-wise alignment score can be interpreted as a simple multinomial event. A match with a particular score type is an outcome (see the methods section). Thus, the alignment can be modeled as a simple multinomial event with N trials (refs) and m outcomes (residue types). In the same sense, multiple residue types with the same score can be combined into a single entity, because the score is determined by the combination of score types and not by the order of the residues in the sequences. No matter how the sequences are varied or shuffled, as long as there are “matches of 3 Ala, 1 Asn, 5 Leu, 1 Tyr and 1 Val”, the score is the same. This can then be reduced to “9 matches of score 4, 1 match of score 6 and 1 match of score 7”. The formulas for the probability and the score of a given combination of residue types are given by equations 1 and 2. With the points above, the calculation was further reduced by converting residue types to score types, which transforms the formulas into equations 3 and 4, expressed in score types rather than residue types. As the formulas for the score and its probability of occurrence were derived theoretically, the score distribution can be obtained by simply enumerating all combinations of score types for a given sequence length and summing the probabilities of the combinations with the same score. Unfortunately, this calculation is extremely expensive, because the number of combinations grows steeply with sequence length, and the computation does not scale to longer sequences. There are 15 score types; therefore, the number of combinations is given by f(L) = (L+14)/L·f(L-1), where L is the sequence length. On a workstation with a 2.66 GHz Xeon, the computation for a 36-residue sequence took over 9 days in a single thread. Any further increase in sequence length significantly increases the load, and computation with reasonable resources within a reasonable time was out of reach.
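The size of the enumeration can be computed directly. The short Python sketch below evaluates the recurrence f(L) = (L+14)/L·f(L-1) and compares it with the closed form for combinations with repetition; the base case f(0) = 1 and the function names are assumptions made for this illustration.

```python
from math import comb

def combination_count(length, score_types=15):
    """Number of ways to distribute `length` positions over the score types."""
    return comb(length + score_types - 1, length)

def combination_count_recursive(length):
    """Same count from f(L) = (L + 14) / L * f(L - 1), with f(0) = 1 assumed."""
    count = 1
    for l in range(1, length + 1):
        # The product is always divisible by l, so integer division is exact.
        count = count * (l + 14) // l
    return count

counts_by_length = {n: combination_count(n) for n in (9, 18, 36)}
```

Comparing `counts_by_length` for lengths 9 and 36 shows how quickly the number of terms to be summed increases over the range of lengths that was computed exactly.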
For a particular score, the total probability is the sum of the probabilities of all combinations that are non-negative integer lattice points on the hyperplane expressed by equation 4. Unfortunately, the derivation or calculation of such points is beyond the scope of this study; the subject would be a study of its own for mathematicians. It does not appear to be an easy task, and may simply not be possible for a biomedical researcher such as myself.

The validation of predictions


The score distribution could be calculated from the formulas derived in the methods section (equations 1 and 2) by summing the probabilities of the different residue combinations that give the same score. Figure 1 shows the comparison of the predicted distributions with those calculated from the alignment scores of synthetic sequences. The case of evenly distributed match scores of 4~8 is shown in panel A. The two data sets matched almost completely, with R2 > 0.99996. As there is no penalty and the scores are between 4 and 8, the distribution is not bell-shaped but has a more complex shape, particularly for the shorter sequences. This result strongly indicates that a score distribution is not necessarily a bell-shaped curve that can be expressed by a simple statistical distribution function. In this case, for simplicity, the amino acid frequencies were the same (1/20) for all residue types and the substitution matrix was 0 for all non-diagonal elements; the scores [4,5,6,7,8] were each assigned to 4 diagonal positions. No scores of 1, 2 or 3 occur because no combination gives these values, and accordingly neither the prediction nor the actual alignment scores showed them. Scores 4~7 were equally probable, as each arises from a single combination; a score of 8 arises from a single match of score 8 or from two matches of score 4, and is thus more probable, and so on. As the sequence length increased the distributions became smoother, although some jaggedness remained even at 120 residues. The distributions for the synthetic substitution matrix built from the BLOSUM62 diagonal are shown in panel B. With more score types and differing occurrences (scores [4,5,6,7,8,9,11] with 5,6,4,2,1,1,1 occurrences), the distributions were slightly smoother but basically the same as for the 4~8 matrix. The R2 values ranged over 0.9987~0.9999. With the full BLOSUM62 matrix, predictions for longer sequences were computationally expensive, because the number of multinomial combinations, given by combinations with repetition (${}_{15}H_{N} = {}_{15+N-1}C_{N}$), and with it the execution time, increases steeply with N. Thus, only sequence lengths of 9, 18 and 36 were predicted and are shown. The results for ungapped alignments of synthetic sequences with uniform amino acid frequencies are shown in panel C. The match is almost perfect, with R2 = 0.9996~0.9999 for all lengths. As expected, the score distributions are length dependent, but not in a simple manner. With the full BLOSUM62 matrix there are 15 types of multinomial events (match scores of 4~9 and 11, similarity scores of 1~3 and mismatch penalties of -4~0), which completely smooth out the distributions. Unfortunately, more score types mean more nested loops in the calculation, which makes it very expensive to calculate the distributions of longer sequences. The calculation time increases by a factor of about e^(length/2.28), which makes the approach impractical (less than 1 s for a 9-residue sequence but over 9 days for 36 residues). Nevertheless, this examination confirms that the formula is correct and that the substitution matrix determines the shape of the distributions. Panel D shows the results for the same BLOSUM62 matrix with amino acid frequencies calculated from UniProt (release 2013_03) statistics. The predictions and the distributions of actual alignment scores of synthetic sequences match very well (R2 > 0.99996). An interesting point is that the score distributions are shifted toward higher scores. This may come from the difference in the overall probability of positive scores between even and observed amino acid frequencies (0.155 and 0.174, respectively). Thus, the hypothesis that amino acid frequency affects the shape of the score distribution is confirmed.

Figure 2 shows a small discrepancy between the predictions and the score distributions from alignments of naturally occurring sequences. This was expected, as the previous section showed that differences in amino acid frequencies change the score distributions (Figure 1, panels C/D). The discrepancy was assumed to be caused by small differences between the amino acid frequencies of the actual fragment datasets from the CATH 3.4 database and the frequencies adopted from the UniProt statistics. The amino acid frequencies of the actual sequences used for the alignments were calculated, and the predictions were re-calculated with these adjusted frequencies. There were 3.2~6.1% deviations in frequency per residue type (median values of the change in frequencies) between the CATH 3.4 derived sequence datasets and the UniProt values. After adjusting the frequencies used for the predictions, the R2 of the fits went up from 0.9982/0.9975/0.9970 to 0.9999/0.9999/0.9997, and the obvious systematic residuals were mostly eliminated. Thus, even relatively small differences in amino acid frequencies affect the distribution at an easily detectable level.
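The frequency adjustment can be illustrated with a small Python sketch. The text does not spell out how the deviation between two frequency sets was computed, so the definition used here (median over residue types of the absolute relative difference) is an assumption, as are the function names and the toy data.

```python
from collections import Counter
from statistics import median

def observed_frequencies(sequences, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Residue frequencies counted over a set of sequences."""
    counts = Counter(res for seq in sequences for res in seq if res in alphabet)
    total = sum(counts.values())
    return {aa: counts[aa] / total for aa in alphabet}

def median_relative_deviation(observed, reference):
    """Median over residue types of |observed - reference| / reference (assumed definition)."""
    return median(abs(observed[aa] - reference[aa]) / reference[aa]
                  for aa in reference)

# Toy data on a three-letter alphabet (hypothetical, for illustration only).
toy_observed = observed_frequencies(["AAL", "LW"], alphabet="ALW")
toy_reference = {"A": 0.5, "L": 0.4, "W": 0.1}
toy_deviation = median_relative_deviation(toy_observed, toy_reference)
```

The frequencies returned by `observed_frequencies` are what would replace the reference values when the prediction is re-calculated for a specific dataset.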

The fitting of score distributions

The score distributions were fitted with 4 families of statistical distribution functions: (1) binomial distribution, (2) Poisson distribution, (3) gamma distribution and (4) extreme value distribution. As the length of the sequences strongly affects the shape of the distributions, fittings were performed at 9/18/36/54 residues. As shown in Figure 1, the synthetic substitution matrices with neither mismatch penalties nor similarity scores do not give bell-shaped distributions but rather jagged ones, and it is futile to try to find a function that fits them. Thus, only the smoothed-out distributions of BLOSUM62, with and without gaps, were fitted with these functions. Figure 3 and Table 1 show the fitting results of the 4 functions for different sequence lengths and amino acid frequencies, using pair-wise alignment score distributions from synthetic sequences. Clearly, the gamma distribution was the best function for the BLOSUM62 results at all sequence lengths studied, with the highest R2 and the smallest systematic residuals. The Poisson and binomial distributions also performed well for longer sequences but not as well as gamma for the short ones. The extreme value distribution showed an obvious difference in skewness that could not be compensated by its parameters, and was the worst of the 4 functions. For the substitution matrix with 1 at the diagonal positions and 0 elsewhere (which represents a Bernoulli trial), the score distributions matched binomial distributions completely, as expected (data not shown). However, as can be seen in Figure 1A/B, the distributions for substitution matrices with multiple score types were not simple convolutions of binomial distributions. The binomial and Poisson distributions could explain the shape of the distributions, but required non-unity amplitude parameters in order to fit well. This does not sit well with the nature of probability density (mass) functions; from this point of view as well, the gamma distribution is the best distribution for ungapped alignment score distributions. These functions were also tested with naturally occurring sequences derived from the CATH domain dataset (Figure 4). The pair-wise global sequence alignments were computed both ungapped and gapped. The results were the same as for the synthetic sequence alignments. Despite the differences in amino acid frequencies that affected the distribution shapes, all models tested (even, UniProt and CATH derived amino acid frequencies) resulted in the same best-fit distribution (gamma distribution). Therefore, it is reasonable to conclude that the alignment score distributions of sequences aligned with the BLOSUM62 matrix follow a gamma distribution. Fitting the predicted distributions, which were deemed to be error free because the probabilities are calculated precisely from the exact analytical solution, with the gamma distribution resulted in R2 values of >0.99999 (length 12 or longer), or 0.999900 and 0.999979 for lengths of 6 and 9, respectively.
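The R2 values quoted throughout compare two distributions bin by bin. The text does not give the exact formula that was used, so the sketch below assumes the usual coefficient of determination, with the observed distribution as the reference; the function name is illustrative.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted):
        raise ValueError("both distributions must cover the same bins")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# A perfect prediction gives 1; predicting the mean everywhere gives 0.
perfect = r_squared([0.1, 0.6, 0.3], [0.1, 0.6, 0.3])
flat = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

Because R2 is dominated by the bins near the peak, the text also relies on inspection of the residuals to judge skewness and systematic bias.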

The derivation of formulas for parameter prediction as functions of sequence length, evaluation of predictability by extrapolated parameters

In the previous sections, it was shown that the multinomial event model can exactly predict the score distributions for different substitution matrices, lengths and amino acid frequencies. As already explained, however, the enumeration of combinations and the calculation of probabilities for scores are extremely expensive, because the number of combinations increases steeply with sequence length. Therefore, the predictive power of the model was used to generate a series of distributions as a function of sequence length, and these were fitted with the gamma distribution in order to predict its parameters. As shown in Figure 5, the parameters behaved well in terms of predictability in the short sequence range. The gamma distribution has two shape-determining parameters (k/θ); in this study, two more parameters were introduced, one to handle negative scores (x0: an x-shift, to avoid complex parameters or undefined gamma function values) and one for a potentially necessary amplitude adjustment. The gamma distribution did not require amplitude adjustment, as the “best fit” amplitudes for the varied sequence lengths were almost exactly 1.0. This is a good property for the distribution function, since the fit is performed on a probability density function, which should have an amplitude of exactly 1.0. Thus, in further fittings, the amplitude parameter was fixed at 1.0. The multinomial event model was used to predict the score distributions for sequences of different lengths. These distributions were then fitted with the gamma distribution, starting from estimated initial parameters. This was an important precaution: in a multivariate fit, the initial parameters are very important for convergence and for arriving at a “series” of fitting results that can be compared. If the initial parameters are not given consistently, the fits are not guaranteed to fall into comparable local minima, or may not converge at all. Figure 5 shows the results of the distribution prediction and fitting, followed by the fitting of the gamma distribution parameters (k/θ/x0) with the linear ( f(x)=a+bx ) or exponential function ( f(x)=a+b*power(x/c,d) ). The resultant fits were very good (>0.99999 for k and x0, and 0.99986 for θ). The parameters obtained from these distributions (sequence lengths of 6/9/12/15/18/21/24/27/30/33/36) were extrapolated to longer sequence lengths (54~270 residues), and distributions were calculated from the extrapolated parameters. These calculated distributions (for sequences too long to be predicted by the exact formula with reasonable time and resources) were then compared with actual sequence alignment results for synthetic sequences up to 270 residues long. Figure 6 shows the comparison of the actual sequence alignment score distributions with the distributions calculated from the extrapolated parameters. These pairs of distributions matched with R2 values of 0.99952~0.99998 for sequence lengths from 9 to 270. To evaluate the extrapolated parameters against parameters fitted to the alignment score distributions, the score distributions from the alignment results were fitted with gamma distributions using the extrapolated parameters as initial values. The two sets of parameters matched very well, with an average error of 0.15~0.38%.

Thus, if the amino acid frequencies of the database are known and the substitution matrix is given, the distribution of alignment scores can be calculated with good accuracy as a function of sequence length. This translates into a precise calculation of alignment e-values without exact combinatorial enumeration and probability calculation. Computationally, this is the biggest benefit: a regular workstation can calculate the e-value precisely in a very short time, because the gamma distribution parameters are given and the cumulative probability function is therefore known.
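How a tail probability follows from length-dependent gamma parameters can be sketched in Python. The trend-line coefficients below are one reading of the partly garbled annotations of Figure 5 (k = -3.557 + 2.310·L, θ = 1.378 + 0.398·(L/3.813)^(-1.199), x0 = 2.447 - 4.114·L) and should be treated as assumptions; they apply only to BLOSUM62 with UniProt frequencies. The tail probability is obtained by plain trapezoidal integration rather than a library cumulative function, and all names are illustrative.

```python
import math

def gamma_parameters(length):
    """Trend lines for k, theta and x0 (coefficients assumed from Figure 5)."""
    k = -3.557 + 2.310 * length
    theta = 1.378 + 0.398 * (length / 3.813) ** -1.199
    x0 = 2.447 - 4.114 * length
    if k <= 0:
        raise ValueError("length is too short for the assumed trend lines")
    return k, theta, x0

def shifted_gamma_pdf(x, k, theta, x0):
    """Gamma density with shape k, scale theta and shift x0 (amplitude fixed at 1)."""
    z = x - x0
    if z <= 0:
        return 0.0
    return math.exp((k - 1) * math.log(z) - z / theta
                    - math.lgamma(k) - k * math.log(theta))

def tail_probability(score, length, steps=20000):
    """P(S >= score) by trapezoidal integration of the shifted gamma density."""
    k, theta, x0 = gamma_parameters(length)
    lower = max(score, x0)
    # Integrate up to 12 standard deviations above the mean; the rest is negligible.
    upper = x0 + k * theta + 12.0 * math.sqrt(k) * theta
    if lower >= upper:
        return 0.0
    h = (upper - lower) / steps
    total = 0.5 * (shifted_gamma_pdf(lower, k, theta, x0)
                   + shifted_gamma_pdf(upper, k, theta, x0))
    for i in range(1, steps):
        total += shifted_gamma_pdf(lower + i * h, k, theta, x0)
    return total * h

p_tail = tail_probability(0.0, 36)        # chance of a score of 0 or more at length 36
```

The mean implied by these parameters, x0 + k·θ, can be compared with the sequence length times the expected score per position from the St and Pt values in the methods section, which is a useful consistency check on the reading of the figure.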


Table 1: The R2 values for the fitting of alignment score distributions of synthetic sequences by statistical distribution functions

even       9       18      36      54      120
Gamma      0.9999  0.9999  0.9998  0.9998  0.9996
Extreme    0.9974  0.9921  0.9872  0.9910  0.9746
Poisson    0.9992  0.9999  0.9998  0.9997  0.9997
Binomial   0.9991  0.9998  0.9998  0.9997  0.9997

A.A. freq  9       18      36      54      120
Gamma      0.9998  0.9999  0.9999  0.9998  0.9997
Extreme    0.9950  0.9892  0.9852  0.9821  0.9749
Poisson    0.9997  0.9999  0.9999  0.9997  0.9997
Binomial   0.9996  0.9999  0.9999  0.9998  0.9997

Length  Binomial  Poisson      Gamma   Extreme
6       0.9989    [illegible]  0.9999  [illegible]
9       0.9999    [illegible]  1.0000  [illegible]
12      1.0000    1.0000       1.0000  [illegible]
15      1.0000    1.0000       1.0000  [illegible]
18      1.0000    1.0000       1.0000  [illegible]
21      1.0000    1.0000       1.0000  [illegible]
24      1.0000    1.0000       1.0000  [illegible]
27      1.0000    1.0000       1.0000  [illegible]

Figure 1: The observed and predicted alignment score distributions. Panels: (A) 4~8 even; (B) BL62 diagonal; (C) BL62 even freq.; (D) BL62 UniProt freq. Axes: alignment score (x) against probability (y). Observed and predicted curves are shown for lengths L9, L18, L36, L54 and L120 in panels A and B, and for L9, L18 and L36 in panels C and D.

Figure 2: The effect of amino acid frequencies on the distribution. Curves: observed distribution, prediction (UniProt) and prediction (adjusted), with residuals, for L9, L18 and L36. Axes: alignment score (x) against probability (y).

Figure 3: The fitting of alignment score distributions of synthetic sequences with UniProt A.A. frequencies by 4 statistical distribution functions (binomial, Poisson, gamma, extreme). Axes: alignment score (x) against probability (y), one panel per sequence length. R2 values annotated on the panels (binomial/Poisson/gamma/extreme): 0.9996/0.9997/0.9998/0.9950; 0.9999/0.9999/0.9999/0.9892; 0.9998/0.9997/0.9998/0.9821; 0.9999/0.9999/0.9999/0.9852.

Figure 4: The fitting results on naturally occurring sequences. R2 (y) against sequence length (x; 9, 18, 36, 54) for the gamma, extreme, Poisson and binomial fits of ungapped and gapped alignments.

Figure 5: The calculation of distribution formula. Top: predicted score distributions (alignment score against probability) for sequence lengths 6, 9, 12, 15, 18, 21, 24, 27, 30, 33 and 36, fitted with the gamma distribution f(x) = A·(x-x0)^(k-1) / (Γ(k)·θ^k) · e^(-(x-x0)/θ). Bottom: fitted parameters k, θ and x0 against sequence length, with trend lines annotated as f(x) = -3.557 + 2.310·x (k), f(x) = 1.378 + 0.398·(x/3.813)^(-1.199) (θ) and f(x) = 2.447 - 4.114·x (x0).

Figure 6: Alignment-derived and parameter-fit predicted distributions. Observed and predicted score distributions (alignment score against probability) for synthetic sequences. R2 by sequence length: L=9, 0.99998; L=18, 0.99992; L=36, 0.99966; L=54, 0.99962; L=72, 0.99952; L=90, 0.99960; L=120, 0.99963; L=150, 0.99972; L=180, 0.99965; L=210, 0.99965; L=240, 0.99974; L=270, 0.99958.