
Running head: DNA NUCLEOTIDE SUBSTITUTION MODELS

On Some Measures of Genetic Distance Based on Rates of Nucleotide Substitution

Justine Leon A. Uro

Ph. D. Graduate Student

Department of Biostatistics, University of Michigan

Ann Arbor, MI

Abstract

We present a general DNA base-nucleotide substitution model and discuss three special

cases: three-substitution-type (3ST), two-substitution-type (2ST), and the Jukes-Cantor

models.

Introduction

The genetic distance between two populations is a concept related to the time since the two populations diverged from a common ancestral population (Weir, 1990). A number of methods have been proposed to estimate it. They are based on the allele frequencies in the two populations, on the rate of amino acid substitution in protein sequence data from the two populations, or on the rates of nucleotide substitution in DNA sequence data from the two populations.

Measures of genetic distance that utilize the allele frequencies are estimates based

on some geometric transformation of the allele frequencies (Cavalli-Sforza and Edwards,

1967; Cavalli-Sforza and Bodmer, 1971; Edwards, 1971; Nei, 1977, 1978; Li and Nei, 1977;

Smith, 1977). Some of these measures are purely geometric and do not involve any genetic

concept at all, e.g., the measure proposed by Cavalli-Sforza and Bodmer (Weir, 1990). On

the other hand, the ones proposed by Edwards (1971) and by Nei (1977) can be shown to

be related to the concept of the fixation index (Hartl and Clark, 1989).

A measure of genetic distance based on amino acid substitution in protein sequence data was proposed by Jukes and Cantor (1969). This method was motivated partly by the abundance of amino acid sequence data available then. Some geneticists argue that this measure should be preferred since proteins are the subject of mutations.

The development of DNA sequencing methods by Maxam and Gilbert (1977) and Sanger et al. (1977) brought about more methods for measuring genetic distance. The estimates from these methods are based on the rates of nucleotide substitution in DNA sequence data. These are the methods which we will consider in this paper. We will formulate the general model, examine some special cases, give some numerical examples, and finally, examine the validity of these models based on their assumptions.

The General Model

We now start by formulating the general model. Let S1 and S2 be two nucleotide

sequences with a common ancestral sequence. We consider a pair of homologous sites from

S1 and S2 and examine how much they have diverged from each other during their descent

from the ancestral sequence T years back (Figure 1).

The evolutionary base substitution model we are going to use is shown in Figure 2.

We have used RNA codes for the nucleotides so that the pyrimidines are uracil (U) and

cytosine (C), and the purines are adenine (A) and guanine (G). The types and rates of

base substitution are summarized in Table 1. A substitution of a purine by a purine or a

pyrimidine by a pyrimidine is called a transition (TS). If a pyrimidine is substituted by a

purine or vice-versa then the substitution is called a transversion (TV). We distinguish

between two types of transversion, TV1 and TV2, and each type is shown in Table 1. The

classification of the TV as to type becomes easier if we look at Figure 2. The TV which go

either vertically up or down are TV1 and those which go diagonally are TV2.
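The classification just described can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` is hypothetical, and accepting thymine (T) as a synonym for uracil is an assumption made so that DNA input also works.

```python
# Bases grouped as in the text: pyrimidines U, C and purines A, G.
PYRIMIDINES = {"U", "C"}
PURINES = {"A", "G"}

# Unordered base pairs for the two transversion types, following Table 1.
TV1_PAIRS = {frozenset("UA"), frozenset("CG")}
TV2_PAIRS = {frozenset("UG"), frozenset("CA")}


def classify_substitution(old: str, new: str) -> str:
    """Return 'TS', 'TV1', or 'TV2' for a substitution old -> new."""
    old, new = old.upper(), new.upper()
    # Assumption of this sketch: DNA input is accepted by mapping T to U.
    old = "U" if old == "T" else old
    new = "U" if new == "T" else new
    if old == new or {old, new} - (PYRIMIDINES | PURINES):
        raise ValueError("expected two different bases from U, C, A, G")
    pair = frozenset((old, new))
    # A transition stays within the pyrimidines or within the purines.
    if pair <= PYRIMIDINES or pair <= PURINES:
        return "TS"
    return "TV1" if pair in TV1_PAIRS else "TV2"
```

Because the type depends only on the unordered pair of bases, the same function also labels the TS-type, TV1-type, and TV2-type mismatches between two homologous sites.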

When comparing the homologous sites of S1 and S2 at any time t > 0, there are 16

possible nucleotide base pairings, 12 of which involve mismatched base pairs. If the

mismatch looks like a transition pair in Table 1, we call the mismatch a TS-type

mismatch. We have a TV1-type mismatch if the mismatch looks like a Type 1 transversion

listed in Table 1. The TV2-type mismatch is defined in the same manner. We summarize

these in Table 2. In Table 2, for t > 0,

\begin{align}
S(t) &= \sum_{i=1}^{4} S_i(t) = \text{probability of no difference at a site} \tag{1} \\
P(t) &= \sum_{i=1}^{4} P_i(t) = \text{probability of a TS-type difference at a site} \tag{2} \\
Q(t) &= \sum_{i=1}^{4} Q_i(t) = \text{probability of a TV1-type difference at a site} \tag{3} \\
R(t) &= \sum_{i=1}^{4} R_i(t) = \text{probability of a TV2-type difference at a site} \tag{4}
\end{align}

Hence,

\[
Q(t) + R(t) = \sum_{i=1}^{4} \bigl(Q_i(t) + R_i(t)\bigr) = \text{probability of a TV-type difference at a site.} \tag{5}
\]

We sometimes refer to the probabilities above as the match probabilities.
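In practice the match probabilities are estimated by the observed proportions of each mismatch type between two aligned sequences. The following is a minimal sketch of that counting step; the function name `mismatch_proportions` and the handling of T as U are assumptions of the example, not part of the original text.

```python
def mismatch_proportions(seq1: str, seq2: str) -> dict:
    """Observed proportions of sites with no (S), TS-type (P),
    TV1-type (Q), and TV2-type (R) difference."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned, non-empty, and of equal length")
    ts_pairs = {frozenset("UC"), frozenset("AG")}
    tv1_pairs = {frozenset("UA"), frozenset("CG")}
    counts = {"S": 0, "P": 0, "Q": 0, "R": 0}
    # Assumption of this sketch: thymine is treated as uracil.
    s1 = seq1.upper().replace("T", "U")
    s2 = seq2.upper().replace("T", "U")
    for a, b in zip(s1, s2):
        if a not in "UCAG" or b not in "UCAG":
            raise ValueError("only the bases U (or T), C, A, G are supported")
        pair = frozenset((a, b))
        if a == b:
            counts["S"] += 1
        elif pair in ts_pairs:
            counts["P"] += 1
        elif pair in tv1_pairs:
            counts["Q"] += 1
        else:
            counts["R"] += 1  # the remaining pairs U/G and C/A are TV2-type
    n = len(s1)
    return {key: count / n for key, count in counts.items()}
```

The four returned proportions always sum to 1, mirroring S(t) + P(t) + Q(t) + R(t) = 1.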

We also define the following probabilities which we sometimes refer to as the base

probabilities.

\begin{align}
U(t) &= \text{relative frequency of uracil,} \tag{6} \\
C(t) &= \text{relative frequency of cytosine,} \tag{7} \\
A(t) &= \text{relative frequency of adenine,} \tag{8} \\
G(t) &= \text{relative frequency of guanine in a strand} \tag{9}
\end{align}

so that

\[
U(t) + C(t) + A(t) + G(t) = 1. \tag{10}
\]

Note that the probabilities in (1) – (4) and (6) – (9) are all time-dependent. We also have the following relations:

\begin{align}
S(t) &= U^2(t) + C^2(t) + A^2(t) + G^2(t) \tag{11} \\
P(t) &= 2U(t)C(t) + 2A(t)G(t) \tag{12} \\
Q(t) &= 2U(t)A(t) + 2C(t)G(t) \tag{13} \\
R(t) &= 2U(t)G(t) + 2C(t)A(t) \tag{14}
\end{align}
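Relations (11) – (14) can be checked numerically. The sketch below (the function name `match_probabilities` is hypothetical) evaluates them for given base probabilities; since the four expressions add up to (U + C + A + G)², they must sum to 1.

```python
def match_probabilities(u: float, c: float, a: float, g: float):
    """Evaluate relations (11)-(14) for base probabilities u, c, a, g."""
    if abs(u + c + a + g - 1.0) > 1e-9:
        raise ValueError("base probabilities must sum to 1")
    s = u * u + c * c + a * a + g * g   # (11) no difference
    p = 2 * u * c + 2 * a * g           # (12) TS-type difference
    q = 2 * u * a + 2 * c * g           # (13) TV1-type difference
    r = 2 * u * g + 2 * c * a           # (14) TV2-type difference
    return s, p, q, r
```

For example, with equal base probabilities of 1/4 each, every one of the four match probabilities equals 1/4.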

Using the rates of substitution and the base probabilities, the mean rate of substitution at a specific site over the time interval (0, T] is given by

\[
k = \sum_{i=1}^{4} \left( \frac{\alpha_i + \beta_i + \gamma_i}{T} \int_0^T B_i(t)\, dt \right) \tag{15}
\]

where B1(t) = U(t), B2(t) = C(t), B3(t) = A(t), and B4(t) = G(t), and each integral divided by T is the average probability of finding the given base at the given site during the time interval (0, T].

A measure of genetic distance is therefore given by

\[
K = 2Tk \tag{16}
\]

where k is as defined in (15), T is the time since the two sequences started diverging from

the ancestral sequence and the factor of 2 is due to the fact that we are considering two

branches that diverged.

We now derive the differential equations of the general model, proceeding in a manner similar to that of Takahata and Kimura (1981). At any time t ∈ [0, T], consider a time interval ∆t short enough that, if the mutation rate is small, higher-order terms in ∆t and the occurrence of a double substitution at a specific site may be neglected. We have

\[
U(t + \Delta t) = U(t) - \alpha_1 \Delta t\, U(t) + \alpha_2 \Delta t\, C(t) + \beta_2 \Delta t\, A(t) + \gamma_2 \Delta t\, G(t) - \gamma_1 \Delta t\, U(t) - \beta_1 \Delta t\, U(t) \tag{17}
\]

which we can rewrite as

\[
\frac{U(t + \Delta t) - U(t)}{\Delta t} = -(\alpha_1 + \beta_1 + \gamma_1)\, U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{18}
\]

Taking the limit as ∆t approaches zero, (18) gives

\[
\frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1)\, U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{19}
\]

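Doing this for the other three probabilities we get the following system of differential equations:

\begin{align}
\frac{dU(t)}{dt} &= -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t) \tag{20} \\
\frac{dC(t)}{dt} &= \alpha_1 U(t) - (\alpha_2 + \beta_3 + \gamma_3) C(t) + \gamma_4 A(t) + \beta_4 G(t) \tag{21} \\
\frac{dA(t)}{dt} &= \beta_1 U(t) + \gamma_3 C(t) - (\alpha_3 + \beta_2 + \gamma_4) A(t) + \alpha_4 G(t) \tag{22} \\
\frac{dG(t)}{dt} &= \gamma_1 U(t) + \beta_3 C(t) + \alpha_3 A(t) - (\alpha_4 + \beta_4 + \gamma_2) G(t). \tag{23}
\end{align}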

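Writing (20) – (23) in matrix form gives

\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha_1 + \beta_1 + \gamma_1) & \alpha_2 & \beta_2 & \gamma_2 \\
\alpha_1 & -(\alpha_2 + \beta_3 + \gamma_3) & \gamma_4 & \beta_4 \\
\beta_1 & \gamma_3 & -(\alpha_3 + \beta_2 + \gamma_4) & \alpha_4 \\
\gamma_1 & \beta_3 & \alpha_3 & -(\alpha_4 + \beta_4 + \gamma_2)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}. \tag{24}
\]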
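The system (20) – (23) can also be explored numerically. The sketch below builds the 4 × 4 rate matrix from the twelve rates of Table 1 and advances the base probabilities with a simple forward Euler scheme. The function names and the choice of Euler integration are assumptions of this example, not the method used in the paper. Each column of the matrix sums to zero, which is why the base probabilities keep summing to 1.

```python
def rate_matrix(alpha, beta, gamma):
    """4x4 rate matrix for the bases ordered (U, C, A, G).

    alpha, beta, gamma are 4-tuples holding the rates with subscripts 1..4
    from Table 1 (index 0 holds subscript 1).
    """
    a1, a2, a3, a4 = alpha
    b1, b2, b3, b4 = beta
    g1, g2, g3, g4 = gamma
    return [
        [-(a1 + b1 + g1), a2, b2, g2],
        [a1, -(a2 + b3 + g3), g4, b4],   # C loses at rates a2, b3, g3, as in (21)
        [b1, g3, -(a3 + b2 + g4), a4],
        [g1, b3, a3, -(a4 + b4 + g2)],
    ]


def euler_step(matrix, probs, dt):
    """One forward Euler step of d(probs)/dt = matrix * probs."""
    return [
        p + dt * sum(matrix[i][j] * probs[j] for j in range(4))
        for i, p in enumerate(probs)
    ]


def integrate(matrix, probs, t_end, steps):
    """Advance the base probabilities from time 0 to t_end in `steps` Euler steps."""
    dt = t_end / steps
    for _ in range(steps):
        probs = euler_step(matrix, probs, dt)
    return probs
```

A call such as `integrate(rate_matrix(alpha, beta, gamma), [1.0, 0.0, 0.0, 0.0], t_end, steps)` follows a site that starts as uracil; a small step size relative to the rates is needed for the Euler approximation to be reasonable.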

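Using the fact that the sum of the base probabilities is equal to 1, that is, G(t) = 1 − U(t) − C(t) − A(t), the matrix equation reduces to

\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha_1 + \beta_1 + \gamma_1 + \gamma_2) & \alpha_2 - \gamma_2 & \beta_2 - \gamma_2 \\
\alpha_1 - \beta_4 & -(\alpha_2 + \beta_3 + \gamma_3 + \beta_4) & \gamma_4 - \beta_4 \\
\beta_1 - \alpha_4 & \gamma_3 - \alpha_4 & -(\alpha_3 + \beta_2 + \gamma_4 + \alpha_4)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
+
\begin{pmatrix} \gamma_2 \\ \beta_4 \\ \alpha_4 \end{pmatrix} \tag{25}
\]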

which can be written as

\[
\frac{d}{dt}\mathbf{B}_1(t) = \mathbf{Q}_1 \mathbf{B}_1(t) + \mathbf{C}_1, \tag{26}
\]

where the vector B1(t) holds U(t), C(t), and A(t), Q1 is the 3 × 3 matrix in (25), and C1 is a constant vector.

Solving this system of differential equations entails finding the eigenvalues of Q1. Although it is easy to get the eigenvalues of the 3 × 3 matrix Q1, the matrix equation in (26) is still difficult to solve since only the final conditions of the base probabilities can be approximated and the initial conditions are unknown. One way to avoid this problem is to work with the match probabilities instead of the base probabilities. The matrix equation involving the match probabilities is easier to solve since the initial conditions for the match probabilities are Pi(0) = Qi(0) = Ri(0) = 0, i = 1, . . . , 4 and S(0) = 1. After the expressions for the match probabilities have been solved, we can solve for the mean rate of base substitution k and hence the estimate of genetic distance K.

Inherent in these models of evolutionary base nucleotide substitutions are the

following four assumptions:

(1) The two sequences diverged from a common ancestor, that is, Pi(0) = Qi(0) =

Ri(0) = 0, i = 1, . . . , 4 and S(0) = 1.

(2) The two sequences are stochastically identical and independent, and within each

sequence, a substitution at one site in no way affects a substitution at any other site.

(3) The homologous sites chosen from the two sequences are of the same fixed length

during their descent from the common ancestor.

(4) (The fourth assumption reduces the number of parameters in the model by

assuming that some of the rates are equal. Since this differs among the three models that

we are going to consider, rather than stating it here, it will be stated as each model is

being considered.)

The 3ST Model

The first special case that we are going to consider is the three-substitution-type

(3ST) model. This model is due to Kimura (1981) and is the most general of the three

models we are going to consider in detail in this paper. The two other models we

consider later are special cases of this model. The fourth assumption in the 3ST model is

that the TS-type substitutions all have rates α, and that the TV-type substitutions have

rates β and γ depending on the specific type as shown in Figure 3. Under the 3ST model,

Tables 1 and 2 can be simplified and their simplified forms are given below as Tables 3

and 4, respectively.

The system of differential equations in (20) – (23) simplifies to

\begin{align}
\frac{dU(t)}{dt} &= -(\alpha + \beta + \gamma) U(t) + \alpha C(t) + \beta A(t) + \gamma G(t) \tag{27} \\
\frac{dC(t)}{dt} &= \alpha U(t) - (\alpha + \beta + \gamma) C(t) + \gamma A(t) + \beta G(t) \tag{28} \\
\frac{dA(t)}{dt} &= \beta U(t) + \gamma C(t) - (\alpha + \beta + \gamma) A(t) + \alpha G(t) \tag{29} \\
\frac{dG(t)}{dt} &= \gamma U(t) + \beta C(t) + \alpha A(t) - (\alpha + \beta + \gamma) G(t) \tag{30}
\end{align}

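and its corresponding matrix form is

\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha + \beta + \gamma) & \alpha & \beta & \gamma \\
\alpha & -(\alpha + \beta + \gamma) & \gamma & \beta \\
\beta & \gamma & -(\alpha + \beta + \gamma) & \alpha \\
\gamma & \beta & \alpha & -(\alpha + \beta + \gamma)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}, \tag{31}
\]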

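which again can be written in the form of (25). Considering the fact that the sum of the base probabilities is 1, we can simplify (31) to

\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha + \beta + 2\gamma) & \alpha - \gamma & \beta - \gamma \\
\alpha - \beta & -(\alpha + 2\beta + \gamma) & \gamma - \beta \\
\beta - \alpha & \gamma - \alpha & -(2\alpha + \beta + \gamma)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix}
+
\begin{pmatrix} \gamma \\ \beta \\ \alpha \end{pmatrix}. \tag{32}
\]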

We can also rewrite (32) in the form of (26). The matrix equation in (32) is not

difficult to solve since the eigenvalues are easily obtainable. The problem here is that we

do not know the initial conditions for the base probabilities since we do not know the base

frequencies of the ancestral sequence. As we have mentioned before, a way to avoid this

problem is to consider the match probabilities instead. It is easier to use the match

probabilities since we have the initial conditions for this set of probabilities given by the

first assumption (A1) of our model.

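Using the relationships between the base probabilities and the match probabilities given in (11) – (14) it can be shown that

\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
=
\begin{pmatrix}
-2(2\alpha + \beta + \gamma) & -2(\alpha - \gamma) & -2(\alpha - \beta) \\
-2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) & -2(\beta - \alpha) \\
-2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma)
\end{pmatrix}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 2\beta \\ 2\gamma \end{pmatrix} \tag{33}
\]

which in compact form is

\[
\frac{d}{dt}\mathbf{T}(t) = \mathbf{Q}_2 \mathbf{T}(t) + \mathbf{C}_2. \tag{34}
\]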

We now derive the expression for P (t) in (33). The expressions for Q(t) and R(t) can be

obtained in very much the same manner.

Recall from (12) that

\begin{align}
P(t) &= \text{probability of a TS-type difference at a homologous site} \tag{35} \\
     &= 2C(t)U(t) + 2A(t)G(t). \tag{36}
\end{align}

Using the product rule,

\[
\frac{dP(t)}{dt} = 2\left[ C(t)\frac{dU(t)}{dt} + U(t)\frac{dC(t)}{dt} \right] + 2\left[ A(t)\frac{dG(t)}{dt} + G(t)\frac{dA(t)}{dt} \right]. \tag{37}
\]

If we substitute the expressions for the derivatives of the base probabilities given in (27) – (30) we have

\[
\frac{dP(t)}{dt} = 2\bigl\{ -2\bigl(C(t)U(t) + A(t)G(t)\bigr)(\alpha + \beta + \gamma) + 2\beta\bigl(A(t)C(t) + G(t)U(t)\bigr) + 2\gamma\bigl(A(t)U(t) + G(t)C(t)\bigr) + \alpha\bigl(A^2(t) + C^2(t) + U^2(t) + G^2(t)\bigr) \bigr\}. \tag{38}
\]

Using (12) – (14) together with the fact that A²(t) + C²(t) + U²(t) + G²(t) = S(t) = 1 − P(t) − Q(t) − R(t), we can simplify (38) to obtain

\[
\frac{dP(t)}{dt} = 2\bigl\{ -(2\alpha + \beta + \gamma)P(t) + (\beta - \alpha)R(t) + (\gamma - \alpha)Q(t) + \alpha \bigr\} \tag{39}
\]

which is the first row of (33).

We now solve the matrix equation in (34). Define the following Laplace transform:

\[
\mathcal{L}[\mathbf{T}(t)] = \mathcal{L}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} p(s) \\ q(s) \\ r(s) \end{pmatrix} = \mathcal{T}(s). \tag{40}
\]

Applying the Laplace transform to (34), we get

\[
s\mathcal{T}(s) - \mathbf{T}(0) = \mathbf{Q}_2 \mathcal{T}(s) + \frac{1}{s}\mathbf{C}_2 \tag{41}
\]

which we can rewrite as

\[
-\frac{1}{s}\mathbf{C}_2 = (\mathbf{Q}_2 - s\mathbf{I}_3)\,\mathcal{T}(s), \tag{42}
\]

where we have used the fact that T(0) = 0 and I3 is the 3 × 3 identity matrix. The

problem of solving the system of differential equations in (34) is now reduced to solving a

system of algebraic equations in the three unknowns p(s), q(s), and r(s). We now solve for

these three unknowns and then apply the inverse Laplace transform to get the solutions

for P (t), Q(t), and R(t). Using Cramer’s rule, we get

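\[
p(s) = \frac{1}{\Delta}
\begin{vmatrix}
-2\alpha/s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\
-2\beta/s & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\
-2\gamma/s & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s
\end{vmatrix} \tag{43}
\]

\[
q(s) = \frac{1}{\Delta}
\begin{vmatrix}
-2(2\alpha + \beta + \gamma) - s & -2\alpha/s & -2(\alpha - \beta) \\
-2(\beta - \gamma) & -2\beta/s & -2(\beta - \alpha) \\
-2(\gamma - \beta) & -2\gamma/s & -2(\alpha + \beta + 2\gamma) - s
\end{vmatrix} \tag{44}
\]

\[
r(s) = \frac{1}{\Delta}
\begin{vmatrix}
-2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2\alpha/s \\
-2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2\beta/s \\
-2(\gamma - \beta) & -2(\gamma - \alpha) & -2\gamma/s
\end{vmatrix} \tag{45}
\]

where Δ is the determinant of the coefficient matrix Q2 − sI3 in (42),

\[
\Delta =
\begin{vmatrix}
-2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\
-2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\
-2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s
\end{vmatrix}. \tag{46}
\]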

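Upon simplifying and expressing the results in partial fractions we get

\begin{align}
p(s) &= \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} + \frac{1/4}{s + 4(\beta + \gamma)} \tag{47} \\
q(s) &= \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} + \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)} \tag{48} \\
r(s) &= \frac{1}{4s} + \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)}. \tag{49}
\end{align}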

Applying the inverse Laplace transform, we get the following as solutions to the system in (34),

\begin{align}
P(t) &= \mathcal{L}^{-1}\{p(s)\} = \frac{1}{4}\left(1 - e^{\lambda_1 t} - e^{\lambda_2 t} + e^{\lambda_3 t}\right) \tag{50} \\
Q(t) &= \mathcal{L}^{-1}\{q(s)\} = \frac{1}{4}\left(1 - e^{\lambda_1 t} + e^{\lambda_2 t} - e^{\lambda_3 t}\right) \tag{51} \\
R(t) &= \mathcal{L}^{-1}\{r(s)\} = \frac{1}{4}\left(1 + e^{\lambda_1 t} - e^{\lambda_2 t} - e^{\lambda_3 t}\right), \tag{52}
\end{align}

where λ1 = −4(α+β), λ2 = −4(α+γ), λ3 = −4(β+γ).
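The closed-form solutions (50) – (52) are easy to evaluate directly. The sketch below (the function name `match_probabilities_3st` is hypothetical) returns P(t), Q(t), and R(t) for given rates and time. At t = 0 all three vanish, as required by the first assumption, and for large t each approaches 1/4.

```python
import math


def match_probabilities_3st(alpha: float, beta: float, gamma: float, t: float):
    """Evaluate P(t), Q(t), R(t) of the 3ST model from (50)-(52)."""
    e1 = math.exp(-4.0 * (alpha + beta) * t)    # exp(lambda_1 * t)
    e2 = math.exp(-4.0 * (alpha + gamma) * t)   # exp(lambda_2 * t)
    e3 = math.exp(-4.0 * (beta + gamma) * t)    # exp(lambda_3 * t)
    p = 0.25 * (1.0 - e1 - e2 + e3)
    q = 0.25 * (1.0 - e1 + e2 - e3)
    r = 0.25 * (1.0 + e1 - e2 - e3)
    return p, q, r
```

Near t = 0 the three probabilities grow like 2αt, 2βt, and 2γt, which matches the constant terms of the differential equations for the match probabilities.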

Under the 3ST model, the equation for k in (15) can be expressed as

\[
k = \sum_{i=1}^{4} \left( \frac{\alpha + \beta + \gamma}{T} \int_0^T B_i(t)\, dt \right) = \alpha + \beta + \gamma, \tag{53}
\]

where we have used the fact that the sum of the base probabilities is equal to 1. Note that

the assumption on some of the rates being equal played a crucial role in being able to

factor α+β+γ out of the summation to get a simple expression for k. For K, we obtain

\[
K = 2T(\alpha + \beta + \gamma). \tag{54}
\]

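We can solve (50) – (52) for λ1, λ2, and λ3 to get

\begin{align}
4(\alpha + \beta)t &= -\ln\bigl(1 - 2P(t) - 2Q(t)\bigr) \tag{55} \\
4(\alpha + \gamma)t &= -\ln\bigl(1 - 2P(t) - 2R(t)\bigr) \tag{56} \\
4(\beta + \gamma)t &= -\ln\bigl(1 - 2Q(t) - 2R(t)\bigr), \tag{57}
\end{align}

and hence, for any time t ∈ [0, T],

\begin{align}
8(\alpha + \beta + \gamma)t &= -\ln\bigl\{[1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)]\bigr\} \tag{58} \\
K &= 2kt \tag{59} \\
  &= -\frac{1}{4}\ln\bigl\{[1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)]\bigr\}. \tag{60}
\end{align}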
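As a minimal sketch, the 3ST distance can be computed from observed proportions of TS-type, TV1-type, and TV2-type differences as follows. The function name `distance_3st` is hypothetical, and raising an error when a logarithm argument is not positive is a design choice of this example.

```python
import math


def distance_3st(p: float, q: float, r: float) -> float:
    """3ST estimate of K from the proportions of TS-type (p),
    TV1-type (q), and TV2-type (r) differences."""
    terms = (
        1.0 - 2.0 * p - 2.0 * q,   # estimates exp(-4(alpha + beta)t)
        1.0 - 2.0 * p - 2.0 * r,   # estimates exp(-4(alpha + gamma)t)
        1.0 - 2.0 * q - 2.0 * r,   # estimates exp(-4(beta + gamma)t)
    )
    if min(terms) <= 0.0:
        raise ValueError("sequences too divergent: a logarithm argument is not positive")
    return -0.25 * math.log(terms[0] * terms[1] * terms[2])
```

When p, q, and r are the exact match probabilities produced by the model for rates α, β, γ and time t, the function returns 2(α + β + γ)t, which is a convenient round-trip check.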

The variance for this estimate of K is also given in the paper of Kimura (1981). We have

\[
\sigma_K^2 = \frac{1}{n}\left\{ a^2 P(t) + b^2 Q(t) + c^2 R(t) - \bigl(aP(t) + bQ(t) + cR(t)\bigr)^2 \right\} \tag{61}
\]

where n is the number of nucleotide sites compared and

\begin{align}
a &= \frac{1}{2}\left( \frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2P(t) - 2R(t)} \right) \tag{62} \\
b &= \frac{1}{2}\left( \frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2Q(t) - 2R(t)} \right) \tag{63} \\
c &= \frac{1}{2}\left( \frac{1}{1 - 2P(t) - 2R(t)} + \frac{1}{1 - 2Q(t) - 2R(t)} \right). \tag{64}
\end{align}
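The estimate and its standard error can be computed together. In the sketch below the coefficients a, b, and c are the partial derivatives of K with respect to P, Q, and R, so the variance is the usual large-sample (delta-method) approximation; the function name `distance_3st_with_se` is hypothetical and n denotes the number of sites compared.

```python
import math


def distance_3st_with_se(p: float, q: float, r: float, n: int):
    """Return (K, standard error of K) under the 3ST model."""
    d1 = 1.0 - 2.0 * p - 2.0 * q
    d2 = 1.0 - 2.0 * p - 2.0 * r
    d3 = 1.0 - 2.0 * q - 2.0 * r
    if min(d1, d2, d3) <= 0.0 or n <= 0:
        raise ValueError("need n > 0 and positive logarithm arguments")
    k = -0.25 * math.log(d1 * d2 * d3)
    # Partial derivatives of K with respect to p, q, and r.
    a = 0.5 * (1.0 / d1 + 1.0 / d2)
    b = 0.5 * (1.0 / d1 + 1.0 / d3)
    c = 0.5 * (1.0 / d2 + 1.0 / d3)
    variance = (a * a * p + b * b * q + c * c * r - (a * p + b * q + c * r) ** 2) / n
    return k, math.sqrt(variance)
```

The function takes proportions rather than counts, so counts of each mismatch type should be divided by n before the call.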

The 2ST Model

We now proceed to a special case of the 3ST model, which is again due to Kimura (1980) and was published a year before the 3ST model. The original paper leaves the model nameless; we call it the two-substitution-type (2ST) model for convenience. Since the 2ST model is a special case of the 3ST model, we just give the results and do not go into the details. The fourth assumption here is that every transition has rate α and every transversion has rate β. Under this assumption the diagram in Figure 3 simplifies further to the diagram in Figure 4.

The tables for the base substitution and the match probabilities are given as Tables

5 and 6 below. The probability of a TS-type mismatch is given by P (t) and the

probability of a TV-type mismatch is given by QR(t) = Q(t)+ R(t). That is, we have

lumped together the TV1-type and TV2-type mismatches.

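The matrix equation in (24) under the 2ST model is

\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-(\alpha + 2\beta) & \alpha & \beta & \beta \\
\alpha & -(\alpha + 2\beta) & \beta & \beta \\
\beta & \beta & -(\alpha + 2\beta) & \alpha \\
\beta & \beta & \alpha & -(\alpha + 2\beta)
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} \tag{65}
\]

and the corresponding matrix equation involving the match probabilities is

\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
=
\begin{pmatrix}
-2(2\alpha + 2\beta) & -2(\alpha - \beta) & -2(\alpha - \beta) \\
0 & -2(\alpha + 3\beta) & -2(\beta - \alpha) \\
0 & -2(\beta - \alpha) & -2(\alpha + 3\beta)
\end{pmatrix}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 2\beta \\ 2\beta \end{pmatrix}. \tag{66}
\]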

If we now lump Q(t) and R(t) together as QR(t) we have the matrix equation in (67) which only involves a 2 × 2 matrix instead of the previous 3 × 3 matrix.

\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ QR(t) \end{pmatrix}
=
\begin{pmatrix}
-2(2\alpha + 2\beta) & -2(\alpha - \beta) \\
0 & -8\beta
\end{pmatrix}
\begin{pmatrix} P(t) \\ QR(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 4\beta \end{pmatrix} \tag{67}
\]

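To solve (67), we use the initial conditions P(0) = QR(0) = 0. As solutions we have

\begin{align}
P(t) &= \frac{1}{4} - \frac{1}{2}e^{\lambda_1 t} + \frac{1}{4}e^{\lambda_2 t} \tag{68} \\
QR(t) &= \frac{1}{2} - \frac{1}{2}e^{\lambda_2 t} \tag{69}
\end{align}

where λ1 = −4(α+β) and λ2 = −8β.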

Under the 2ST model k = α + 2β. We can solve (68) and (69) for αt and βt and therefore obtain our estimate K. We have

\begin{align}
K = 2kt &= 2(\alpha + 2\beta)t \tag{70} \\
        &= -\frac{1}{4}\ln\bigl\{[1 - 2P(t) - QR(t)]^2\,[1 - 2QR(t)]\bigr\}. \tag{71}
\end{align}

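The variance of this estimate is given by

\[
\sigma_K^2 = \frac{1}{n}\left\{ a^2 P(t) + b^2 QR(t) - \bigl(aP(t) + bQR(t)\bigr)^2 \right\} \tag{72}
\]

where

\begin{align}
a &= \frac{1}{1 - 2P(t) - QR(t)} \tag{73} \\
b &= \frac{1}{2}\left( \frac{1}{1 - 2P(t) - QR(t)} + \frac{1}{1 - 2QR(t)} \right). \tag{74}
\end{align}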
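A minimal sketch of the 2ST estimate and its standard error follows. The function name `distance_2st_with_se` is hypothetical; the coefficients a and b are taken as the partial derivatives of K in (71) with respect to P and QR, which gives the denominators 1 − 2P − QR and 1 − 2QR, and n is the number of sites compared.

```python
import math


def distance_2st_with_se(p: float, qr: float, n: int):
    """Return (K, standard error of K) under the 2ST model.

    p  -- proportion of sites with a TS-type difference
    qr -- proportion of sites with a TV-type difference (TV1 and TV2 lumped)
    """
    d1 = 1.0 - 2.0 * p - qr     # estimates exp(-4(alpha + beta)t)
    d2 = 1.0 - 2.0 * qr         # estimates exp(-8 beta t)
    if min(d1, d2) <= 0.0 or n <= 0:
        raise ValueError("need n > 0 and positive logarithm arguments")
    k = -0.25 * math.log(d1 * d1 * d2)
    a = 1.0 / d1                        # dK/dP
    b = 0.5 * (1.0 / d1 + 1.0 / d2)     # dK/dQR
    variance = (a * a * p + b * b * qr - (a * p + b * qr) ** 2) / n
    return k, math.sqrt(variance)
```

If the TV1-type and TV2-type proportions happen to be equal, the 2ST and 3ST formulas give the same K, since the 3ST product of logarithm arguments then reduces to the one in (71).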

The Jukes-Cantor Model

The simplest possible model is due to Jukes and Cantor (1969). The model was

primarily formulated to describe protein evolution by looking at the rate of amino acid

substitution. It turns out that this model can also be used to describe base substitution.

The fourth assumption here is that all the rates of substitution are equal, i.e., α =

αi = βi = γi, i = 1, . . ., 4. Figure 2 then becomes Figure 5 below. Under the

Jukes-Cantor model, Tables 1 and 2 can be simplified to Tables 7 and 8, respectively.

The matrix equation in (24) under the Jukes-Cantor model is

\[
\frac{d}{dt}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}
=
\begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}
\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} \tag{75}
\]

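and the matrix equation involving the match probabilities is

\[
\frac{d}{dt}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
=
\begin{pmatrix}
-8\alpha & 0 & 0 \\
0 & -8\alpha & 0 \\
0 & 0 & -8\alpha
\end{pmatrix}
\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix}
+
\begin{pmatrix} 2\alpha \\ 2\alpha \\ 2\alpha \end{pmatrix}. \tag{76}
\]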

If we define PQR(t) = P(t) + Q(t) + R(t) we have the differential equation

\[
\frac{d}{dt}PQR(t) = -8\alpha\, PQR(t) + 6\alpha \tag{77}
\]

which has as a solution

\[
PQR(t) = \frac{3}{4}\left(1 - e^{-8\alpha t}\right). \tag{78}
\]

Under the Jukes-Cantor model, k = 3α and the estimate K is

\[
K = 2kt = 6\alpha t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}PQR(t)\right) \tag{79}
\]

which can be obtained by solving for α in (78).
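For completeness, the intermediate steps of that inversion can be written out as follows, using only (78) and k = 3α:

```latex
\begin{align*}
PQR(t) &= \tfrac{3}{4}\left(1 - e^{-8\alpha t}\right) \\
e^{-8\alpha t} &= 1 - \tfrac{4}{3}\,PQR(t) \\
8\alpha t &= -\ln\left(1 - \tfrac{4}{3}\,PQR(t)\right) \\
K = 6\alpha t &= \tfrac{3}{4}\cdot 8\alpha t
  = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}\,PQR(t)\right)
\end{align*}
```

The logarithm is defined only when PQR(t) < 3/4, which is the limiting proportion of mismatches between two sequences as t grows.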

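The variance for K under the Jukes-Cantor model was derived by Kimura and Ohta (1972) and is given by

\[
\sigma_{JC}^2 = \frac{1}{n}\left\{ \frac{\bigl(1 - PQR(t)\bigr)\,PQR(t)}{\bigl(1 - 4PQR(t)/3\bigr)^2} \right\} = \frac{\bigl(1 - PQR(t)\bigr)\,PQR(t)}{n\,\bigl(1 - 4PQR(t)/3\bigr)^2}. \tag{80}
\]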
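The Jukes-Cantor estimate needs only the overall proportion of mismatched sites. The sketch below uses the hypothetical function name `distance_jc_with_se`; the variance is the delta-method form PQR(1 − PQR) / [n (1 − 4PQR/3)²], where the square arises from the derivative dK/dPQR = 1 / (1 − 4PQR/3).

```python
import math


def distance_jc_with_se(pqr: float, n: int):
    """Return (K, standard error of K) under the Jukes-Cantor model.

    pqr -- proportion of sites that differ between the two sequences
    n   -- number of sites compared
    """
    d = 1.0 - 4.0 * pqr / 3.0   # estimates exp(-8 alpha t)
    if d <= 0.0 or n <= 0:
        raise ValueError("need n > 0 and a mismatch proportion below 3/4")
    k = -0.75 * math.log(d)
    variance = pqr * (1.0 - pqr) / (n * d * d)
    return k, math.sqrt(variance)
```

For small mismatch proportions K is only slightly larger than the raw proportion, because few sites are expected to have undergone more than one substitution.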

We illustrate the three models by comparing the human and mouse protein kinase inhibitor sequences. These two nucleotide sequences were recently sequenced by Olsen and Uhler (1991a, 1991b). The sequences are more than a thousand base pairs long but only 231 of these are part of the coding region. Our analysis is limited to these 231 base pairs. The sequences are shown in Figure 6. Of the 231 bp, only 15 show mismatches. These are summarized in Table 9. The estimate K is usually computed separately for each codon position: the models assume that substitutions at different sites are independent of each other, but there is evidence that substitutions at adjacent sites are not. This will not be done here since we have only a small number of base pairs and the mismatches are quite far apart (except for the ones occurring at positions 200 and 201).

The estimate under each model is shown in Table 10. It is seen here that the

estimates do not differ so much from one model to the other. The variances are also not

that different from each other.

Estimates of genetic distance using some other nucleotide sequences are also available. Tavaré (1986) obtained estimates using human and mouse α-fetoprotein and serum albumin nucleotide sequences. The results he got for the human-mouse α-fetoprotein nucleotide sequences are reproduced below as Table 11. The data consist of 1824 base pairs and hence it was possible for him to compute the estimates by codon position.

Note that the estimates tend to be largest for the third codon position and smallest for the second codon position. Tavaré showed in his paper that the estimates are not homogeneous if we consider the codon positions as strata. Unfortunately, we cannot do the same thing in our analysis here since we have just 231 bp and 15 mismatches.

All three models of evolutionary base substitutions that we have discussed here are

far from perfect and their weaknesses lie in the second and third assumptions made to

formulate the models.

The second assumption states that the nucleotide sequences are stochastically identical and independent of each other. It is quite possibly true that nucleotide sequences evolve independently of each other, but there is evidence that they are in fact not stochastically identical. For example, Wu and Li (1985) noticed that the substitution rate in rodents is much higher than that in humans. Even within a sequence, there is evidence that rates are much higher in some spots (“hot spots”) than in others (Miyata and Yasunaga, 1981; Brown and Clegg, 1983) and that the rates differ between the sense and antisense strands (Wu and Maeda, 1987). There is also evidence that a substitution at one site does affect the rate of substitution at an adjacent site in phage T4 (Koch, 1971). It would be interesting to know if the same holds for higher organisms. This last fact is also one of the reasons why substitution rates are computed by codon position if the data allow.

The third assumption is that the diverging nucleotide sequences are both of a fixed length; hence it does not take into account mutations resulting from deletions and insertions. This assumption also does not take into account the possibility of concerted evolution, which brings about the presence of multigene families, and the duplication and divergence in multigene families.

There have been efforts to formulate models that address these shortcomings while remaining mathematically tractable. Needleman and Wunsch (1970), for example, proposed a model which assigns weights to substitutions, insertions and deletions. Unfortunately, the weights assigned were arbitrary and had no genetic basis.

The main problem that these models of evolutionary base nucleotide substitution

face is that when all of the mechanisms of evolution are included in the model, the model

becomes mathematically intractable with the present computer technology. Considering

the fact that computer technology is still advancing, it is hoped that a model incorporating

most, if not all, of the mechanisms discussed can be formulated in the near future.

References

Brown, A., & Clegg, M. (1983). Analysis of variation in related DNA sequences. In B. Weir (Ed.), Statistical data analysis (pp. 107–132). New York: Marcel-Dekker.

Cavalli-Sforza, L., & Bodmer, W. (1971). The genetics of human populations. San Francisco: W. H. Freeman.

Cavalli-Sforza, L., & Edwards, A. (1967). Phylogenetic analysis: models and estimation procedures. American Journal of Human Genetics, 19, 233–257.

Edwards, A. (1971). The distance between populations on the basis of gene frequencies. Biometrics, 27, 873–881.

Jukes, T., & Cantor, C. (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian protein metabolism (pp. 21–123). New York: Academic Press.

Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111–120.

Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences USA, 78, 454–458.

Kimura, M., & Ohta, T. (1972). On the stochastic model for estimation of mutational distance between homologous proteins. Journal of Molecular Evolution, 2, 87–90.

Koch, R. (1971). The influence of neighbouring base pairs upon base-pair substitution mutation rates. Proceedings of the National Academy of Sciences USA, 68, 773–776.

Maxam, A., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences USA, 74, 560–564.

Miura, R. (Ed.). (1986). Lectures on mathematics in the life sciences. Rhode Island: American Mathematical Society.

Miyata, T., & Yasunaga, T. (1981). Rapidly evolving mouse α-globin-related pseudogenes. Proceedings of the National Academy of Sciences USA, 78, 450–453.

Munro, H. N. (Ed.). (1969). Mammalian protein metabolism. New York: Academic Press.

Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.

Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Annals of Human Genetics, 41, 225–233.

Olsen, S., & Uhler, M. (1991a). (nucleotide sequence of the human protein kinase inhibitor). Molecular Endocrinology. (manuscript submitted)

Olsen, S., & Uhler, M. (1991b). (nucleotide sequence of the mouse protein kinase inhibitor). Journal of Biological Chemistry. (in press)

Sanger, F., Nicklen, S., & Coulson, A. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences USA, 74, 4563–4567.

Takahata, N., & Kimura, M. (1981). A model of evolutionary base substitutions and its application with special reference to rapid change in pseudogenes. Genetics, 98, 641–657.

Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In R. Miura (Ed.), Lectures on mathematics in the life sciences (pp. 57–86). Rhode Island: American Mathematical Society.

Weir, B. (Ed.). (1983). Statistical data analysis. New York: Marcel-Dekker.

Weir, B. (1990). Genetic data analysis: methods for discrete population data. Sunderland, Massachusetts: Sinauer Associates.

Wu, C., & Li, W. (1985). Evidence for higher rates of nucleotide substitution in rodents than in man. Proceedings of the National Academy of Sciences USA, 82, 1741–1745.

Wu, C., & Maeda, N. (1987). Inequality in mutation rates of the two strands of DNA. Nature, 327, 169–170.

Table 1

Types and rates of nucleotide substitution.

                Transition (TS)      Transversion (TV1)   Transversion (TV2)
Initial base    U    C    A    G     U    A    C    G     U    G    C    A
New base        C    U    G    A     A    U    G    C     G    U    A    C
Rates           α1   α2   α3   α4    β1   β2   β3   β4    γ1   γ2   γ3   γ4

Table 2

Possible nucleotide base pairings at a specific homologous site for t > 0.

                Same                 TS-type              TV1-type             TV2-type
Sequence 1      U    C    A    G     U    C    A    G     U    A    C    G     U    G    C    A
Sequence 2      U    C    A    G     C    U    G    A     A    U    G    C     G    U    A    C
Probabilities   S1   S2   S3   S4    P1   P2   P3   P4    Q1   Q2   Q3   Q4    R1   R2   R3   R4

Table 3

Types and rates of nucleotide substitution under the 3ST model.

                Transition (TS)      Transversion (TV1)   Transversion (TV2)
Initial base    U    C    A    G     U    A    C    G     U    G    C    A
New base        C    U    G    A     A    U    G    C     G    U    A    C
Rates           α    α    α    α     β    β    β    β     γ    γ    γ    γ

Table 4

Possible nucleotide base pairings at a specific homologous site for t > 0 under the 3ST model.

                Same                 TS-type              TV1-type             TV2-type
Sequence 1      U    C    A    G     U    C    A    G     U    A    C    G     U    G    C    A
Sequence 2      U    C    A    G     C    U    G    A     A    U    G    C     G    U    A    C
Probabilities   S                    P                    Q                    R

Table 5

Types and rates of nucleotide substitution under the 2ST model.

                Transition (TS)      Transversion (TV1)   Transversion (TV2)
Initial base    U    C    A    G     U    A    C    G     U    G    C    A
New base        C    U    G    A     A    U    G    C     G    U    A    C
Rates           α    α    α    α     β    β    β    β     β    β    β    β

Table 6

Possible nucleotide base pairings at a specific homologous site for t > 0 under the 2ST model.

                Same                 TS-type              TV1-type             TV2-type
Sequence 1      U    C    A    G     U    C    A    G     U    A    C    G     U    G    C    A
Sequence 2      U    C    A    G     C    U    G    A     A    U    G    C     G    U    A    C
Probabilities   S                    P                    QR (TV1-type and TV2-type combined)

Table 7

Types and rates of nucleotide substitution under the Jukes-Cantor model.

                Transition (TS)      Transversion (TV1)   Transversion (TV2)
Initial base    U    C    A    G     U    A    C    G     U    G    C    A
New base        C    U    G    A     A    U    G    C     G    U    A    C
Rates           α    α    α    α     α    α    α    α     α    α    α    α

Table 8

Possible nucleotide base pairings at a specific homologous site for t > 0 under the Jukes-Cantor model.

                Same                 TS-type              TV1-type             TV2-type
Sequence 1      U    C    A    G     U    C    A    G     U    A    C    G     U    G    C    A
Sequence 2      U    C    A    G     C    U    G    A     A    U    G    C     G    U    A    C
Probabilities   S                    PQR (all three mismatch types combined)

Table 9

Nucleotide mismatches observed after time T since divergence between human and mouse protein kinase inhibitor (pki).

                   Transition (TS)      Transversion (TV1)   Transversion (TV2)
Human pki          U    C    A    G     U    A    C    G     U    G    C    A
Mouse pki          C    U    G    A     A    U    G    C     G    U    A    C
Numbers observed   5    0    3    2     0    1    1    6     0    1    1    2

Table 10

Estimates of the genetic distance K under the different models being considered.

Model          K           Standard error
Jukes-Cantor   0.0682288   0.0178312
2ST            0.0686475   0.0180611
3ST            0.0686535   0.0180644

Table 11

Estimates of the genetic distance Ki, where i = 1, 2, or 3 is the ith codon position, under the different models considered in Tavaré (1986). The sequence data are those of human and mouse α-fetoprotein.

Model          K1                K2                K3
Jukes-Cantor   0.1752 (0.0186)   0.1387 (0.0162)   0.6566 (0.0483)
3ST            0.1760 (0.0188)   0.1389 (0.0163)   0.7230 (0.0642)

(The parenthesized quantities are standard errors.)

Figure Captions

Figure 1. Divergence of sequences S1 and S2 from some common ancestor.

Figure 2. Types and rates of nucleotide substitutions.

Figure 3. Types and rates of nucleotide substitutions: 3ST Model.

Figure 4. Types and rates of nucleotide substitutions: 2ST Model.

Figure 5. Types and rates of nucleotide substitutions: Jukes-Cantor Model.

Figure 6. The nucleotide sequences of the coding region of the mouse protein kinase

inhibitor (Mpki.M) and the human protein kinase inhibitor (Hpki.2) are shown above.

The 15 mismatches are indicated with bars (Olsen and Uhler, 1991a, 1991b).

[Figure 1: diagram of the ancestral sequence diverging into S1 and S2 along two branches, each of length T.]

[Figure 2: diagram with A and G on the top row and U and C on the bottom row; horizontal arrows carry the transition rates α1–α4, vertical arrows the TV1 rates β1–β4, and diagonal arrows the TV2 rates γ1–γ4.]

[Figure 3: same layout as Figure 2, with rates α (horizontal), β (vertical), and γ (diagonal).]

[Figure 4: same layout as Figure 2, with rate α on the horizontal arrows and β on the vertical and diagonal arrows.]

[Figure 5: same layout as Figure 2, with rate α on every arrow.]