CS262 Problem Set 4 - Stanford University

Problem 1

Problem 1a

Sequences: A, A, T, T
Parameters: {+2, -1, -1}

Optimal alignments:

Ungapped    vs.    Gapped    (or, equivalently)
A                  -A        A-
A                  -A        A-
T                  T-        -T
T                  T-        -T

Sum-of-pairs scores: 0 > -4
Consensus scores:    2 < 4

Problem 1b

Distinct Sequences: AC, AG, TC, TG
Parameters: {+2, -1, -1}

Optimal alignments:

Ungapped    vs.    Gapped (four equivalent forms)
AC                 -AC-    A-C-    -A-C    A--C
AG                 -A-G    A--G    -AG-    A-G-
TC                 T-C-    -TC-    T--C    -T-C
TG                 T--G    -T-G    T-G-    -TG-

Sum-of-pairs scores: 0 > -8
Consensus scores:    4 < 8
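The sum-of-pairs and consensus scores for Problems 1a and 1b can be checked with a small script. This is only a sketch under stated assumptions: the parameters {+2, -1, -1} are read as match, mismatch and gap, a gap facing a gap is assumed to score 0, and the function names are illustrative rather than taken from the problem set.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 2, -1, -1  # assumed reading of the {+2, -1, -1} parameters


def pair_score(a, b):
    """Score two aligned symbols. Assumption: a gap facing a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(rows):
    """Add pair_score over every pair of rows, column by column."""
    return sum(pair_score(a, b)
               for column in zip(*rows)
               for a, b in combinations(column, 2))


def consensus_score(rows):
    """Score every row against the best consensus letter of each column."""
    total = 0
    for column in zip(*rows):
        letters = set(column) - {"-"}  # assumes no column consists only of gaps
        total += max(sum(pair_score(c, x) for x in column) for c in letters)
    return total


for rows in (["A", "A", "T", "T"], ["-A", "-A", "T-", "T-"],
             ["AC", "AG", "TC", "TG"], ["-AC-", "-A-G", "T-C-", "T--G"]):
    print(rows, sum_of_pairs(rows), consensus_score(rows))
```

Under these assumptions the ungapped alignments win on sum-of-pairs and the gapped ones win on consensus, matching the comparison in both parts.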

Problem 1b

Tree leaves: AG, AC, TC, TG; internal nodes: S1, S2

Optimal phylogenetic alignment: S1 = S2 = ATCG (or ATGC/TACG/TAGC)

Different from the sum-of-pairs alignment.

        Gapped alignments               Ungapped
        T-G-    T--G    -TG-    -T-G    TC
        T--C    T-C-    -T-C    -TC-    TG
S1:     TAGC    TACG    ATGC    ATCG    AC
S2:     TAGC    TACG    ATGC    ATCG    AC
        -A-C    -AC-    A--C    A-C-    AC
        -AG-    -A-G    A-G-    A--G    AG

Phylogenetic score:
        24      24      24      24      8

Problem 2

Problem 2a

x: TACCCGAT
y: TAAACGAT
z: AAAACGCG
w: AAAACGAT

S(u, v)   y   z   w
x         2   5   3
y             3   1
z                 2

Ultrametricity: for any a, b, c, if S(a,b) = min[S(a,b), S(a,c), S(b,c)], then S(a,b) ≤ S(a,c) = S(b,c).

Here S(x,y) = 2, S(x,z) = 5, S(y,z) = 3, so S(x,y) = min[S(x,y), S(x,z), S(y,z)], but S(x,z) = 5 ≠ 3 = S(y,z). Therefore S is not ultrametric.
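The entries of S(u, v) agree with the number of mismatching positions between the sequences, so the matrix and the ultrametricity test can be reproduced as follows. This is a sketch: treating S as Hamming distance is an observation about the numbers, not something the slides state, and the helper names are hypothetical.

```python
from itertools import combinations

seqs = {"x": "TACCCGAT", "y": "TAAACGAT", "z": "AAAACGCG", "w": "AAAACGAT"}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


# Distance table keyed by unordered pairs of sequence names.
S = {frozenset((u, v)): hamming(seqs[u], seqs[v])
     for u, v in combinations(seqs, 2)}


def violating_triples(names, dist):
    """Triples whose two largest pairwise distances are not equal."""
    bad = []
    for a, b, c in combinations(names, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if d[1] != d[2]:  # ultrametric three-point condition fails
            bad.append((a, b, c))
    return bad


bad = violating_triples(list(seqs), S)
print(bad)
```

A single violating triple such as (x, y, z) is enough to show that the distances are not ultrametric.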

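Problem 2a

UPGMA

Step 1: the smallest entry is S(y, w) = 1, so join y and w. Each leaf is placed at distance 1/2 below the new node.

S(u, v)   z   yw
x         5   ½(2+3)
z             ½(3+2)

Step 2: S(x, yw) = 5/2 (a tie with S(z, yw); the solution joins x). The new node xyw has height 5/4.

S(u, v)   xyw
z         (1/3)(5+3+2)

Step 3: S(z, xyw) = 10/3, so z joins xyw at the root, which has height 5/3.

Resulting tree: (((y, w), x), z), with node heights 1/2, 5/4 and 5/3.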
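The UPGMA merges above can be traced with a short script. It is a minimal sketch rather than a reference implementation: the function name is hypothetical, cluster names are formed by concatenation, and ties are broken by list order, which happens to reproduce the choice of joining x before z.

```python
from fractions import Fraction
from itertools import combinations


def upgma(names, dist):
    """dist maps frozenset({a, b}) -> distance. Returns [(a, b, height), ...]."""
    size = {n: 1 for n in names}
    d = {k: Fraction(v) for k, v in dist.items()}
    active = list(names)
    merges = []
    while len(active) > 1:
        # min() keeps the first minimal pair, so ties are broken by list order.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        new = a + b
        merges.append((a, b, d[frozenset((a, b))] / 2))
        rest = [c for c in active if c not in (a, b)]
        for c in rest:
            # Size-weighted average of the distances to the merged clusters.
            d[frozenset((new, c))] = (
                size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        active = rest + [new]
    return merges


S = {frozenset(p): v for p, v in {
    ("x", "y"): 2, ("x", "z"): 5, ("x", "w"): 3,
    ("y", "z"): 3, ("y", "w"): 1, ("z", "w"): 2}.items()}

merges = upgma(["x", "y", "z", "w"], S)
for a, b, height in merges:
    print(a, b, height)
```

Exact fractions are used so that the node heights come out as 1/2, 5/4 and 5/3 rather than rounded decimals.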

Problem 2b

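x: TACCCGAT
y: TAAACGAT
z: AAAACGCG
w: AAAACGAT

Tree 1: (((y, w), x), z), leaves in the order y, w, x, z. Sets are listed for the nodes (y,w), ((y,w),x) and the root.

Leaves (y w x z)   Sets
T A T A            {A,T}  {T}    {A,T}
A A C A            {A}    {A,C}  {A}
A A A C            {A}    {A}    {A,C}

The first pattern is site 1 (2 mutations). The second pattern occurs at sites 3 and 4, and the third pattern at sites 7 and 8 (1 mutation each).

Total cost = 6 mutations

Tree 2: ((x, y), (z, w)), leaves in the order x, y, z, w. Sets are listed for the nodes (x,y), the root and (z,w).

Leaves (x y z w)   Sets
T T A A            {T}    {A,T}  {A}
C A A A            {A,C}  {A}    {A}
A A C A            {A}    {A}    {A,C}

The first pattern is site 1; the second occurs at sites 3 and 4; the third at sites 7 and 8. Each costs 1 mutation.

Total cost = 5 mutations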
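The two totals can be recomputed with the set-intersection (Fitch-style) bottom-up pass that the slides illustrate. The sketch below assumes the two topologies are (((y, w), x), z) and ((x, y), (z, w)), as read from the sets shown on the slides; the nested-tuple tree encoding and function names are illustrative choices.

```python
def fitch(tree, leaf_state):
    """Return (candidate state set, mutation count) for a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf is just its name
        return {leaf_state[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_state)
    right_set, right_cost = fitch(right, leaf_state)
    common = left_set & right_set
    if common:                         # children agree: no new mutation
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, seqs):
    """Sum the per-site mutation counts over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(n_sites))


seqs = {"x": "TACCCGAT", "y": "TAAACGAT", "z": "AAAACGCG", "w": "AAAACGAT"}
tree_1 = ((("y", "w"), "x"), "z")   # assumed: the UPGMA topology
tree_2 = (("x", "y"), ("z", "w"))   # assumed: the alternative topology

cost_1 = parsimony_cost(tree_1, seqs)
cost_2 = parsimony_cost(tree_2, seqs)
print(cost_1, cost_2)
```

With these inputs the second topology needs one mutation fewer than the first, in line with the totals of 6 and 5.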

Problem 2c

D(i,j)   2     3     4
1        0.3   0.5   0.6
2              0.6   0.5
3                    0.9

Neighbor-joining:

r(i) = (Σ_k D(i,k)) / (|L| - 2)
d(i,j) = D(i,j) - r(i) - r(j)

i   r(i)
1   0.7
2   0.7
3   1.0
4   1.0

d(i,j)   2      3      4
1        -1.1   -1.2   -1.1
2               -1.1   -1.2
3                      -1.1

The minimum of d(i,j) is -1.2, attained at (1,3) and (2,4), so leaves 1 and 3 are neighbors and leaves 2 and 4 are neighbors.

Resulting tree: leaves 1 and 2 hang on edges of length 0.1, leaves 3 and 4 on edges of length 0.4, and the two internal nodes are joined by an edge of length 0.1.
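One neighbor-joining step on this matrix can be sketched in code. The r(i) and d(i,j) formulas are the ones given on the slide; the branch-length formula ½(D(i,j) + r(i) - r(j)) is the standard neighbor-joining one and is an addition here, as are the function and variable names.

```python
from itertools import combinations

D = {(1, 2): 0.3, (1, 3): 0.5, (1, 4): 0.6,
     (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.9}
leaves = [1, 2, 3, 4]


def dist(i, j):
    """Symmetric lookup into the upper-triangular table D."""
    return D[(min(i, j), max(i, j))]


def nj_step(leaves):
    """Compute r(i) and the corrected distances d(i,j) for one NJ step."""
    n = len(leaves)
    r = {i: sum(dist(i, k) for k in leaves if k != i) / (n - 2) for i in leaves}
    d = {(i, j): dist(i, j) - r[i] - r[j] for i, j in combinations(leaves, 2)}
    return r, d


r, d = nj_step(leaves)
best = min(d.values())
# Collect every pair that attains the minimum (up to rounding error).
neighbor_pairs = [p for p, v in d.items() if abs(v - best) < 1e-9]

i, j = neighbor_pairs[0]
branch_i = 0.5 * (dist(i, j) + r[i] - r[j])  # standard NJ branch length (assumed)
branch_j = 0.5 * (dist(i, j) + r[j] - r[i])
print(neighbor_pairs, branch_i, branch_j)
```

Both minimal pairs are reported because the corrected distances tie; joining either pair first leads to the same unrooted tree for four leaves.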

Problem 3

(a)

$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$

For p = 0.05, d = 0.0517. For p = 0.2, d = 0.2326 > 4 × 0.0517 = 0.2068. d is the average number of substitutions per site, while p is the probability that a given site will differ between the two genomes. Since it is possible that multiple substitutions occur at the same site, d ≥ p. For small p, the difference between p and d is small. As the evolutionary distance between the two genomes increases, i.e. as p increases, multiple substitutions at the same site become more likely, and the difference between p and d increases. This is intuitively why, when p increases 4 times, d increases more than 4 times.
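The two numerical values in part (a) follow directly from the distance formula. A minimal check, with a hypothetical function name:

```python
import math


def jukes_cantor_distance(p):
    """d = -(3/4) ln(1 - (4/3) p), valid for 0 <= p < 3/4."""
    return -0.75 * math.log(1 - 4 * p / 3)


d_small = jukes_cantor_distance(0.05)
d_large = jukes_cantor_distance(0.2)
print(round(d_small, 4), round(d_large, 4), round(4 * d_small, 4))
```

The comparison between d_large and 4 * d_small is the super-linear growth discussed in the text.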

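(b)

$$
\begin{aligned}
r(t_{XY})\,r(t_{YZ}) + 3\,s(t_{XY})\,s(t_{YZ})
&= \frac{1}{4}\cdot\frac{1}{4}\left(1 + 3e^{-\mu t_{XY}}\right)\left(1 + 3e^{-\mu t_{YZ}}\right) + 3\cdot\frac{1}{4}\cdot\frac{1}{4}\left(1 - e^{-\mu t_{XY}}\right)\left(1 - e^{-\mu t_{YZ}}\right) \\
&= \frac{1}{16}\left(1 + 3e^{-\mu t_{XY}} + 3e^{-\mu t_{YZ}} + 9e^{-\mu(t_{XY}+t_{YZ})} + 3 - 3e^{-\mu t_{XY}} - 3e^{-\mu t_{YZ}} + 3e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= \frac{1}{16}\left(4 + 12e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= \frac{1}{4}\left(1 + 3e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= r(t_{XY} + t_{YZ})
\end{aligned}
$$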

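(c)

Using the fact that r = 1 − p and r + 3s = 1:

$$
\begin{aligned}
1 - p_{XZ} &= (1 - p_{XY})(1 - p_{YZ}) + 3\,\frac{p_{XY}}{3}\,\frac{p_{YZ}}{3} \\
&= 1 - p_{XY} - p_{YZ} + p_{XY}p_{YZ} + \frac{p_{XY}p_{YZ}}{3} \\
&= 1 - \left(p_{XY} + p_{YZ} - \frac{4}{3}p_{XY}p_{YZ}\right)
\end{aligned}
$$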

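(d)

$$
\begin{aligned}
d_{XY} + d_{YZ} &= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XY}\right) - \frac{3}{4}\ln\left(1 - \frac{4}{3}p_{YZ}\right) \\
&= -\frac{3}{4}\ln\left[\left(1 - \frac{4}{3}p_{XY}\right)\left(1 - \frac{4}{3}p_{YZ}\right)\right] \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XY} - \frac{4}{3}p_{YZ} + \frac{4}{3}p_{XY}\cdot\frac{4}{3}p_{YZ}\right) \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}\left(p_{XY} + p_{YZ} - \frac{4}{3}p_{XY}p_{YZ}\right)\right) \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XZ}\right)
\end{aligned}
$$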
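The additivity shown in parts (c) and (d) can be sanity-checked numerically. The sketch below picks two arbitrary mismatch probabilities (the values 0.05 and 0.2 are only examples), composes them with the formula from part (c), and compares the distances; the function names are hypothetical.

```python
import math


def jc_distance(p):
    """Distance from part (a): d = -(3/4) ln(1 - (4/3) p)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def compose(p_xy, p_yz):
    """Part (c): mismatch probability between X and Z through Y."""
    return p_xy + p_yz - 4 * p_xy * p_yz / 3


p_xy, p_yz = 0.05, 0.2            # example values, not from the problem set
p_xz = compose(p_xy, p_yz)
lhs = jc_distance(p_xy) + jc_distance(p_yz)
rhs = jc_distance(p_xz)
print(p_xz, lhs, rhs)
```

The two printed distances agree up to floating-point error, whereas the mismatch probabilities themselves do not add: p_xz is smaller than p_xy + p_yz.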

Problem 4a.i

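[Figure: an ORF of n codons, running from the start codon up to the stop codon.]

$$
P(L = n) = \begin{cases} 0, & n = 0 \\ (1-s)^{n-1}\,s, & n > 0 \end{cases}
$$

$$
s = P(\text{TAA} \cup \text{TAG} \cup \text{TGA}) = P(\text{TAA}) + P(\text{TAG}) + P(\text{TGA}) = \frac{1}{64} + \frac{1}{64} + \frac{1}{64} = \frac{3}{64} = 0.046875
$$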

For the first part of problem 4, you were asked to derive some ORF length distribution statistics and do some simple calculations on them. To begin, suppose we have a sequence of completely random nucleotides and we select a random ORF from it. We asked you to show what the probability distribution of lengths for such an ORF is, where the length is the number of codons including the start codon but excluding the stop codon. This probability is clearly 0 for length 0 (since we need at least 1 codon for an ORF). For any non-zero length n, the probability is simply the probability of having (n-1) codons that are all not stop codons, followed by one stop codon. The variable s is the probability that a random codon is a stop codon.
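As a quick check that this is a proper (geometric) distribution, the probabilities can be summed numerically. This is a sketch with a hypothetical function name; the sum is truncated at 2000 codons, beyond which the remaining mass is negligible.

```python
def orf_length_pmf(n, s=3 / 64):
    """P(L = n): (n - 1) non-stop codons followed by one stop codon."""
    if n <= 0:
        return 0.0
    return (1 - s) ** (n - 1) * s


lengths = range(1, 2001)  # truncation point chosen for illustration
total = sum(orf_length_pmf(n) for n in lengths)
mean_length = sum(n * orf_length_pmf(n) for n in lengths)
print(total, mean_length)
```

The mean comes out at 1/s = 64/3, a little over 21 codons, which gives a feel for how short random ORFs typically are.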

Problem 4a.ii

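$$
\begin{aligned}
P(L \ge 100) &= \sum_{n=100}^{\infty} P(L = n) = 1 - \sum_{n=1}^{99} P(L = n) \\
&= 1 - \sum_{n=1}^{99} s(1-s)^{n-1} = 1 - s\sum_{k=0}^{98} (1-s)^{k} \\
&= 1 - s\,\frac{1 - (1-s)^{99}}{1 - (1-s)} \\
&= (1-s)^{99} \approx 0.86265\%
\end{aligned}
$$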

Then, we asked you to compute the probability that the length of the random ORF is at least 100 codons. This is simply the sum of our previous probability distribution over lengths 100 to infinity, which is the same as 1 minus the sum of the distribution from 1 to 99. This is simply a geometric sum and can be computed as shown in this slide. As you can see, in random sequence, we’d expect to see an ORF of at least 100 codons very rarely.

Problem 4a.iii

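[Figure: an ORF of n codons, running from the start codon up to the stop codon.]

$$
P(L = n) = \begin{cases} 0, & n = 0 \\ (1-s)^{n-1}\,s, & n > 0 \end{cases}
$$

$$
s = P(\text{TAA} \cup \text{TAG} \cup \text{TGA}) = P(\text{TAA}) + P(\text{TAG}) + P(\text{TGA}) = (0.2)(0.2)(0.2) + (0.2)(0.2)(0.3) + (0.2)(0.3)(0.2) = 0.032
$$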

For problem 4(a)iii, we essentially have the same distribution, except that the probability of having a particular codon be a stop codon, given by s, is slightly lower.

Problem 4a.iii

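$$
\begin{aligned}
P(L \ge 100) &= \sum_{n=100}^{\infty} P(L = n) = 1 - \sum_{n=1}^{99} P(L = n) \\
&= 1 - \sum_{n=1}^{99} s(1-s)^{n-1} = 1 - s\sum_{k=0}^{98} (1-s)^{k} \\
&= 1 - s\,\frac{1 - (1-s)^{99}}{1 - (1-s)} \\
&= (1-s)^{99} \approx 3.99632\%
\end{aligned}
$$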

With this change, the probability that a random ORF is 100 codons or longer increases to almost 4%.
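Both tail probabilities can be recomputed from the closed form (1 - s)^99 and cross-checked against the explicit sum. A minimal sketch with hypothetical names; the base frequencies P(A) = P(T) = 0.2 and P(G) = 0.3 are the ones that appear in the stop-codon products for part iii.

```python
def prob_at_least(n_codons, s):
    """Closed form: P(L >= n) = (1 - s) ** (n - 1)."""
    return (1 - s) ** (n_codons - 1)


def prob_at_least_by_sum(n_codons, s):
    """Same quantity as 1 minus the sum of P(L = k) for k = 1 .. n - 1."""
    return 1 - sum(s * (1 - s) ** (k - 1) for k in range(1, n_codons))


s_uniform = 3 / 64                                             # part ii
s_skewed = 0.2 * 0.2 * 0.2 + 0.2 * 0.2 * 0.3 + 0.2 * 0.3 * 0.2  # part iii

tail_uniform = prob_at_least(100, s_uniform)
tail_skewed = prob_at_least(100, s_skewed)
print(f"{tail_uniform:.5%}  {tail_skewed:.5%}")
```

Lowering s from about 0.047 to 0.032 raises the chance of a long random ORF by more than a factor of four, because s enters through a 99th power.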

Problem 4b.i

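$$
P(\text{predicting at least 1 gene}) = P(\text{at least 1 ORF} \ge 100 \text{ codons}) = 1 - P(\text{no ORF} \ge 100 \text{ codons})
$$

Let $O_i$ = the event that there exists an ORF of at least 100 codons starting at position $i$.

$$
\begin{aligned}
P(\text{no ORF} \ge 100 \text{ codons}) &= P\left(\overline{O}_1 \cap \overline{O}_2 \cap \cdots \cap \overline{O}_{L-299}\right) \\
&= P\left(\left(\overline{O}_1 \cap \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \cap \left(\overline{O}_2 \cap \overline{O}_5 \cap \overline{O}_8 \cap \cdots\right) \cap \left(\overline{O}_3 \cap \overline{O}_6 \cap \overline{O}_9 \cap \cdots\right)\right)
\end{aligned}
$$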

For problem 4(b) you were asked to prove an upper bound on the probability of falsely predicting a gene in a noncoding region with random bases. To compute this probability, first notice that it is 1 minus the probability that there are no falsely predicted genes, i.e. that there are no ORFs of length at least 100 codons. Now, let’s define Oi to be the event that there exists an ORF starting at position i that is at least 100 codons long. In other words, there is a start codon at position i, and for 99 codons thereafter there are no stop codons. “O” with a bar on top of it is the complement of the event. Next, let’s compute the probability that there are no ORFs of at least 100 codons. This is equal to the probability that there is no such ORF at position 1, nor at position 2, and so forth until position (L-299), after which we could not possibly fit an ORF of length 100 codons. Let’s rearrange the terms within the probability so that terms within the same frame are grouped together, such as O1, O4, O7, etc.

Problem 4b.i

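$$
P\left(\overline{O}_1 \cap \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) = P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) P\left(\overline{O}_4 \cap \overline{O}_7 \cap \cdots\right)
$$

Claim:

$$
P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \ge P\left(\overline{O}_1\right)
$$

$$
P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \le P\left(O_1\right)
$$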

Next, let’s split up this probability using the product rule for conditional probabilities. We can pull out O1 as shown in this equation. Why would we want to do this? Well, my claim is that the probability of not having a long ORF at position 1, given that we know we don’t have long ORFs at positions 4, 7, etc., is greater than or equal to the probability of not having a long ORF at position 1 given no extra information. Equivalently, the probability of having a long ORF at position 1 given the extra information is less than or equal to the probability of having a long ORF at position 1 given no extra information.

Problem 4b.i

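$$
\begin{aligned}
P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) ={}& P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots,\ \neg\text{start codon at 1}\right) P\left(\neg\text{start codon at 1}\right) \\
&+ P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots,\ \text{start codon at 1}\right) P\left(\text{start codon at 1}\right)
\end{aligned}
$$

$$
P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots,\ \neg\text{start codon at 1}\right) = P\left(O_1 \mid \neg\text{start codon at 1}\right) = 0
$$

$$
P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots,\ \text{start codon at 1}\right) \le P\left(O_1 \mid \text{start codon at 1}\right)
$$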

How can we show this? Let’s take this probability and split it into two cases: either there is a start codon at position 1 or there isn’t one. If there isn’t a start codon, then we obviously don’t have a long ORF at position 1, so the probability is 0 and isn’t influenced by the extra information. Alternatively, suppose there is a start codon at position 1. I claim that we are less likely to have a long ORF given that positions 4, 7, and so forth don’t have long ORFs, because it is likely that those events failed because of a stop codon along the way, and such a stop codon lies in the same reading frame as the ORF starting at position 1. Therefore, the probability of having a long ORF at position 1 given the extra information is at most the probability without the extra condition.

Problem 4b.i

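$$
\begin{aligned}
P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) ={}& P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots,\ \neg\text{start codon at 1}\right) P\left(\neg\text{start codon at 1}\right) \\
&+ P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots,\ \text{start codon at 1}\right) P\left(\text{start codon at 1}\right) \\
\ge{}& P\left(\overline{O}_1 \mid \neg\text{start codon at 1}\right) P\left(\neg\text{start codon at 1}\right) + P\left(\overline{O}_1 \mid \text{start codon at 1}\right) P\left(\text{start codon at 1}\right) \\
={}& P\left(\overline{O}_1\right)
\end{aligned}
$$

$$
P\left(O_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \le P\left(O_1\right)
$$

$$
P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \ge P\left(\overline{O}_1\right)
$$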

Now, returning to the previous equation, we can see that the probability of not having a long ORF at position 1, conditioned on not having long ORFs at positions 4, 7, etc., is at least as large as the probability of not having a long ORF at position 1 with no condition. This establishes my earlier claim.

Problem 4b.i

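$$
\begin{aligned}
P\left(\overline{O}_1 \cap \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) &= P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) P\left(\overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \\
&\ge P\left(\overline{O}_1\right) P\left(\overline{O}_4 \cap \overline{O}_7 \cap \cdots\right)
\end{aligned}
$$

$$\Downarrow$$

$$
P\left(\overline{O}_1 \cap \overline{O}_2 \cap \cdots \cap \overline{O}_{L-299}\right) \ge P\left(\overline{O}_1\right) P\left(\overline{O}_2\right) \cdots P\left(\overline{O}_{L-299}\right) = \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}
$$

using

$$
P\left(\overline{O}_1 \mid \overline{O}_4 \cap \overline{O}_7 \cap \cdots\right) \ge P\left(\overline{O}_1\right)
$$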

Now, given this inequality, we can rewrite our previous equation where we extracted O1 to become an inequality. We can continue this process for all the remaining positions. You may be wondering about inter-frame dependencies and how they affect this analysis, but don’t worry about it. We don’t expect you to have proved this part rigorously. For each particular event Oi, the probability of not having a long ORF there is one minus the probability of having a long ORF there, which is the probability of having a start codon followed by 99 non-stop codons. Since every event Oi has the same probability, we get the exponent (L-299).

Problem 4b.i

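$$
P(\text{no ORF} \ge 100 \text{ codons}) \ge \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}
$$

$$\Downarrow$$

$$
P(\text{predicting at least 1 gene}) \le 1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}
$$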

Finally, returning to our original question, which was the probability of falsely predicting at least one gene, it’s simply one minus the quantity we computed previously, giving us the upper bound in the problem statement.
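The bound is easy to evaluate for concrete region lengths. The sketch below wraps it in a hypothetical helper; the `min_codons` parameter and the clamp to zero for regions too short to hold such an ORF are illustrative additions, not part of the problem statement.

```python
def false_gene_bound(length_bp, min_codons=100):
    """Upper bound on P(at least one ORF of >= min_codons codons) in random sequence.

    Uses P(start) = 1/64 and P(non-stop codon) = 61/64 (uniform bases).
    """
    p_long_orf = (1 / 64) * (61 / 64) ** (min_codons - 1)
    # L - 299 candidate start positions when min_codons = 100.
    start_positions = max(0, length_bp - (3 * min_codons - 1))
    return 1 - (1 - p_long_orf) ** start_positions


for length in (300, 500, 1000, 5000):
    print(length, round(false_gene_bound(length), 4))
```

The bound grows with the region length because every extra base adds one more position where a long ORF could start.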

Problem 4b.ii

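$$
N_i = \begin{cases} 1, & \text{predict a false gene in region } i \\ 0, & \text{no false gene in region } i \end{cases}
$$

$$
E[\#\text{ false predicted genes}] = E\left[\sum_{i=1}^{2000} N_i\right] = 2000\,P(N_i = 1) \le 2000\left(1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{500-299}\right) \approx 53
$$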

For the next problem you were asked to estimate the number of falsely predicted genes if we have 2000 noncoding regions, all of length 500 bp. Let’s define the variable N(i) to be the number of falsely predicted genes in region i. In the problem statement we said you could ignore the possibility of multiple genes being predicted in the same region, so N(i) becomes an indicator variable. The expected number of falsely predicted genes is simply the expectation of the sum of these indicator variables, which, using the bound we showed earlier, is at most approximately 53.

Problem 4b.ii

$$
\text{sensitivity} = 95\% \quad \text{(given!)}
$$

$$
\text{specificity} = \frac{\text{true genes predicted correctly}}{\text{total genes predicted}} = \frac{95\% \times 6000}{95\% \times 6000 + 53} = 99.1\%
$$

The sensitivity of this method, or the number of true genes predicted correctly divided by the number of true genes, was given in the problem statement. It’s 95%. The specificity tells us how accurate our predictions were, and is at least 99.1%.
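The expected number of false predictions and the resulting specificity can be recomputed together. This is a sketch with hypothetical names; it uses the unrounded bound instead of the rounded value 53, so the intermediate numbers differ slightly in the decimals.

```python
def false_gene_bound(length_bp, min_codons=100):
    """Upper bound on P(at least one long ORF) in a random region."""
    p_long_orf = (1 / 64) * (61 / 64) ** (min_codons - 1)
    start_positions = max(0, length_bp - (3 * min_codons - 1))
    return 1 - (1 - p_long_orf) ** start_positions


regions, region_length = 2000, 500   # noncoding regions, each 500 bp
true_genes, sensitivity = 6000, 0.95  # values used in the specificity formula

expected_false = regions * false_gene_bound(region_length)
true_predicted = sensitivity * true_genes
specificity = true_predicted / (true_predicted + expected_false)
print(round(expected_false, 1), f"{specificity:.1%}")
```

Because the count of false genes is an upper bound, the specificity obtained this way is a lower bound.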

Problem 4b.iii

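intron (2,000 bp):

$$
1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{49}\right)^{2000-149} \approx 93.6\%
$$

intergenic region (50,000 bp):

$$
1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{49}\right)^{50{,}000-149} \approx 99.99999\%
$$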

For the last part of the problem you were asked to apply ORF scanning to the human genome, which has different characteristics that make the method fail. In particular, in a typical human intron of 2,000 bp, the bound on the probability of falsely predicting a gene is over 93%. And for a typical intergenic region of 50,000 bp the bound is essentially 100%, so we will almost certainly falsely predict a gene.
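The two human-genome figures follow from the same bound with the exponents that appear in the formulas (49 and L - 149), which correspond to a 50-codon threshold. A minimal sketch, with a hypothetical helper name:

```python
def false_gene_bound(length_bp, min_codons):
    """Upper bound on P(at least one ORF of >= min_codons codons) in random sequence."""
    p_long_orf = (1 / 64) * (61 / 64) ** (min_codons - 1)
    start_positions = max(0, length_bp - (3 * min_codons - 1))
    return 1 - (1 - p_long_orf) ** start_positions


# Threshold of 50 codons inferred from the exponents 49 and L - 149.
intron = false_gene_bound(2_000, min_codons=50)
intergenic = false_gene_bound(50_000, min_codons=50)
print(f"intron: {intron:.1%}, intergenic: {intergenic:.5%}")
```

With long noncoding stretches the number of candidate start positions dominates, which is why the bound saturates near 1 for intergenic regions.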
