CS262 Problem Set 4 - Stanford University

Post on 21-Sep-2020

CS262 Problem Set 4


Problem 1


Problem 1a

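Sequences: A, A, T, T
Parameters: {+2, -1, -1} (match, mismatch, gap)

Optimal alignments:

    Alignment 1    Alignment 2 (equivalently A-, A-, -T, -T)
    A              -A
    A              -A
    T              T-
    T              T-

                           Alignment 1        Alignment 2
    Sum-of-pairs scores:        0        >        -4
    Consensus scores:           2        <         4

Sum-of-pairs prefers alignment 1, while the consensus score prefers alignment 2.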


Problem 1b

Distinct sequences: AC, AG, TC, TG
Parameters: {+2, -1, -1}

Optimal alignments:

    Alignment 1    Alignment 2
    AC             -AC-
    AG             -A-G
    TC             T-C-
    TG             T--G

                           Alignment 1        Alignment 2
    Sum-of-pairs scores:        0        >        -8
    Consensus scores:           4        <         8

Equivalent forms of alignment 2 (rows in the order AC, AG, TC, TG):

    A-C-    -A-C    A--C
    A--G    -AG-    A-G-
    -TC-    T--C    -T-C
    -T-G    T-G-    -TG-
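The scores quoted for Problems 1a and 1b can be re-derived with a short script. This is a minimal sketch, not the course's reference code: it assumes the parameters are (match, mismatch, gap) = (+2, -1, -1), that a gap paired with a gap scores 0, and that the consensus score compares every row against the best single letter per column. The function names are illustrative.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 2, -1, -1  # parameters {+2, -1, -1} from the slides


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned symbols (gap vs gap scores 0: an assumption)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(rows: list[str]) -> int:
    """Add pair_score over every pair of rows in every column."""
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


def consensus_score(rows: list[str]) -> int:
    """Per column, pick the letter that scores best against all rows."""
    total = 0
    for column in zip(*rows):
        letters = set(column) - {"-"}
        total += max(
            (sum(pair_score(c, x) for x in column) for c in letters),
            default=0,
        )
    return total


problem_1a = (["A", "A", "T", "T"], ["-A", "-A", "T-", "T-"])
problem_1b = (["AC", "AG", "TC", "TG"], ["-AC-", "-A-G", "T-C-", "T--G"])

for ungapped, gapped in (problem_1a, problem_1b):
    print(sum_of_pairs(ungapped), sum_of_pairs(gapped),
          consensus_score(ungapped), consensus_score(gapped))
```

Under these assumptions the two scoring schemes disagree on which alignment is better in both problems, which is the point of the exercise.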


Problem 1b

Tree with leaves AG, AC, TC, TG and internal nodes S1, S2.

Optimal phylogenetic alignment: S1 = S2 = ATCG (or ATGC/TACG/TAGC)

Different to Sum-of-Pairs alignment.

Candidate assignments and their phylogenetic scores:

    S1 = S2:              TAGC    TACG    ATGC    ATCG    AC
    TG                    T-G-    T--G    -TG-    -T-G    TG
    TC                    T--C    T-C-    -T-C    -TC-    TC
    S1:                   TAGC    TACG    ATGC    ATCG    AC
    S2:                   TAGC    TACG    ATGC    ATCG    AC
    AC                    -A-C    -AC-    A--C    A-C-    AC
    AG                    -AG-    -A-G    A-G-    A--G    AG
    Phylogenetic score:   24      24      24      24      8


Problem 2


Problem 2a

Distance matrix S(u, v):

         y   z   w
    x    2   5   3
    y        3   1
    z            2

Sequences:

    x: TACCCGAT
    y: TAAACGAT
    z: AAAACGCG
    w: AAAACGAT


Problem 2a

Ultrametricity: for all a, b, c, if S(a,b) = min[S(a,b), S(a,c), S(b,c)], then S(a,b) ≤ S(a,c) = S(b,c).

Here S(x,y) = 2, S(x,z) = 5, S(y,z) = 3, so S(x,y) = min[S(x,y), S(x,z), S(y,z)], but S(x,z) = 5 ≠ 3 = S(y,z). The distances are therefore not ultrametric.
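The three-point condition can be checked mechanically over every triple. The sketch below is illustrative only (the function and variable names are not from the problem set): it sorts the three pairwise distances of each triple and flags the triple when the two largest differ.

```python
from itertools import combinations

# Pairwise distances S(u, v) from the matrix above.
S = {("x", "y"): 2, ("x", "z"): 5, ("x", "w"): 3,
     ("y", "z"): 3, ("y", "w"): 1, ("z", "w"): 2}


def dist(a: str, b: str) -> int:
    """Symmetric lookup into S."""
    return S[(a, b)] if (a, b) in S else S[(b, a)]


def ultrametric_violations(names: list[str]) -> list[tuple[str, str, str]]:
    """Return every triple whose two largest distances are not equal."""
    bad = []
    for a, b, c in combinations(names, 3):
        _, middle, largest = sorted([dist(a, b), dist(a, c), dist(b, c)])
        if middle != largest:
            bad.append((a, b, c))
    return bad


violations = ultrametric_violations(["x", "y", "z", "w"])
print(violations)
```

A single violating triple, such as (x, y, z), is enough to show the matrix is not ultrametric.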


Problem 2a

UPGMA

Step 1: the smallest entry is S(y, w) = 1, so y and w are merged; each branch has length 1/2.

         y   z   w
    x    2   5   3
    y        3   1
    z            2

Step 2: distances to the cluster yw are averages over its members.

         z   yw
    x    5   ½(2+3) = 5/2
    z        ½(3+2) = 5/2

The two smallest entries tie at 5/2. The slides merge x with yw, at height 5/4.

Step 3: only z and the cluster xyw remain.

         xyw
    z    (1/3)(5+3+2) = 10/3

z joins xyw at height 5/3.

Resulting tree: (((y, w), x), z), with node heights 1/2, 5/4 and 5/3.
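The merge sequence above can be reproduced with a small UPGMA sketch. It is a minimal illustration rather than a general-purpose implementation: cluster labels are built by string concatenation, distances live in a dict keyed by label pairs, and ties are broken by taking the first minimum in dict order, which happens to match the choice made on the slides.

```python
def upgma(sizes, dist):
    """sizes: {cluster: leaf count}; dist: {(a, b): distance}.

    Returns the merges as (a, b, height) in the order they happen.
    """
    sizes, dist = dict(sizes), dict(dist)
    merges = []
    while len(sizes) > 1:
        # Closest pair; min() keeps the first minimum, so ties follow dict order.
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        new = a + b
        merges.append((a, b, d_ab / 2))
        # Keep distances between clusters that are not being merged.
        new_dist = {pair: d for pair, d in dist.items()
                    if a not in pair and b not in pair}
        # Size-weighted average distance from every other cluster to the new one.
        for c in sizes:
            if c in (a, b):
                continue
            d_ac = dist.get((a, c), dist.get((c, a)))
            d_bc = dist.get((b, c), dist.get((c, b)))
            new_dist[(c, new)] = (
                (sizes[a] * d_ac + sizes[b] * d_bc) / (sizes[a] + sizes[b])
            )
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        dist = new_dist
    return merges


merges = upgma(
    {"x": 1, "y": 1, "z": 1, "w": 1},
    {("x", "y"): 2, ("x", "z"): 5, ("x", "w"): 3,
     ("y", "z"): 3, ("y", "w"): 1, ("z", "w"): 2},
)
for left, right, height in merges:
    print(left, right, height)
```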


Problem 2b

Tree 1: (((y, w), x), z), with leaves drawn in the order y, w, x, z.

    x: TACCCGAT
    y: TAAACGAT
    z: AAAACGCG
    w: AAAACGAT

Parsimony sets at the internal nodes (leaf states in the order y, w, x, z):

    Leaf states    (y,w)    ((y,w),x)    root
    T A T A        {A,T}    {T}          {A,T}
    A A C A        {A}      {A,C}        {A}
    A A A C        {A}      {A}          {A,C}

Site 1 (T A T A) costs 2 mutations. Sites 3 and 4 (A A C A) cost 1 each. Sites 7 and 8 cost 1 each (site 8 has states T T T G, the same pattern as site 7).

Total cost = 6 mutations


Problem 2b

Tree 2: ((x, y), (z, w)), with leaves drawn in the order x, y, z, w.

    x: TACCCGAT
    y: TAAACGAT
    z: AAAACGCG
    w: AAAACGAT

Parsimony sets at the internal nodes (leaf states in the order x, y, z, w):

    Leaf states    (x,y)    root     (z,w)
    T T A A        {T}      {A,T}    {A}
    C A A A        {A,C}    {A}      {A}
    A A C A        {A}      {A}      {A,C}

Site 1 (T T A A) costs 1 mutation. Sites 3 and 4 (C A A A) cost 1 each. Sites 7 and 8 cost 1 each (site 8 has states T T G T, the same pattern as site 7).

Total cost = 5 mutations
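Both totals can be checked with a short Fitch-style recursion. This is a sketch under the assumption that the two trees are (((y, w), x), z) and ((x, y), (z, w)), as read off the slides; trees are written as nested tuples and the helper names are illustrative.

```python
SEQS = {"x": "TACCCGAT", "y": "TAAACGAT", "z": "AAAACGCG", "w": "AAAACGAT"}


def fitch(tree, site):
    """Return (state set, mutation count) at one site.

    A tree is either a leaf name or a (left, right) tuple.
    """
    if isinstance(tree, str):
        return {SEQS[tree][site]}, 0
    left_set, left_cost = fitch(tree[0], site)
    right_set, right_cost = fitch(tree[1], site)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra mutation.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree):
    """Sum the per-site mutation counts over all alignment columns."""
    return sum(fitch(tree, site)[1] for site in range(len(SEQS["x"])))


upgma_tree = ((("y", "w"), "x"), "z")
other_tree = (("x", "y"), ("z", "w"))
print(parsimony_cost(upgma_tree), parsimony_cost(other_tree))
```

The UPGMA topology is therefore not the most parsimonious one for these four sequences.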


Problem 2c

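Distance matrix D(i, j):

         2     3     4
    1    0.3   0.5   0.6
    2          0.6   0.5
    3                0.9

Tree (from the figure): leaves 1 and 3 attach to one internal node and leaves 2 and 4 to the other. Branch lengths: 0.1 to leaf 1, 0.1 on the internal edge, 0.1 to leaf 2, 0.4 to leaf 3, 0.4 to leaf 4.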


Problem 2c

Neighbor-joining

$$r(i) = \frac{1}{|L| - 2}\sum_{k} D(i,k), \qquad d(i,j) = D(i,j) - r(i) - r(j)$$

r(i):

    1    0.7
    2    0.7
    3    1.0
    4    1.0

d(i, j):

         2      3      4
    1   -1.1   -1.2   -1.1
    2          -1.1   -1.2
    3                 -1.1

The minimum of d(i, j) is -1.2, attained by the pairs (1, 3) and (2, 4), so neighbor-joining joins 1 with 3 and 2 with 4. This recovers the tree with branch lengths 0.1 (leaf 1), 0.1 (internal edge), 0.1 (leaf 2), 0.4 (leaf 3) and 0.4 (leaf 4).
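The r(i) and d(i, j) tables follow directly from the two formulas. The sketch below recomputes them for the first neighbor-joining step only; the rounding to 10 decimals is an implementation choice of this example to absorb floating-point noise, and the names are illustrative.

```python
D = {(1, 2): 0.3, (1, 3): 0.5, (1, 4): 0.6,
     (2, 3): 0.6, (2, 4): 0.5, (3, 4): 0.9}
LEAVES = [1, 2, 3, 4]


def dist(i, j):
    """Symmetric lookup into D."""
    return D[(i, j)] if (i, j) in D else D[(j, i)]


def r(i):
    """Sum of distances from leaf i, divided by |L| - 2."""
    return sum(dist(i, k) for k in LEAVES if k != i) / (len(LEAVES) - 2)


def corrected(i, j):
    """Neighbor-joining corrected distance d(i, j)."""
    return dist(i, j) - r(i) - r(j)


table = {pair: round(corrected(*pair), 10) for pair in D}
best = min(table.values())
winners = [pair for pair, value in table.items() if value == best]
print(table)
print(winners)
```

Note that the raw minimum of D is D(1, 2) = 0.3, so a method that joins the closest raw pair would pick (1, 2), while the corrected distances pick (1, 3) or (2, 4).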


Problem 3

(a)

$$d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$

For p = 0.05, d = 0.0517. For p = 0.2, d = 0.2326 > 4 × 0.0517 = 0.2068. Here d is the average number of substitutions per site, while p is the probability that a given site differs between the two genomes. Since multiple substitutions can occur at the same site, d ≥ p. For small p, the difference between p and d is small. As the evolutionary distance between the two genomes increases, i.e. as p increases, multiple substitutions at the same site become more likely, and the difference between p and d grows. This is intuitively why, when p increases 4 times, d increases more than 4 times.
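The two numerical values can be reproduced in a few lines. This is a minimal sketch of the formula in part (a); the function name and the range check on p are choices of this example.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """d = -(3/4) ln(1 - (4/3) p), defined for 0 <= p < 3/4."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.2):
    print(p, round(jukes_cantor_distance(p), 4))
```

The correction is nonlinear in p, so quadrupling p from 0.05 to 0.2 more than quadruples d.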

(b)

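$$
\begin{aligned}
r(t_{XY})\,r(t_{YZ}) + 3\,s(t_{XY})\,s(t_{YZ})
&= \frac{1}{4}\cdot\frac{1}{4}\left(1 + 3e^{-\mu t_{XY}}\right)\left(1 + 3e^{-\mu t_{YZ}}\right) + 3\cdot\frac{1}{4}\cdot\frac{1}{4}\left(1 - e^{-\mu t_{XY}}\right)\left(1 - e^{-\mu t_{YZ}}\right) \\
&= \frac{1}{16}\left(1 + 3e^{-\mu t_{XY}} + 3e^{-\mu t_{YZ}} + 9e^{-\mu(t_{XY}+t_{YZ})} + 3 - 3e^{-\mu t_{XY}} - 3e^{-\mu t_{YZ}} + 3e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= \frac{1}{16}\left(4 + 12e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= \frac{1}{4}\left(1 + 3e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= r(t_{XY} + t_{YZ})
\end{aligned}
$$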

(c)

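Using the fact that r = 1 − p and r + 3s = 1 (so s = p/3):

$$
\begin{aligned}
1 - p_{XZ} &= (1 - p_{XY})(1 - p_{YZ}) + 3\,\frac{p_{XY}}{3}\,\frac{p_{YZ}}{3} \\
&= 1 - p_{XY} - p_{YZ} + p_{XY}\,p_{YZ} + \frac{p_{XY}\,p_{YZ}}{3} \\
&= 1 - \left(p_{XY} + p_{YZ} - \frac{4}{3}\,p_{XY}\,p_{YZ}\right)
\end{aligned}
$$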

(d)

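$$
\begin{aligned}
d_{XY} + d_{YZ}
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XY}\right) - \frac{3}{4}\ln\left(1 - \frac{4}{3}p_{YZ}\right) \\
&= -\frac{3}{4}\ln\left[\left(1 - \frac{4}{3}p_{XY}\right)\left(1 - \frac{4}{3}p_{YZ}\right)\right] \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XY} - \frac{4}{3}p_{YZ} + \frac{4}{3}p_{XY}\cdot\frac{4}{3}p_{YZ}\right) \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}\left(p_{XY} + p_{YZ} - \frac{4}{3}\,p_{XY}\,p_{YZ}\right)\right) \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XZ}\right) = d_{XZ}
\end{aligned}
$$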
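The additivity shown in part (d) can be spot-checked numerically. The sketch below combines the composition rule from part (c) with the distance formula from part (a); the function names and the sample values of p are illustrative, not part of the problem set.

```python
import math


def jc_distance(p: float) -> float:
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def compose(p_xy: float, p_yz: float) -> float:
    """p_XZ from part (c): p_XY + p_YZ - (4/3) p_XY p_YZ."""
    return p_xy + p_yz - 4 / 3 * p_xy * p_yz


p_xy, p_yz = 0.05, 0.2
p_xz = compose(p_xy, p_yz)
print(p_xz)
print(jc_distance(p_xy) + jc_distance(p_yz), jc_distance(p_xz))
```

The observed difference p is not additive along the path X, Y, Z (p_XZ is less than p_XY + p_YZ), but the corrected distance d is.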


Problem 4a.i

[Figure: an ORF of n codons, running from the start codon up to the stop codon]

$$
P(L = n) =
\begin{cases}
0, & n = 0 \\
(1 - s)^{n-1}\,s, & n > 0
\end{cases}
$$

$$s = P(\mathrm{TAA} \cup \mathrm{TAG} \cup \mathrm{TGA}) = P(\mathrm{TAA}) + P(\mathrm{TAG}) + P(\mathrm{TGA}) = \frac{1}{64} + \frac{1}{64} + \frac{1}{64} = \frac{3}{64} = 0.046875$$

For the first part of problem 4, you were asked to derive some ORF length distribution statistics and do some simple calculations with them. Suppose we have a sequence of completely random nucleotides and we select a random ORF from it. We asked you to show what the probability distribution of the ORF length is, where the length is the number of codons including the start codon but excluding the stop codon. This probability is clearly 0 for length 0 (since we need at least 1 codon for an ORF). For any non-zero length n, the probability is the product of the probability that the (n-1) codons after the start codon are all non-stop codons and the probability that the next codon is a stop codon. The variable s is the probability that a random codon is a stop codon.

Problem 4a.ii

$$
\begin{aligned}
P(L \ge 100) &= \sum_{n=100}^{\infty} P(L = n) = 1 - \sum_{n=1}^{99} P(L = n) \\
&= 1 - \sum_{n=1}^{99} (1 - s)^{n-1}\,s = 1 - s\sum_{k=0}^{98} (1 - s)^{k} \\
&= 1 - s\,\frac{1 - (1 - s)^{99}}{1 - (1 - s)} = (1 - s)^{99} \approx 0.86265\%
\end{aligned}
$$

Then, we asked you to compute the probability that the length of the random ORF is at least 100 codons. This is the sum of our previous probability distribution over lengths 100 to infinity, which is the same as 1 minus the sum of the distribution from 1 to 99. This is a geometric sum and can be computed as shown in this slide. As you can see, in a random sequence, we’d expect to see an ORF of at least 100 codons very rarely.
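The closed form can be compared against the explicit sum it replaces. This is a small sketch with illustrative function names, assuming the uniform-base value s = 3/64 from part (i).

```python
def orf_length_pmf(n: int, s: float) -> float:
    """P(L = n): n - 1 non-stop codons followed by one stop codon."""
    return 0.0 if n == 0 else (1 - s) ** (n - 1) * s


def prob_at_least(m: int, s: float) -> float:
    """Closed form for P(L >= m) with m >= 1: the geometric tail (1 - s)^(m - 1)."""
    return (1 - s) ** (m - 1)


s = 3 / 64
explicit = 1 - sum(orf_length_pmf(n, s) for n in range(1, 100))
closed = prob_at_least(100, s)
print(f"{explicit:.7f} {closed:.7f}")
```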

Problem 4a.iii

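[Figure: an ORF of n codons, running from the start codon up to the stop codon]

$$
P(L = n) =
\begin{cases}
0, & n = 0 \\
(1 - s)^{n-1}\,s, & n > 0
\end{cases}
$$

$$
\begin{aligned}
s &= P(\mathrm{TAA} \cup \mathrm{TAG} \cup \mathrm{TGA}) = P(\mathrm{TAA}) + P(\mathrm{TAG}) + P(\mathrm{TGA}) \\
&= (0.2)(0.2)(0.2) + (0.2)(0.2)(0.3) + (0.2)(0.3)(0.2) = 0.032
\end{aligned}
$$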

For problem 4(a)iii, we essentially have the same distribution, except that the probability of having a particular codon be a stop codon, given by s, is slightly lower.

Problem 4a.iii

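$$
\begin{aligned}
P(L \ge 100) &= \sum_{n=100}^{\infty} P(L = n) = 1 - \sum_{n=1}^{99} P(L = n) \\
&= 1 - \sum_{n=1}^{99} (1 - s)^{n-1}\,s = 1 - s\sum_{k=0}^{98} (1 - s)^{k} \\
&= 1 - s\,\frac{1 - (1 - s)^{99}}{1 - (1 - s)} = (1 - s)^{99} \approx 3.99632\%
\end{aligned}
$$

With this change, the probability that a random ORF is 100 codons or longer increases to almost 4%.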
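The effect of base composition on s can be sketched as follows. The frequencies A = T = 0.2 and G = 0.3 are read off the product terms in the slide; C = 0.3 is an assumption of this example so that the four frequencies sum to 1 (C does not appear in any stop codon, so it does not affect s). Names are illustrative.

```python
from math import prod

STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_probability(freq: dict[str, float]) -> float:
    """Probability that a codon of independent random bases is a stop codon."""
    return sum(prod(freq[base] for base in codon) for codon in STOP_CODONS)


uniform = dict.fromkeys("ACGT", 0.25)
skewed = {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3}  # C = 0.3 is assumed

for name, freq in (("uniform", uniform), ("skewed", skewed)):
    s = stop_probability(freq)
    print(name, round(s, 6), round((1 - s) ** 99, 6))
```

A lower stop-codon probability makes long random ORFs noticeably more common, which is why the tail probability rises from under 1% to almost 4%.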

Problem 4b.i

$$P(\text{predicting at least 1 gene}) = P(\text{at least 1 ORF} \ge 100 \text{ codons}) = 1 - P(\text{no ORF} \ge 100 \text{ codons})$$

Let $O_i$ = the event that there exists an ORF of ≥ 100 codons starting at position i.

$$
\begin{aligned}
P(\text{no ORF} \ge 100 \text{ codons})
&= P\left(\bar O_1 \cap \bar O_2 \cap \cdots \cap \bar O_{L-299}\right) \\
&= P\left(\left(\bar O_1 \cap \bar O_4 \cap \bar O_7 \cap \cdots\right) \cap \left(\bar O_2 \cap \bar O_5 \cap \bar O_8 \cap \cdots\right) \cap \left(\bar O_3 \cap \bar O_6 \cap \bar O_9 \cap \cdots\right)\right)
\end{aligned}
$$

For problem 4(b) you were asked to prove an upper bound on the probability of falsely predicting a gene in a noncoding region with random bases. To compute this probability, we first notice that it is 1 minus the probability that there are no falsely predicted genes, i.e. that there are no ORFs of length at least 100 codons. Now, let’s define $O_i$ to be the event that there exists an ORF starting at position i that is at least 100 codons long. In other words, there is a start codon at position i, and for 99 codons thereafter there are no stop codons. “O” with a bar on top of it is the complement of the event. Next, let’s compute the probability that there are no ORFs of at least 100 codons. This is equal to the probability that there is no such ORF at position 1, nor position 2, and so forth until position (L-299), after which we could not possibly fit an ORF of length 100 codons. Let’s rearrange the terms within the probability so that we put the terms within the same frame together, such as $O_1$, $O_4$, $O_7$, etc.

Problem 4b.i

$$P\left(\bar O_1 \cap \bar O_4 \cap \bar O_7 \cap \cdots\right) = P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right)\,P\left(\bar O_4 \cap \bar O_7 \cap \cdots\right)$$

Claim:

$$P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right) \ge P\left(\bar O_1\right)$$

$$P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right) \le P\left(O_1\right)$$

Next, let’s split up this probability using the definition of conditional probability (the chain rule). We can pull out $\bar O_1$ as shown in this equation. Why would we want to do this? Well, my claim is that the probability of not having a long ORF at position 1 given that we know we don’t have long ORFs at positions 4, 7, etc. is greater than or equal to the probability of not having a long ORF at position 1 given no extra information. Equivalently, the probability of having a long ORF at position 1 given the extra information is less than or equal to the probability of having a long ORF at position 1 given no extra information.

Problem 4b.i

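$$
\begin{aligned}
P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right)
&= P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots,\ \text{start codon at 1}\right)\,P(\text{start codon at 1}) \\
&\quad + P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots,\ \neg\,\text{start codon at 1}\right)\,P(\neg\,\text{start codon at 1})
\end{aligned}
$$

$$P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots,\ \neg\,\text{start codon at 1}\right) = 0 = P\left(O_1 \mid \neg\,\text{start codon at 1}\right)$$

$$P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots,\ \text{start codon at 1}\right) \le P\left(O_1 \mid \text{start codon at 1}\right)$$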

How can we show this? Well, let’s take this probability and split it up into two possibilities: either there is a start codon at position one or there isn’t one. Suppose there isn’t a start codon, well then we obviously don’t have a long ORF at position 1 so therefore the probability is 0 and isn’t influenced by the extra information. Alternatively, suppose there is a start codon at position 1. I claim that we are much less likely to have a long ORF given that positions 4, 7, and so forth don’t have long ORFs, because it is likely that those events will have failed because of a stop codon along the way. Therefore, the probability of having a long ORF at position 1 given the extra information is less than the probability without the extra condition.

Problem 4b.i

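$$
\begin{aligned}
P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right)
&= P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots,\ \text{start codon at 1}\right)\,P(\text{start codon at 1}) \\
&\quad + P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots,\ \neg\,\text{start codon at 1}\right)\,P(\neg\,\text{start codon at 1}) \\
&\ge P\left(\bar O_1 \mid \text{start codon at 1}\right)\,P(\text{start codon at 1}) + P\left(\bar O_1 \mid \neg\,\text{start codon at 1}\right)\,P(\neg\,\text{start codon at 1}) \\
&= P\left(\bar O_1\right)
\end{aligned}
$$

$$P\left(O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right) \le P\left(O_1\right)$$

$$P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right) \ge P\left(\bar O_1\right)$$

Now, returning to the previous equation, we can see that the probability of not having a long ORF at position 1, conditioned on not having long ORFs at positions 4, 7, etc., is at least as large as the probability of not having a long ORF at position 1 with no condition. This establishes my earlier claim.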

Problem 4b.i

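$$P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right) \ge P\left(\bar O_1\right)$$

$$
\begin{aligned}
P\left(\bar O_1 \cap \bar O_4 \cap \bar O_7 \cap \cdots\right)
&= P\left(\bar O_1 \mid \bar O_4 \cap \bar O_7 \cap \cdots\right)\,P\left(\bar O_4 \cap \bar O_7 \cap \cdots\right) \\
&\ge P\left(\bar O_1\right)\,P\left(\bar O_4 \cap \bar O_7 \cap \cdots\right)
\end{aligned}
$$

$$P\left(\bar O_1 \cap \bar O_2 \cap \cdots \cap \bar O_{L-299}\right) \ge P\left(\bar O_1\right)\,P\left(\bar O_2\right) \cdots P\left(\bar O_{L-299}\right) = \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}$$

Now, given this inequality, we can rewrite our previous equation where we extracted $\bar O_1$ to become an inequality. We can continue this process for all the remaining positions. You may be wondering about inter-frame dependencies and how that affects this analysis, but don’t worry about it. We don’t expect you to have proved this part rigorously. For each particular event $O_i$, the probability of not having a long ORF there is one minus the probability of having a long ORF there, which is the probability of having a start codon followed by 99 non-stop codons. Since each of the events $O_i$ has the same probability, we get the exponent (L-299).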

Problem 4b.i

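$$P(\text{no ORF} \ge 100 \text{ codons}) \ge \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}$$

$$P(\text{predicting at least 1 gene}) \le 1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}$$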

Finally, returning to our original question, which was the probability of falsely predicting at least one gene, it’s simply one minus the quantity we computed previously, giving us the upper bound in the problem statement.
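The bound is easy to evaluate for a given region length. The following is a minimal sketch: the function name and parameters are illustrative, and the start-codon probability 1/64 and stop-codon probability 3/64 are the uniform-base values used in the derivation.

```python
def false_gene_bound(length_bp: int, min_codons: int = 100,
                     s: float = 3 / 64, p_start: float = 1 / 64) -> float:
    """Upper bound on P(at least one ORF of >= min_codons in a random region)."""
    # Start codon followed by (min_codons - 1) non-stop codons.
    p_long_orf = p_start * (1 - s) ** (min_codons - 1)
    # Number of start positions that leave room for a full-length ORF
    # (L - 299 for a 100-codon threshold).
    n_starts = length_bp - (3 * min_codons - 1)
    if n_starts <= 0:
        return 0.0
    return 1 - (1 - p_long_orf) ** n_starts


for length in (300, 500, 1000, 5000):
    print(length, round(false_gene_bound(length), 4))
```

For a 500 bp region the bound is a little under 3%, and it grows quickly with the region length.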

Problem 4b.ii

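$$
N_i =
\begin{cases}
1, & \text{predict a false gene in region } i \\
0, & \text{no false gene in region } i
\end{cases}
$$

$$
\begin{aligned}
E[\#\text{ false predicted genes}] &= E\left[\sum_{i=1}^{2000} N_i\right] = 2000\,P(N_i = 1) \\
&\le 2000\left(1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{500-299}\right) \approx 53
\end{aligned}
$$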

For the next problem you were asked to estimate the number of falsely predicted genes if we have 2000 noncoding regions, all of length 500 bp. Let’s define the variable N(i) to be the number of falsely predicted genes in region i. In the problem statement we said you could ignore the possibility of multiple genes being predicted in the same region, so N(i) becomes an indicator variable. The expectation of the number of falsely predicted genes is simply the expectation of the sum of these indicator variables, which can be calculated using the bound we showed earlier to be approximately 53.

Problem 4b.ii

$$\text{sensitivity} = 95\% \quad \text{(given!)}$$

$$\text{specificity} = \frac{\text{true genes predicted correctly}}{\text{total genes predicted}} = \frac{95\% \times 6000}{95\% \times 6000 + 53} = 99.1\%$$

The sensitivity of this method, or the number of true genes predicted correctly divided by the number of true genes, was given in the problem statement. It’s 95%. The specificity tells us how accurate our predictions were, and is at least 99.1%.
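The arithmetic behind the two figures can be laid out step by step. This sketch uses the numbers that appear in the slides (2000 noncoding regions of 500 bp, 6000 true genes, 95% sensitivity); the variable names are illustrative.

```python
regions, region_bp = 2000, 500
true_genes, sensitivity = 6000, 0.95

# Upper bound on the chance of a false gene in one 500 bp region.
p_false = 1 - (1 - (1 / 64) * (61 / 64) ** 99) ** (region_bp - 299)

# Expected number of false predictions (linearity of expectation).
expected_false = regions * p_false

# Specificity as defined in the slides: correct predictions / all predictions.
true_found = sensitivity * true_genes
specificity = true_found / (true_found + expected_false)

print(round(expected_false, 1), round(100 * specificity, 1))
```

Because the false-positive count is an upper bound, the resulting specificity is a lower bound.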

Problem 4b.iii

Intron (2,000 bp):

$$1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{49}\right)^{2000-149} \approx 93.6\%$$

Intergenic region (50,000 bp):

$$1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{49}\right)^{50{,}000-149} \approx 99.99999\%$$

For the last part of the problem you were asked to apply ORF scanning to the human genome, which has different characteristics that make the method fail. The bounds above use a minimum ORF length of 50 codons, hence the exponents 49 and L - 149. In particular, in a typical human intron of 2,000 bp, we expect to falsely predict a gene over 93% of the time. And for a typical intergenic region of 50,000 bp, we will almost certainly falsely predict a gene.
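For long regions the quantity (1 - q)^N becomes extremely small, so it can help to evaluate the bound in log space. The sketch below does this with `math.log1p` and `math.expm1`; the 50-codon threshold is inferred from the exponents 49 and L - 149 in the slide, and the function name is illustrative.

```python
import math


def false_gene_bound_log(length_bp: int, min_codons: int = 50,
                         s: float = 3 / 64, p_start: float = 1 / 64) -> float:
    """1 - (1 - q)^N computed as -expm1(N * log1p(-q)) for numerical stability."""
    q = p_start * (1 - s) ** (min_codons - 1)  # long ORF at one position
    n_starts = length_bp - (3 * min_codons - 1)
    if n_starts <= 0:
        return 0.0
    return -math.expm1(n_starts * math.log1p(-q))


intron = false_gene_bound_log(2_000)
intergenic = false_gene_bound_log(50_000)
print(f"{intron:.3f}")
print(1 - intergenic)  # probability of no false prediction, vanishingly small
```

Printing the complement for the intergenic case shows how close to certainty the false prediction is, which a rounded percentage hides.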