CS262 Problem Set 4, Stanford University
Problem 1
Problem 1a
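Sequences: A, A, T, T
Parameters: {+2, -1, -1} (match +2, mismatch -1, gap -1)

Candidate alignments:

  Alignment 1    Alignment 2    (equivalent form of Alignment 2)
  A              -A             A-
  A              -A             A-
  T              T-             -T
  T              T-             -T

Optimal alignments: Alignment 1 vs. Alignment 2
Sum-of-pairs scores:  0 > -4  (Alignment 1 is optimal)
Consensus scores:     2 <  4  (Alignment 2 is optimal)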
Problem 1b
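Distinct sequences: AC, AG, TC, TG
Parameters: {+2, -1, -1}

Candidate alignments:

  Ungapped    Gapped
  AC          -AC-
  AG          -A-G
  TC          T-C-
  TG          T--G

Optimal alignments: ungapped vs. gapped
Sum-of-pairs scores:  0 > -8  (the ungapped alignment is optimal)
Consensus scores:     4 <  8  (the gapped alignment is optimal)

Equivalent gapped alignments (same scores):

  A-C-    -A-C    A--C
  A--G    -AG-    A-G-
  -TC-    T--C    -T-C
  -T-G    T-G-    -TG-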
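The sum-of-pairs and consensus scores quoted in Problems 1a and 1b can be reproduced with a short sketch. Two conventions below are assumptions, since the slides do not spell them out: a gap aligned with a gap scores 0, and the consensus character of a column is whichever character present in that column scores best against all rows. The function names are illustrative only.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 2, -1, -1  # parameters {+2, -1, -1}


def pair_score(a, b):
    """Score one pair of aligned characters (assumption: gap vs. gap scores 0)."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(rows):
    """Sum pair_score over every column and every unordered pair of rows."""
    return sum(pair_score(a, b)
               for col in zip(*rows)
               for a, b in combinations(col, 2))


def consensus_score(rows):
    """Per column, pick the character from that column that scores best against all rows."""
    return sum(max(sum(pair_score(c, x) for x in col) for c in set(col))
               for col in zip(*rows))


# Problem 1a
a_ungapped = ["A", "A", "T", "T"]
a_gapped = ["-A", "-A", "T-", "T-"]
# Problem 1b
b_ungapped = ["AC", "AG", "TC", "TG"]
b_gapped = ["-AC-", "-A-G", "T-C-", "T--G"]

for name, rows in [("1a ungapped", a_ungapped), ("1a gapped", a_gapped),
                   ("1b ungapped", b_ungapped), ("1b gapped", b_gapped)]:
    print(name, sum_of_pairs(rows), consensus_score(rows))
```

Under these conventions the two objectives disagree in both problems: sum-of-pairs prefers the ungapped alignment, while the consensus score prefers the gapped one.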
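Phylogenetic alignment:

Tree: leaves AG and AC join at internal node S1, leaves TC and TG join at internal node S2, and S1 is connected to S2.

Optimal phylogenetic alignment: S1 = S2 = ATCG (or ATGC/TACG/TAGC).
This differs from the sum-of-pairs alignment.

Candidate alignments and phylogenetic scores:

  S1 = S2:              TAGC    TACG    ATGC    ATCG    AC (ungapped)
  AC                    -A-C    -AC-    A--C    A-C-    AC
  AG                    -AG-    -A-G    A-G-    A--G    AG
  TC                    T--C    T-C-    -T-C    -TC-    TC
  TG                    T-G-    T--G    -TG-    -T-G    TG
  Phylogenetic score:   24      24      24      24      8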
Problem 2
Problem 2a
Sequences:

  x: TACCCGAT
  y: TAAACGAT
  z: AAAACGCG
  w: AAAACGAT

Distance matrix S(u, v) (number of positions at which u and v differ):

       y   z   w
  x    2   5   3
  y        3   1
  z            2
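The entries of S(u, v) match the number of mismatching positions between each pair of sequences. A minimal sketch that recomputes the matrix (the names `seqs`, `hamming`, and `S` are illustrative, not from the slides):

```python
from itertools import combinations

seqs = {
    "x": "TACCCGAT",
    "y": "TAAACGAT",
    "z": "AAAACGCG",
    "w": "AAAACGAT",
}


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


# Upper-triangular distance matrix keyed by (u, v) in the order x, y, z, w.
S = {(u, v): hamming(seqs[u], seqs[v]) for u, v in combinations(seqs, 2)}

for pair, dist in S.items():
    print(pair, dist)
```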
Ultrametricity: for all a, b, c, if S(a,b) = min[S(a,b), S(a,c), S(b,c)], then S(a,b) ≤ S(a,c) = S(b,c).

Here S(x,y) = 2, S(x,z) = 5, S(y,z) = 3, so S(x,y) = min[S(x,y), S(x,z), S(y,z)], but S(x,z) = 5 ≠ 3 = S(y,z).
⇒ The distances are not ultrametric.
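The three-point condition can be checked mechanically: in every triple, the two largest distances must be equal. A small sketch, with a hypothetical helper `is_ultrametric` that returns the first violating triple:

```python
from itertools import combinations

S = {("x", "y"): 2, ("x", "z"): 5, ("x", "w"): 3,
     ("y", "z"): 3, ("y", "w"): 1, ("z", "w"): 2}


def is_ultrametric(names, dist):
    """Return (True, None) if every triple passes, else (False, offending triple)."""
    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    for a, b, c in combinations(names, 3):
        smallest, middle, largest = sorted([d(a, b), d(a, c), d(b, c)])
        if middle != largest:  # the two largest distances must coincide
            return False, (a, b, c)
    return True, None


ok, witness = is_ultrametric(["x", "y", "z", "w"], S)
print(ok, witness)
```

The first triple tested, (x, y, z), already fails, which is the counterexample used in the solution.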
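UPGMA

Step 1: the smallest distance is S(y, w) = 1, so y and w are merged at height 1/2.

Updated S(u, v):

       z   yw
  x    5   ½(2+3) = 5/2
  z        ½(3+2) = 5/2

Step 2: x and z are tied at distance 5/2 from yw; the solution merges x with yw at height 5/4.

Updated S(u, v):

       xyw
  z    (1/3)(5+3+2) = 10/3

Step 3: z is merged with xyw at height 5/3.

Resulting tree: (((y, w), x), z), with node heights 1/2, 5/4, and 5/3.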
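The three UPGMA steps can be traced with a compact sketch. It is not a general-purpose implementation: clusters are plain tuples of leaf names, and the tie between x and z in the second step is broken simply by iteration order (an assumption that happens to match the choice made in the solution).

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges as (cluster_a, cluster_b, node_height), in order."""
    clusters = [(n,) for n in names]
    # d maps an unordered pair of clusters to their average leaf-to-leaf distance.
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items()}
    merges = []
    while len(clusters) > 1:
        # min() keeps the first minimum, so ties are broken by cluster order.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        merged = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        for c in clusters:
            # Size-weighted average = mean over all leaf pairs between the clusters.
            d[frozenset((merged, c))] = (
                len(a) * d[frozenset((a, c))] + len(b) * d[frozenset((b, c))]
            ) / len(merged)
        clusters.append(merged)
        merges.append((a, b, height))
    return merges


S = {("x", "y"): 2, ("x", "z"): 5, ("x", "w"): 3,
     ("y", "z"): 3, ("y", "w"): 1, ("z", "w"): 2}
merges = upgma(["x", "y", "z", "w"], S)
for a, b, h in merges:
    print(a, b, h)
```

With a different tie-breaking rule the second merge could join z with yw instead, so the topology is not uniquely determined by these distances.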
Problem 2b
Sequences:

  x: TACCCGAT
  y: TAAACGAT
  z: AAAACGCG
  w: AAAACGAT

Tree 1 (the UPGMA topology): (((y, w), x), z)

  Site(s)   y w x z   (y,w)    ((y,w),x)   root     Cost
  1         T A T A   {A,T}    {T}         {A,T}    2
  3, 4      A A C A   {A}      {A,C}       {A}      1 each
  7         A A A C   {A}      {A}         {A,C}    1

Site 8 (T T T G) behaves like site 7 and also costs 1; sites 2, 5, and 6 are constant.

Total cost = 6 mutations

Tree 2: ((x, y), (z, w))

  Site(s)   x y z w   (x,y)    root     (z,w)    Cost
  1         T T A A   {T}      {A,T}    {A}      1
  3, 4      C A A A   {A,C}    {A}      {A}      1 each
  7         A A C A   {A}      {A}      {A,C}    1

Site 8 (T T G T) behaves like site 7 and also costs 1.

Total cost = 5 mutations
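The two totals can be checked with a short Fitch-style parsimony count. This is a sketch under the assumption that the first tree is the UPGMA topology (((y, w), x), z) and the second is ((x, y), (z, w)); trees are written as nested tuples, and the function names are illustrative.

```python
seqs = {
    "x": "TACCCGAT",
    "y": "TAAACGAT",
    "z": "AAAACGCG",
    "w": "AAAACGAT",
}


def fitch_site(node, site):
    """Return (candidate state set, mutation count) for one site in the subtree at node."""
    if isinstance(node, str):  # leaf: the observed base
        return {seqs[node][site]}, 0
    (left, left_cost), (right, right_cost) = (fitch_site(child, site) for child in node)
    common = left & right
    if common:  # children agree on at least one state: no extra mutation
        return common, left_cost + right_cost
    return left | right, left_cost + right_cost + 1  # disagreement costs one mutation


def parsimony_cost(tree):
    """Sum the per-site Fitch costs over all alignment columns."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


upgma_tree = ((("y", "w"), "x"), "z")
other_tree = (("x", "y"), ("z", "w"))
print(parsimony_cost(upgma_tree), parsimony_cost(other_tree))
```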
Problem 2c
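Distance matrix D(i,j):

       2     3     4
  1    0.3   0.5   0.6
  2          0.6   0.5
  3                0.9

Tree: leaves 1 and 3 join at one internal node and leaves 2 and 4 at the other. Edge lengths: 0.1 for leaves 1 and 2, 0.4 for leaves 3 and 4, and 0.1 for the internal edge.

Neighbor-joining:

\[ r(i) = \frac{\sum_{k} D(i,k)}{|L| - 2}, \qquad d(i,j) = D(i,j) - r(i) - r(j) \]

r(i):

  1   0.7
  2   0.7
  3   1.0
  4   1.0

d(i,j):

       2      3      4
  1   -1.1   -1.2   -1.1
  2          -1.1   -1.2
  3                 -1.1

The minimum of d(i,j) is -1.2, attained by the pairs (1,3) and (2,4), so neighbor-joining joins 1 with 3 and 2 with 4 and recovers the tree above.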
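The r(i) and d(i,j) tables can be recomputed directly from D. The sketch below uses exact fractions so that the tie between (1,3) and (2,4) is not disturbed by floating-point rounding. The branch-length formula at the end is the standard neighbor-joining one, (D(i,j) + r(i) - r(j)) / 2; it is not written on the slide and is included here as an assumption.

```python
from fractions import Fraction
from itertools import combinations

D = {(1, 2): Fraction("0.3"), (1, 3): Fraction("0.5"), (1, 4): Fraction("0.6"),
     (2, 3): Fraction("0.6"), (2, 4): Fraction("0.5"), (3, 4): Fraction("0.9")}
leaves = [1, 2, 3, 4]


def dist(i, j):
    """Symmetric lookup into the upper-triangular matrix D."""
    return D[(i, j)] if (i, j) in D else D[(j, i)]


# r(i): sum of distances from i to every other leaf, divided by |L| - 2.
r = {i: sum(dist(i, k) for k in leaves if k != i) / (len(leaves) - 2) for i in leaves}

# d(i,j) = D(i,j) - r(i) - r(j)
d = {(i, j): dist(i, j) - r[i] - r[j] for i, j in combinations(leaves, 2)}

# First pair attaining the minimum; (2, 4) ties with it.
i, j = min(d, key=d.get)

# Standard NJ branch lengths from i and j to their new parent (assumed formula).
len_i = (dist(i, j) + r[i] - r[j]) / 2
len_j = dist(i, j) - len_i

print({k: float(v) for k, v in r.items()})
print({k: float(v) for k, v in d.items()})
print((i, j), float(len_i), float(len_j))
```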
Problem 3
(a)
\[ d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right) \]
For p = 0.05, d = 0.0517. For p = 0.2, d = 0.2326 > 4 × 0.0517 = 0.2068. Here d is the average number of substitutions per site, while p is the probability that a given site differs between the two genomes. Since multiple substitutions can occur at the same site, d ≥ p. For small p, the difference between p and d is small. As the evolutionary distance between the two genomes increases, i.e. as p increases, multiple substitutions at the same site become more likely, and the difference between p and d grows. This is intuitively why d increases by more than a factor of 4 when p increases by a factor of 4.
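The two values of d quoted above follow directly from the formula in part (a). A minimal sketch (the function name `jc_distance` is illustrative):

```python
import math


def jc_distance(p):
    """d = -(3/4) ln(1 - (4/3) p), valid for 0 <= p < 3/4."""
    return -0.75 * math.log(1 - 4 * p / 3)


d_small = jc_distance(0.05)
d_large = jc_distance(0.2)
print(round(d_small, 4), round(d_large, 4), round(4 * d_small, 4))
```

Because the logarithm's argument shrinks toward 0 as p approaches 3/4, d grows faster than linearly in p, which is the superlinear behavior described in the text.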
(b)
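\[
\begin{aligned}
r(t_{XY})\,r(t_{YZ}) + 3\,s(t_{XY})\,s(t_{YZ})
&= \frac{1}{4}\left(1 + 3e^{-\mu t_{XY}}\right)\frac{1}{4}\left(1 + 3e^{-\mu t_{YZ}}\right)
 + 3\cdot\frac{1}{4}\left(1 - e^{-\mu t_{XY}}\right)\frac{1}{4}\left(1 - e^{-\mu t_{YZ}}\right) \\
&= \frac{1}{16}\left(1 + 3e^{-\mu t_{XY}} + 3e^{-\mu t_{YZ}} + 9e^{-\mu(t_{XY}+t_{YZ})}
 + 3 - 3e^{-\mu t_{XY}} - 3e^{-\mu t_{YZ}} + 3e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= \frac{1}{16}\left(4 + 12e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= \frac{1}{4}\left(1 + 3e^{-\mu(t_{XY}+t_{YZ})}\right) \\
&= r(t_{XY} + t_{YZ})
\end{aligned}
\]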
(c)
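Using the fact that r = 1 − p and r + 3s = 1:

\[
\begin{aligned}
1 - p_{XZ} &= (1 - p_{XY})(1 - p_{YZ}) + 3\,\frac{p_{XY}}{3}\,\frac{p_{YZ}}{3} \\
&= 1 - p_{XY} - p_{YZ} + p_{XY}\,p_{YZ} + \frac{p_{XY}\,p_{YZ}}{3} \\
&= 1 - \left(p_{XY} + p_{YZ} - \frac{4}{3}\,p_{XY}\,p_{YZ}\right)
\end{aligned}
\]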
(d)
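\[
\begin{aligned}
d_{XY} + d_{YZ}
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XY}\right) - \frac{3}{4}\ln\left(1 - \frac{4}{3}p_{YZ}\right) \\
&= -\frac{3}{4}\ln\left[\left(1 - \frac{4}{3}p_{XY}\right)\left(1 - \frac{4}{3}p_{YZ}\right)\right] \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XY} - \frac{4}{3}p_{YZ} + \frac{4}{3}p_{XY}\cdot\frac{4}{3}p_{YZ}\right) \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}\left(p_{XY} + p_{YZ} - \frac{4}{3}\,p_{XY}\,p_{YZ}\right)\right) \\
&= -\frac{3}{4}\ln\left(1 - \frac{4}{3}p_{XZ}\right) = d_{XZ}
\end{aligned}
\]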
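The identities in parts (b) to (d) can be sanity-checked numerically. In the sketch below the rate `MU` and the branch times `t_xy`, `t_yz` are arbitrary illustrative values, not taken from the problem; r and s are the same-base and different-base probabilities used in part (b).

```python
import math

MU = 1.0  # hypothetical rate, chosen only for the check


def r(t):
    """Probability that a site shows the same base after time t."""
    return 0.25 * (1 + 3 * math.exp(-MU * t))


def s(t):
    """Probability that a site shows one specific different base after time t."""
    return 0.25 * (1 - math.exp(-MU * t))


def jc_distance(p):
    return -0.75 * math.log(1 - 4 * p / 3)


t_xy, t_yz = 0.3, 0.5  # hypothetical branch times
p_xy, p_yz = 1 - r(t_xy), 1 - r(t_yz)

# (b) composing two branches gives the same r as one branch of the total length
lhs_b = r(t_xy) * r(t_yz) + 3 * s(t_xy) * s(t_yz)
# (c) p for the composed branch
p_xz = p_xy + p_yz - (4 / 3) * p_xy * p_yz
# (d) distances add
d_sum = jc_distance(p_xy) + jc_distance(p_yz)

print(lhs_b, r(t_xy + t_yz))
print(p_xz, 1 - r(t_xy + t_yz))
print(d_sum, jc_distance(p_xz))
```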
Problem 4a.i
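[Figure: an ORF running from a start codon to a stop codon, with length n.]

\[
P(L = n) =
\begin{cases}
0, & n = 0 \\
(1 - s)^{n-1}\,s, & n > 0
\end{cases}
\]

\[
s = P(\mathrm{TAA} \cup \mathrm{TAG} \cup \mathrm{TGA})
  = P(\mathrm{TAA}) + P(\mathrm{TAG}) + P(\mathrm{TGA})
  = \frac{1}{64} + \frac{1}{64} + \frac{1}{64} = \frac{3}{64} = 0.046875
\]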
Problem 4a.ii
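With s = 3/64:

\[
\begin{aligned}
P(L \ge 100) &= \sum_{n=100}^{\infty} P(L = n) = 1 - \sum_{n=1}^{99} P(L = n) \\
&= 1 - \sum_{k=0}^{98} (1 - s)^{k}\,s = 1 - s\,\frac{1 - (1 - s)^{99}}{1 - (1 - s)} \\
&= (1 - s)^{99} \approx 0.86265\%
\end{aligned}
\]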
Problem 4a.iii
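[Figure: an ORF running from a start codon to a stop codon, with length n.]

\[
P(L = n) =
\begin{cases}
0, & n = 0 \\
(1 - s)^{n-1}\,s, & n > 0
\end{cases}
\]

\[
\begin{aligned}
s &= P(\mathrm{TAA} \cup \mathrm{TAG} \cup \mathrm{TGA})
   = P(\mathrm{TAA}) + P(\mathrm{TAG}) + P(\mathrm{TGA}) \\
  &= (0.2)(0.2)(0.2) + (0.2)(0.2)(0.3) + (0.2)(0.3)(0.2) = 0.032
\end{aligned}
\]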
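With s = 0.032:

\[
\begin{aligned}
P(L \ge 100) &= \sum_{n=100}^{\infty} P(L = n) = 1 - \sum_{n=1}^{99} P(L = n) \\
&= 1 - \sum_{k=0}^{98} (1 - s)^{k}\,s = 1 - s\,\frac{1 - (1 - s)^{99}}{1 - (1 - s)} \\
&= (1 - s)^{99} \approx 3.99632\%
\end{aligned}
\]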
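Both tail probabilities in Problem 4a come from the same geometric model, so one sketch covers the uniform and the skewed base composition. The frequencies P(A) = P(T) = 0.2 and P(G) = 0.3 are read off the products in the solution; P(C) = 0.3 is an assumption that follows from the frequencies summing to 1. Function names are illustrative.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_probability(freq):
    """s = P(TAA) + P(TAG) + P(TGA) for independent bases with the given frequencies."""
    return sum(freq[a] * freq[b] * freq[c] for a, b, c in STOP_CODONS)


def prob_length_at_least(n, s):
    """P(L >= n) = (1 - s)^(n - 1) when P(L = k) = (1 - s)^(k - 1) s for k > 0."""
    return (1 - s) ** (n - 1)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}  # P(C) inferred by normalization

s_uniform = stop_probability(uniform)
s_skewed = stop_probability(skewed)
tail_uniform = prob_length_at_least(100, s_uniform)
tail_skewed = prob_length_at_least(100, s_skewed)

print(s_uniform, f"{tail_uniform:.5%}")
print(round(s_skewed, 6), f"{tail_skewed:.5%}")
```

Lowering the A and T frequencies makes stop codons rarer, which is why long ORFs are several times more likely under the skewed composition.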
Problem 4b.i
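\[
P(\text{predicting at least 1 gene}) = P(\text{at least 1 ORF} \ge 100 \text{ codons})
 = 1 - P(\text{no ORF} \ge 100 \text{ codons})
\]

Let O_i = the event that there exists an ORF ≥ 100 codons starting at position i, and let \bar{O}_i denote its complement.

\[
\begin{aligned}
P(\text{no ORF} \ge 100 \text{ codons})
&= P(\bar{O}_1 \cap \bar{O}_2 \cap \cdots \cap \bar{O}_{L-299}) \\
&= P(\bar{O}_1 \cap \bar{O}_4 \cap \bar{O}_7 \cap \cdots)\,
   P(\bar{O}_2 \cap \bar{O}_5 \cap \bar{O}_8 \cap \cdots)\,
   P(\bar{O}_3 \cap \bar{O}_6 \cap \bar{O}_9 \cap \cdots)
\end{aligned}
\]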
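Within one reading frame (\bar{O}_i is the complement of O_i):

\[
P(\bar{O}_1 \cap \bar{O}_4 \cap \bar{O}_7 \cap \cdots)
 = P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots)\,
   P(\bar{O}_4 \cap \bar{O}_7 \cap \cdots)
\]

Claim:

\[
P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots) \le P(O_1),
\qquad\text{equivalently}\qquad
P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots) \ge P(\bar{O}_1)
\]

Proof. Condition on whether there is a start codon at position 1; this event does not depend on the codons at positions 4, 7, …:

\[
\begin{aligned}
P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots)
&= P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots,\ \neg\,\text{start codon at 1})\,P(\neg\,\text{start codon at 1}) \\
&\quad + P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots,\ \text{start codon at 1})\,P(\text{start codon at 1})
\end{aligned}
\]

where

\[
\begin{aligned}
P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots,\ \neg\,\text{start codon at 1}) &= 0 = P(O_1 \mid \neg\,\text{start codon at 1}) \\
P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots,\ \text{start codon at 1}) &\le P(O_1 \mid \text{start codon at 1})
\end{aligned}
\]

Taking complements term by term:

\[
\begin{aligned}
P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots)
&= P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots,\ \neg\,\text{start codon at 1})\,P(\neg\,\text{start codon at 1}) \\
&\quad + P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots,\ \text{start codon at 1})\,P(\text{start codon at 1}) \\
&\ge P(\bar{O}_1 \mid \neg\,\text{start codon at 1})\,P(\neg\,\text{start codon at 1})
   + P(\bar{O}_1 \mid \text{start codon at 1})\,P(\text{start codon at 1}) \\
&= P(\bar{O}_1)
\end{aligned}
\]

Hence P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots) ≥ P(\bar{O}_1) and P(O_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots) ≤ P(O_1).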
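Applying P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots) ≥ P(\bar{O}_1) repeatedly, where \bar{O}_i is the event that no ORF ≥ 100 codons starts at position i:

\[
\begin{aligned}
P(\bar{O}_1 \cap \bar{O}_4 \cap \bar{O}_7 \cap \cdots)
&= P(\bar{O}_1 \mid \bar{O}_4 \cap \bar{O}_7 \cap \cdots)\,P(\bar{O}_4 \cap \bar{O}_7 \cap \cdots) \\
&\ge P(\bar{O}_1)\,P(\bar{O}_4 \cap \bar{O}_7 \cap \cdots) \\
&\ge P(\bar{O}_1)\,P(\bar{O}_4)\,P(\bar{O}_7)\cdots
\end{aligned}
\]

\[
\Downarrow
\]

\[
P(\bar{O}_1 \cap \bar{O}_2 \cap \cdots \cap \bar{O}_{L-299})
 \ge P(\bar{O}_1)\,P(\bar{O}_2)\cdots P(\bar{O}_{L-299})
 = \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}
\]

That is,

\[
P(\text{no ORF} \ge 100 \text{ codons}) \ge \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}
\]

\[
\Downarrow
\]

\[
P(\text{predicting at least 1 gene}) \le 1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{L-299}
\]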
Problem 4b.ii
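\[
N_i =
\begin{cases}
0, & \text{no false gene in region } i \\
1, & \text{predict a false gene in region } i
\end{cases}
\]

\[
E[\#\,\text{false predicted genes}]
 = E\left[\sum_{i=1}^{2000} N_i\right]
 = 2000\,P(N_i = 1)
 \le 2000\left(1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{99}\right)^{500-299}\right)
 \approx 53
\]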
\[
\text{sensitivity} = 95\% \quad \text{(given!)}
\]

\[
\text{specificity}
 = \frac{\text{true genes predicted correctly}}{\text{total genes predicted}}
 = \frac{95\% \times 6000}{95\% \times 6000 + 53}
 = 99.1\%
\]
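The bound from 4b.i and the numbers in 4b.ii can be evaluated directly. The sketch below takes the region count (2000), region length (500 bp), gene count (6000), and sensitivity (95%) from the expressions in the solution; the function and variable names are illustrative.

```python
def p_orf_at_position():
    """P(O_i): a start codon (1/64) followed by 99 non-stop codons (61/64 each)."""
    return (1 / 64) * (61 / 64) ** 99


def bound_at_least_one_orf(length_bp):
    """Upper bound 1 - (1 - P(O_i))^(L - 299) on P(at least one ORF >= 100 codons)."""
    return 1 - (1 - p_orf_at_position()) ** (length_bp - 299)


n_regions, region_length = 2000, 500
expected_false_genes = n_regions * bound_at_least_one_orf(region_length)

true_positives = 0.95 * 6000
# The solution rounds the expected number of false predictions to 53.
specificity = true_positives / (true_positives + 53)

print(round(expected_false_genes, 2), f"{specificity:.1%}")
```

Since the 53 false predictions come from an upper bound, the 99.1% figure is best read as a lower bound on specificity under this model.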
Problem 4b.iii
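Intron (2,000 bp):

\[
1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{49}\right)^{2000-149} \approx 93.6\%
\]

Intergenic region (50,000 bp):

\[
1 - \left(1 - \frac{1}{64}\left(\frac{61}{64}\right)^{49}\right)^{50{,}000-149} \approx 99.99999\%
\]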
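The exponents 49 and 149 suggest that this part lowers the ORF threshold to 50 codons (an inference from the formulas, not stated on the slide). A sketch that parameterizes the bound by the minimum ORF length makes the effect of the threshold easy to see; `min_codons` and the function name are hypothetical.

```python
def orf_bound(length_bp, min_codons):
    """1 - (1 - (1/64)(61/64)^(m-1))^(L - 3m + 1) for a minimum ORF length of m codons."""
    p_start_here = (1 / 64) * (61 / 64) ** (min_codons - 1)
    n_positions = length_bp - 3 * min_codons + 1
    return 1 - (1 - p_start_here) ** n_positions


intron = orf_bound(2_000, min_codons=50)        # exponents 49 and 2000 - 149
intergenic = orf_bound(50_000, min_codons=50)   # exponents 49 and 50,000 - 149

print(f"{intron:.1%}", f"{intergenic:.5%}")
# For comparison, the same intron with the 100-codon threshold of part 4b.i:
print(f"{orf_bound(2_000, min_codons=100):.1%}")
```

With `min_codons=100` the exponent reduces to L - 299, the form used in 4b.i, so the same function covers both thresholds.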