Computational Issues on Statistical Genetics
Develop Methods
Data Collection
Analyze Data
Write Reports/Papers
Research Questions
Review the Literature
Test the power and robustness by computer simulation
Database construction (Excel, Access)
Translate data to analyzable form
Preliminary results (figures, tables)
Programming languages
Efficient, feasible
Graphics
Excel graphics
Programmable graphics
Programming Languages
• Fortran, C, C++
• Matrix languages: MATLAB, S-Plus, R, SAS IML
• Symbolic calculation: Mathematica, Maple, MATLAB
• Interface programming: .NET, C#, Visual Basic
• SAS, SPSS, BMDP
• Database: Access, Excel, SQL, SAS, Oracle
• Macro
  – Excel, Access, PowerPoint, Word
  – Editor: WinEdt
  – SAS Macro
Two-Point Analysis in F2: Fully Informative Markers (codominant)
              BB              Bb                        bb
AA   Obs      n22             n21                       n20
     Freq     ¼(1-r)^2        ½r(1-r)                   ¼r^2
     Recom.   0               1                         2
Aa   Obs      n12             n11                       n10
     Freq     ½r(1-r)         ½(1-r)^2 + ½r^2           ½r(1-r)
     Recom.   1               2r^2/[(1-r)^2 + r^2]      1
aa   Obs      n02             n01                       n00
     Freq     ¼r^2            ½r(1-r)                   ¼(1-r)^2
     Recom.   2               1                         0
EM algorithm to estimate the recombination fraction r:
1. Choose a starting value r(0).
2. For t = 0, 1, 2, …, repeat the two steps below while abs[r(t+1) - r(t)] > 1.e-8:
E-step: calculate φ(t) = r(t)^2/[(1-r(t))^2 + r(t)^2], the probability that a double heterozygote AaBb carries two recombinant gametes (so its expected number of recombination events is 2φ(t)).
M-step: r(t+1) = 1/(2n) [2(n20+n02) + (n21+n12+n10+n01) + 2φ(t) n11]
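The E-step weight can be read off the Aa/Bb cell of the frequency table. The short derivation below is a sketch of that reasoning; the symbol φ is an assumed name for the weight (the MATLAB listing calls it phi).

```latex
% The AaBb cell mixes two gamete combinations:
%   two non-recombinant gametes: probability (1/2)(1-r)^2
%   two recombinant gametes:     probability (1/2) r^2
\[
\varphi
  = P(\text{two recombinant gametes} \mid AaBb)
  = \frac{\tfrac12 r^2}{\tfrac12 (1-r)^2 + \tfrac12 r^2}
  = \frac{r^2}{(1-r)^2 + r^2}.
\]
% The M-step divides the expected number of recombinant gametes by the 2n gametes sampled:
\[
r^{(t+1)}
  = \frac{2\,(n_{20}+n_{02}) + (n_{21}+n_{12}+n_{10}+n_{01}) + 2\,\varphi^{(t)}\, n_{11}}{2n}.
\]
```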
[Slide figure: data-entry form for the 3 × 3 table of observed counts (rows AA, Aa, aa; columns BB, Bb, bb), with fields for n and the starting value r0, Start and Reset buttons, and a Result field. The form iterates the E-step and M-step given above.]
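Two-Point Analysis in F2: Fully Informative Markers (codominant)
function r=rEstF2(n22,n21,n20,n12,n11,n10,n02,n01,n00)
% EM estimate of r from the nine F2 genotype counts (codominant markers)
n=n22+n21+n20+n12+n11+n10+n02+n01+n00;   % total sample size
r=0.2; r1=-1;                            % starting value; r1 holds the previous iterate
while (abs(r1-r)>1.e-8)
r1=r;
%E-step: probability that an AaBb individual carries two recombinant gametes
phi=r^2/((1-r)^2+r^2);
%M-step: expected number of recombinant gametes divided by the 2n gametes
r=1/(2*n)*(2*(n20+n02)+(n21+n12+n10+n01)+2*phi*n11);
end
Matlab program to estimate the recombination fraction r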
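The same EM iteration can also be written in matrix form, which may be convenient when the counts already sit in a 3 × 3 table. The script below is only a sketch: the counts in N are hypothetical (they do not come from the slides), and the names N, K, rPrev and iter are illustrative.

```matlab
% Hypothetical F2 counts (rows AA, Aa, aa; columns BB, Bb, bb); not from the slides.
N = [20  8  1;
      9 40  7;
      2  6 22];
% Number of recombinant gametes per cell; the AaBb entry K(2,2) is filled in the E-step.
K = [0 1 2;
     1 0 1;
     2 1 0];
n = sum(N(:));                       % total sample size
r = 0.2; rPrev = -1;                 % same starting value as rEstF2
iter = 0;
while abs(rPrev - r) > 1e-8
    rPrev = r;
    phi = r^2 / ((1 - r)^2 + r^2);   % E-step: P(two recombinant gametes | AaBb)
    K(2,2) = 2 * phi;                % expected recombinants in the AaBb cell
    r = sum(sum(K .* N)) / (2 * n);  % M-step: recombinant gametes / total gametes
    iter = iter + 1;
end
fprintf('r = %.4f after %d iterations\n', r, iter);
```

Because every cell except AaBb has a known number of recombinants, only one entry of K changes between iterations.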
Log-likelihood ratio test statistic
Two alternative hypotheses:
H0: r = 0.5 vs. H1: r ≠ 0.5
Likelihood under H1:
L1(r|nij) = n!/(n22!...n00!) [¼(1-r)^2]^(n22+n00) [¼r^2]^(n20+n02) [½r(1-r)]^(n21+n12+n10+n01) [½(1-r)^2 + ½r^2]^n11
Likelihood under H0:
L0(r=0.5|nij) = n!/(n22!...n00!) [¼(1-0.5)^2]^(n22+n00) [¼(0.5)^2]^(n20+n02) [½(0.5)(1-0.5)]^(n21+n12+n10+n01) [½(1-0.5)^2 + ½(0.5)^2]^n11
LOD = log10[L1(r|nij)/L0(r=0.5|nij)]
    = {(n22+n00)·2[log10(1-r) - log10(1-0.5)] + …} = 6.08 > critical LOD = 3
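The LOD score uses base-10 logarithms, whereas the usual likelihood ratio statistic uses natural logarithms. The conversion below is a standard identity, added here only as a reading aid; the symbol LRT is an assumed name that does not appear on the slides.

```latex
\[
\mathrm{LRT}
  = 2 \ln \frac{L_1(\hat r \mid n_{ij})}{L_0(r = 0.5 \mid n_{ij})}
  = 2 \ln(10)\,\mathrm{LOD}
  \approx 4.605\,\mathrm{LOD}
\]
% A critical LOD of 3 corresponds to LRT \approx 13.8,
% and the observed LOD of 6.08 corresponds to LRT \approx 28.0.
```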
Two-Point Analysis in F2: Fully Informative Markers (codominant)
function LOD=calcLOD_F2(r,n22,n21,n20,n12,n11,n10,n02,n01,n00)
%%log likelihood under H1
LOD=(n22+n00)*log10((1-r)^2/4)...
+(n20+n02)*log10(r^2/4)...
+(n21+n12+n10+n01)*log10(r*(1-r)/2)...
+n11*log10((1-r)^2/2+r^2/2);
%%log likelihood under H0
r=0.5;
LOD0=(n22+n00)*log10((1-r)^2/4)...
+(n20+n02)*log10(r^2/4)...
+(n21+n12+n10+n01)*log10(r*(1-r)/2)...
+n11*log10((1-r)^2/2+r^2/2);
LOD=LOD-LOD0;
Matlab program to calculate the log-likelihood ratio test score (LOD)
Two-Point Analysis in F2: Partially Informative Markers (codominant × dominant)
              B_                                                    bb
AA   Obs      n2_ = n22+n21                                         n20
     Freq     ¼(1-r)^2 + ½r(1-r)                                    ¼r^2
     Recom.   c1 = ½r(1-r)/[¼(1-r)^2 + ½r(1-r)]                     2
Aa   Obs      n1_ = n12+n11                                         n10
     Freq     ½r(1-r) + ½(1-r)^2 + ½r^2                             ½r(1-r)
     Recom.   c2 = [½r(1-r) + r^2]/[½r(1-r) + ½(1-r)^2 + ½r^2]      1
aa   Obs      n0_ = n02+n01                                         n00
     Freq     ¼r^2 + ½r(1-r)                                        ¼(1-r)^2
     Recom.   c3 = [2·¼r^2 + ½r(1-r)]/[¼r^2 + ½r(1-r)]              0
Estimate of r = (c1·n2_ + c2·n1_ + c3·n0_ + 2·n20 + n10)/(2n)
Two-Point Analysis in F2: Partially Informative Markers (codominant × dominant)
E-step
c1 = ½r(1-r)/[¼(1-r)^2 + ½r(1-r)]
c2 = [½r(1-r) + r^2]/[½r(1-r) + ½(1-r)^2 + ½r^2]
c3 = [2·¼r^2 + ½r(1-r)]/[¼r^2 + ½r(1-r)]
M-step
r = (c1·n2_ + c2·n1_ + c3·n0_ + 2·n20 + n10)/(2n)
[Slide figure: data-entry form for the 3 × 2 table of observed counts (rows AA, Aa, aa; columns B_, bb), with fields for n and the starting value r0, Start and Reset buttons, and a Result field.]
Two-Point Analysis in F2: Partially Informative Markers (codominant × dominant)
function r=rEstF2CoXdomin(n2_,n1_,n0_,n20,n10,n00)
% EM estimate of r when marker A is codominant and marker B is dominant
n=n2_+n1_+n0_+n20+n10+n00;
r=0.2;r1=-1;
while(abs(r1-r)>1.e-8)
r1=r;
%E-step: expected number of recombinant gametes in each pooled B_ class
c1=(1/2*r*(1-r))/(1/4*(1-r)^2+1/2*r*(1-r));
c2=(1/2*r*(1-r)+r^2)/(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2);
c3=(2*1/4*r^2+1/2*r*(1-r))/(1/4*r^2+1/2*r*(1-r));
%M-step: AA bb carries 2 recombinants, Aa bb carries 1, aa bb carries 0
r=(c1*n2_+c2*n1_+c3*n0_+2*n20+n10)/(2*n);
end
Matlab program to estimate the recombination fraction r
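One way to sanity-check an EM update of this kind is to feed it expected counts generated from a known r and confirm that the iteration returns that value. The script below is a sketch under stated assumptions: rTrue and n are hypothetical, the anonymous-function names are illustrative, and the M-step weights follow the Recom. row of the table (2 for AA bb, 1 for Aa bb, 0 for aa bb).

```matlab
% Cell probabilities as functions of r (marker A codominant, marker B dominant).
pAAB = @(r) (1-r)^2/4 + r*(1-r)/2;            % AA B_
pAaB = @(r) r*(1-r)/2 + (1-r)^2/2 + r^2/2;    % Aa B_
paaB = @(r) r^2/4 + r*(1-r)/2;                % aa B_
pAAb = @(r) r^2/4;                            % AA bb
pAab = @(r) r*(1-r)/2;                        % Aa bb
paab = @(r) (1-r)^2/4;                        % aa bb

rTrue = 0.10; n = 1000;                       % hypothetical values, not from the slides
total = pAAB(rTrue) + pAaB(rTrue) + paaB(rTrue) ...
      + pAAb(rTrue) + pAab(rTrue) + paab(rTrue);   % should equal 1

% Expected (not sampled) counts under rTrue.
nAAB = n*pAAB(rTrue); nAaB = n*pAaB(rTrue); naaB = n*paaB(rTrue);
nAAb = n*pAAb(rTrue); nAab = n*pAab(rTrue);

r = 0.2; rPrev = -1;
while abs(rPrev - r) > 1e-8
    rPrev = r;
    % E-step: expected recombinant gametes per individual in each B_ class
    c1 = (r*(1-r)/2) / pAAB(r);
    c2 = (r*(1-r)/2 + r^2) / pAaB(r);
    c3 = (r^2/2 + r*(1-r)/2) / paaB(r);
    % M-step: AA bb contributes 2, Aa bb contributes 1, aa bb contributes 0
    r = (c1*nAAB + c2*nAaB + c3*naaB + 2*nAAb + nAab) / (2*n);
end
fprintf('true r = %.4f, EM estimate = %.4f\n', rTrue, r);
```

With expected counts, the true value is a fixed point of the update, so a mismatch here would point to a wrong weight in the M-step.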
Two-Point Analysis in F2: Partially Informative Markers (codominant × dominant)
Matlab program to calculate the log-likelihood ratio test score (LOD)
function LOD=calcLOD_F2CoXdomin(r,n2_,n1_,n0_,n20,n10,n00)
%%log likelihood under H1
LOD=log(1/4*(1-r)^2+1/2*r*(1-r))*n2_ ...
+log(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2)*n1_ ...
+log(1/4*r^2+1/2*r*(1-r))*n0_ ...
+log(r^2/4)*n20+log(r*(1-r)/2)*n10+log((1-r)^2/4)*n00;
%%log likelihood under H0
r=0.5;
LOD0=log(1/4*(1-r)^2+1/2*r*(1-r))*n2_ ...
+log(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2)*n1_ ...
+log(1/4*r^2+1/2*r*(1-r))*n0_ ...
+log(r^2/4)*n20+log(r*(1-r)/2)*n10+log((1-r)^2/4)*n00;
LOD=LOD-LOD0;
LOD=LOD/log(10);
Two-Point Analysis in F2: Partially Informative Markers (dominant)
              B_                                          bb
A_   Obs      n1 = n22+n21+n12+n11                        n2 = n20+n10
     Freq     ¼(1-r)^2 + r(1-r) + ½(1-r)^2 + ½r^2         ¼r^2 + ½r(1-r)
     Recom.   c1                                          c2
aa   Obs      n3 = n02+n01                                n4 = n00
     Freq     ¼r^2 + ½r(1-r)                              ¼(1-r)^2
     Recom.   c2                                          0
where c1 = [r^2 + r(1-r)]/[¼(1-r)^2 + r(1-r) + ½(1-r)^2 + ½r^2] and c2 = [2(¼r^2) + ½r(1-r)]/[¼r^2 + ½r(1-r)] are the expected numbers of recombinant gametes.
Estimate of r = (c1·n1 + c2·n2 + c2·n3)/(2n)
[Slide figure: data-entry form for the 2 × 2 table of observed counts (rows A_, aa; columns B_, bb), with fields for n and the starting value r0, Start and Reset buttons, and a Result field. The form iterates c1 = [r^2 + r(1-r)]/[¼(1-r)^2 + r(1-r) + ½(1-r)^2 + ½r^2], c2 = [2(¼r^2) + ½r(1-r)]/[¼r^2 + ½r(1-r)] and r = (c1·n1 + c2·n2 + c2·n3)/(2n).]
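Two-Point Analysis in F2: Partially Informative Markers (dominant)
function r=rEstF2Partial(n1,n2,n3,n4)
% EM estimate of r from the four phenotype classes A_B_, A_bb, aaB_, aabb
n=n1+n2+n3+n4;
r=0.2;r1=-1;
while (abs(r1-r)>1.e-8)
r1=r;
%E-step: expected recombinant gametes per individual in classes n1 (c1) and n2, n3 (c2)
c1=(r^2+r*(1-r))/((1-r)^2/4+r*(1-r)+(1-r)^2/2+r^2/2);
c2=(r^2/2+r*(1-r)/2)/(r^2/4+r*(1-r)/2);
%M-step: class n4 (aabb) carries no recombinant gametes
r=1/(2*n)*(c1*n1+c2*n2+c2*n3);
end
Matlab program to estimate the recombination fraction r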
Log-likelihood ratio test statistic: Partially Informative Markers (dominant)
Two alternative hypotheses:
H0: r = 0.5 vs. H1: r ≠ 0.5
Likelihood under H1:
L1(r|nij) = n!/(n1!...n4!) [¾(1-r)^2 + r(1-r) + ½r^2]^n1 [¼r^2 + ½r(1-r)]^(n2+n3) [¼(1-r)^2]^n4
Likelihood under H0:
L0(r=0.5|nij) = n!/(n1!...n4!) [¾(1-0.5)^2 + 0.5(1-0.5) + ½(0.5)^2]^n1 [¼(0.5)^2 + ½(0.5)(1-0.5)]^(n2+n3) [¼(1-0.5)^2]^n4
LOD = log10[L1(r|nij)/L0(r=0.5|nij)] = 3.17 > critical LOD = 3
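A quick check on the class probabilities used in this likelihood is that they sum to one for any r and reduce to the independent-assortment ratio 9:3:3:1 at r = 0.5. The script below sketches that check; the names p and lod are illustrative, and the counts passed to lod are hypothetical.

```matlab
% Probabilities of the four classes (A_B_, A_bb, aaB_, aabb) as a function of r.
p = @(r) [((1-r)^2*3/4 + r*(1-r) + r^2/2), ...
          (r^2/4 + r*(1-r)/2), ...
          (r^2/4 + r*(1-r)/2), ...
          ((1-r)^2/4)];

% LOD for counts n = [n1 n2 n3 n4]; the multinomial constant cancels in the ratio.
lod = @(r, n) sum(n .* (log10(p(r)) - log10(p(0.5))));

p0 = p(0.5);                       % probabilities under H0
disp(p0 * 16);                     % ratio of the four classes under H0
sumAt03 = sum(p(0.3));             % should equal 1 for any r
lodAtNull = lod(0.5, [9 3 3 1]);   % hypothetical counts; LOD is 0 at r = 0.5
```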
Two-Point Analysis in F2: Partially Informative Markers (dominant)
function LOD=calcLOD_F2Partial(r,n1,n2,n3,n4)
%%log likelihood under H1
LOD=(n1)*log10((1-r)^2*3/4+r^2/2+r*(1-r))...
+(n2+n3)*log10(r^2/4+r*(1-r)/2)...
+(n4)*log10((1-r)^2/4);
%%log likelihood under H0
r=0.5;
LOD0=(n1)*log10((1-r)^2*3/4+r^2/2+r*(1-r))...
+(n2+n3)*log10(r^2/4+r*(1-r)/2)...
+(n4)*log10((1-r)^2/4);
LOD=LOD-LOD0;
Matlab program to calculate the log-likelihood ratio test score (LOD)
Three-Point Analysis in Backcross: a rice data set
[Slide figure: genetic linkage maps of rice chromosomes chrom1, chrom2, chrom3 and chrom4, each drawn as a column of marker names (such as RG472, RG246 and K5) with the map distance printed between adjacent markers.]
Three-Point Analysis in Backcross
Summarize the data as:
A,B,C (code)   A,B,C (genotype)   Obs.    Recombination A & B   Recombination B & C
111            abc                nabc    0                     0
112            abC                nabC    0                     1
121            aBc                naBc    1                     1
122            aBC                naBC    1                     0
211            Abc                nAbc    1                     0
212            AbC                nAbC    1                     1
221            ABc                nABc    0                     1
222            ABC                nABC    0                     0
Rice Data
A,B,C (code)   A,B,C (genotype)   Obs.         Recombination A & B   Recombination B & C
111            abc                nabc = 31    0                     0
112            abC                nabC = 10    0                     1
121            aBc                naBc = 1     1                     1
122            aBC                naBC = 11    1                     0
211            Abc                nAbc = 5     1                     0
212            AbC                nAbC = 2     1                     1
221            ABc                nABc = 2     0                     1
222            ABC                nABC = 38    0                     0
Marker RG472 is denoted by A, RG246 by B, and K5 by C.
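The eight genotype counts can be pooled into four recombination classes using the two indicator columns of the table. The script below is a small sketch of that bookkeeping; the vector names obs, recAB and recBC are illustrative, while the counts are those of the rice table.

```matlab
% Counts in table order: abc, abC, aBc, aBC, Abc, AbC, ABc, ABC.
obs   = [31 10 1 11 5 2 2 38];
recAB = [ 0  0 1  1 1 1 0  0];   % 1 if recombinant between A and B
recBC = [ 0  1 1  0 0 1 1  0];   % 1 if recombinant between B and C
n = sum(obs);

% Pool complementary genotypes; the first index refers to A & B, the second to B & C.
n00 = sum(obs(recAB == 0 & recBC == 0));   % ABC or abc
n01 = sum(obs(recAB == 0 & recBC == 1));   % ABc or abC
n10 = sum(obs(recAB == 1 & recBC == 0));   % Abc or aBC
n11 = sum(obs(recAB == 1 & recBC == 1));   % AbC or aBc

% Pairwise recombination fractions.
rAB = (n10 + n11) / n;
rBC = (n01 + n11) / n;
rAC = (n01 + n10) / n;   % A and C recombine when exactly one interval recombines
fprintf('n00=%d n01=%d n10=%d n11=%d\n', n00, n01, n10, n11);
fprintf('rAB=%.2f rBC=%.2f rAC=%.2f\n', rAB, rBC, rAC);
```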
Multilocus likelihood: determination of the most likely gene order
• Consider three markers A, B, C, with no particular order assumed.
• A triply heterozygous F1 ABC/abc is backcrossed to a pure parent abc/abc.
Genotype          ABC or abc         ABc or abC      Abc or aBC      AbC or aBc
Obs.              n00 = 69           n01 = 12        n10 = 16        n11 = 3
Frequency under
  Order A-B-C     (1-rAB)(1-rBC)     (1-rAB)rBC      rAB(1-rBC)      rAB rBC
  Order A-C-B     (1-rAC)(1-rBC)     rAC rBC         rAC(1-rBC)      (1-rAC)rBC
  Order B-A-C     (1-rAB)(1-rAC)     (1-rAB)rAC      rAB rAC         rAB(1-rAC)
rAB = the recombination fraction between A and B = (n10 + n11)/n = 0.19
rBC = the recombination fraction between B and C = (n01 + n11)/n = 0.15
rAC = the recombination fraction between A and C = (n01 + n10)/n = 0.28
Which order is the most likely?
LABC = (1-rAB)^(n00+n01) (1-rBC)^(n00+n10) (rAB)^(n10+n11) (rBC)^(n01+n11)
LACB = (1-rAC)^(n00+n11) (1-rBC)^(n00+n10) (rAC)^(n01+n10) (rBC)^(n01+n11)
LBAC = (1-rAB)^(n00+n01) (1-rAC)^(n00+n11) (rAB)^(n10+n11) (rAC)^(n01+n10)
Log(LABC) = -90.8932
Log(LACB) = -101.5662
Log(LBAC) = -107.9176
(natural logarithms)
According to the maximum likelihood principle, the linkage order that gives the maximum likelihood for a data set is the order best supported by the data.
The best linkage order is A, B, C, with map distances of 20 cM between A and B and 15 cM between B and C.
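The three order log-likelihoods can be reproduced directly from the four pooled counts. The script below is a sketch of that calculation using natural logarithms; the helper logL and its argument names are illustrative.

```matlab
% Pooled backcross counts (first index: A & B recombinant, second: B & C recombinant).
n00 = 69; n01 = 12; n10 = 16; n11 = 3;
n = n00 + n01 + n10 + n11;

% Numbers of recombinants for each pair of markers.
kAB = n10 + n11;  kBC = n01 + n11;  kAC = n01 + n10;
rAB = kAB / n;    rBC = kBC / n;    rAC = kAC / n;

% Log-likelihood of an order whose two adjacent intervals have recombination
% fractions r1, r2 and recombinant counts k1, k2.
logL = @(r1, k1, r2, k2) (n - k1)*log(1 - r1) + k1*log(r1) ...
                       + (n - k2)*log(1 - r2) + k2*log(r2);

logLABC = logL(rAB, kAB, rBC, kBC);   % intervals A-B and B-C
logLACB = logL(rAC, kAC, rBC, kBC);   % intervals A-C and C-B
logLBAC = logL(rAB, kAB, rAC, kAC);   % intervals B-A and A-C

orders = {'A-B-C', 'A-C-B', 'B-A-C'};
[bestLogL, best] = max([logLABC, logLACB, logLBAC]);
fprintf('logL: %.4f %.4f %.4f\n', logLABC, logLACB, logLBAC);
fprintf('most likely order: %s\n', orders{best});
```

Each order uses only the two intervals between adjacent markers, so the orders differ only in which pair of recombination fractions enters the product.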
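DATA
Genotype   ABC or abc   ABc or abC   Abc or aBC   AbC or aBc
Obs.       n00 = 69     n01 = 12     n10 = 16     n11 = 3
Result:
rAB = (n10 + n11)/n = 0.19
rBC = (n01 + n11)/n = 0.15
rAC = (n01 + n10)/n = 0.28
Map distances (Kosambi mapping function, in Morgans, converted to cM):
dAB = 1/4·ln[(1 + 2rAB)/(1 - 2rAB)] = 0.20 Morgan = 20 cM
dBC = 1/4·ln[(1 + 2rBC)/(1 - 2rBC)] ≈ 0.15 Morgan = 15 cM
Log(LABC) = -90.8932
Log(LACB) = -101.5662
Log(LBAC) = -107.9176
The best linkage order is A, B, C, with 20 cM between A and B and 15 cM between B and C.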
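The map distances quoted above follow from the formula d = 1/4·ln[(1+2r)/(1-2r)], which is the Kosambi mapping function and gives d in Morgans. The snippet below is a minimal sketch of the conversion to centimorgans; the function handle name kosambi_cM is illustrative.

```matlab
% Kosambi mapping function: recombination fraction r (0 <= r < 0.5) to distance in cM.
kosambi_cM = @(r) 100 * 0.25 * log((1 + 2*r) ./ (1 - 2*r));

r = [0.19 0.15];          % rAB and rBC from the rice data
d = kosambi_cM(r);        % element-wise distances in cM
fprintf('dAB = %.1f cM, dBC = %.1f cM\n', d(1), d(2));
```

The distance for B to C comes out slightly above 15 cM, which the slide reports rounded down to 15.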