basic model for genetic linkage analysis lecture #5 prepared by dan geiger

.

Basic Model For Genetic Linkage Analysis

Lecture #5

Prepared by Dan Geiger

2

Using the Maximum Likelihood Approach

The probability of pedigree data Pr(data | ) is a function of the known and unknown recombination fractions denoted collectively by .

How can we construct this likelihood function ?

The maximum likelihood approach is to seek the value of which maximizes the likelihood function Pr(data | ) . This is the ML estimate.

3

Constructing the Likelihood function

Lijm = Maternal allele at locus i of person j. The values of this variables are the possible alleles li at locus i.

First, we need to determine the variables that describe the problem. There are many possible choices. Some variables we can observe and some we cannot.

Xij = Unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li,l’i).

Lijf = Paternal allele at locus i of person j. The values of this variables are the possible alleles li at locus i (Same as for Lijm) .

As a starting point, We assume that the data consists of an assignment to a subset of the variables {Xij}. In other words some (or all) persons are genotyped at some (or all) loci.

4

What is the relationships among the variables for a specific individual ?

L11fL11m

X11

Paternal allele at locus 1 of person 1

Unordered allele pair at locus 1 of person 1 = data

Maternal allele at locus 1 of person 1

P(L11m = a) is the frequency of allele a. We use lower case letters for states writing, in short, P(l11m).

P(x11 | l11m, l11f) = 0 or 1 depending on consistency

5

What is the relationships among the variables across individuals ?

L11fL11m

L13m

X11

P(l13m | l11m, l11f) = 1/2 if l13m = l11m or l13m = l11f

P(l13m | l11m, l11f) = 0 otherwise

L12fL12m

L13f

X12

X12

First attempt: correct but not efficient as we shall see.

Mother Father

Offspring

6

Probabilistic model for two lociL11fL11m

L13m

X11

L12fL12m

L13f

X12

X13 Model for locus 1

L21fL21m

L23m

X21

L22fL22m

L23f

X22

X23Model for locus 2

L23m depends on whether L13m got the value from L11m or L11f, whether a recombination occurred,and on the values of L21m and L21f. This is quite complex.

7

Adding a selector variableL11fL11m

L13m

X11

S13m

Selector of maternal allele at locus 1 of person 3

Maternal allele at locus 1 of person 3 (offspring)

Selector variables Sijm are 0 or 1 depending on whose allele is transmitted to offspring i at maternal locus j.

P(s13m) = ½

P(l13m | l11m, l11f,,S13m=0) = 1 if l13m = l11m

P(l13m | l11m, l11f,,S13m=1) = 1 if l13m = l11f

P(l13m | l11m, l11f,,s13m) = 0 otherwise

8

Probabilistic model for two lociS13m

L11fL11m

L13m

X11 S13f

L12fL12m

L13f

X12

X13Model for locus 1

S23m

L21fL21m

L23m

X21 S23f

L22fL22m

L23f

X22

X23

Model for locus 2

9

Probabilistic Model for Recombination

S23m

L21fL21m

L23m

X21 S23f

L22fL22m

L23f

X22

X23

S13m

L11fL11m

L13m

X11 S13f

L12fL12m

L13f

X12

X13

{m,f}tssP tt

where

1

1),|( 1323

is the recombination fraction between loci 2 & 1.

10

Constructing the likelihood function I

S13m

L11fL11m

L13m

X11

P(l11m, l11f,, x11, s13m,l13m) = P(l11m) P(l11f) P(x11 | l11m, l11f,) P(s13m) P(l13m | s13m, l11m, l11f)

Joint probability

Prob(data) = P(x11) =

l11m l11f s13m l13m P(l11m, l11f,, x11, s13m,l13m)

Probability of data (sum over all states of all hidden variables)

All other variables are not-observed (hidden)

Observed variable

11

Constructing the likelihood function II

= P(l11m) P(l11f) P(x11 | l11m, l11f,) … P(s13m) P(s13f) P(s23m | s13m, 2) P(s23m | s13m, 2)

P(l11m,l11f,x11,l12m,l12f,x12,l13m,l13f,x13, l21m,l21f,x21,l22m,l22f,x22,l23m,l23f,x23,

s13m,s13f,s23m,s23f, 2) = Product over all local probability tables

Prob(data| 2) = P(x11 ,x12 ,x13 ,x21 ,x22 ,x23) =

Probability of data (sum over all states of all hidden variables)

Prob(data| 2) = P(x11 ,x12 ,x13 ,x21 ,x22 ,x23) = l11m, l11f … s23f [P(l11m)

P(l11f) P(x11 | l11m, l11f,) … P(s13m) P(s13f) P(s23m | s13m, 2) P(s23m | s13m, 2) ]

The result is a function of the recombination fraction. The ML estimate is the 2 value that maximizes this function.

12

Modeling Phenotypes IL11fL11m

L13m

X11

S13m

Phenotype variables Yij are 0 or 1 depending on whether a phenotypic trait associated with locus i of person j is observed. E.g., sick versus healthy. For example model of perfect recessive disease yields the penetrance probabilities:P(y11 = sick | X11= (a,a)) = 1

P(y11 = sick | X11= (A,a)) = 0P(y11 = sick | X11= (A,A)) = 0

Y11

13

Modeling Phenotypes IIL11fL11m

L13m

X11

S13m

Note that in this model we assume the phenotype depends only on the alleles of one locus. Also we did not model levels of sickness. We did not model continuous phenotypic observations either.

Y11

14

Introducing a tentative disease Locus

S23m

L21fL21m

L23m

X21 S23f

L22fL22m

L23f

X22

X23

S13m

L11fL11m

L13m

X11 S13f

L12fL12m

L13f

X12

X13

1

1),|( '1323 tt ssP

The recombination fraction 2 is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.

Disease locus: assumesick means xij=(a,a)

Marker locus

15

Locus-by-Locus Summation order

Sum over locus i vars before summing over locus i+1 varsSum over orange vars (Lijt) before summing selector vars (Sijt).

This order yields a Hidden Markov Model (HMM).

Si3

m

Li1

fL

i1m

Li3

m

Xi1

Si3

f

Li2

fL

i2m

Li3

f

Xi2

Xi3

1 2 3 4

16

Hidden Markov Models in General

Application in communication: message sent is (s1,…,sm) but we receive (r1,…,rm) . Compute what is the most likely message sent ?

Application in speech recognition: word said is (s1,…,sm) but we recorded (r1,…,rm) . Compute what is the most likely word said ?

Application in Genetic linkage analysis: to be discussed now.

X1 X2 X3 Xi-1 Xi Xi+1R1 R2 R3 Ri-1 Ri Ri+1

X1 X2 X3 Xi-1 Xi Xi+1S1 S2 S3 Si-1 Si Si+1

),...,( 1

11

1

)|()|()(),,(Lss

ii

L

iiiiL srpsspsprrp

Which depicts the factorization:

17

Hidden Markov Model In our case

X1 X2 X3 Xi-1 Xi Xi+1X1 X2 X3 Xi-1 Xi Xi+1

X1 X2 X3 Xi-1 Xi Xi+1S1 S2 S3 Si-1 Si Si+1

The compounded variable Si = (Si,1,m,…,Si,2n,f) is called the inheritance vector. It has 22n states where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi = (Xi,1,m,…,Xi,2n,f) is the data regarding locus i.

To specify the HMM we need to write down the transition matrices from Si-1 to Si and the matrices P(xi|Si). Note that these quantities have already been implicitly defined.

18

The transition matrix

{m,f}tMssPtt

tttttt

where

1

1)(),|( 1323

Recall that:

Therefore, in our example, where we have one non-founder (n=1), the transition probability table size is 4 4 = 22n 22n, encoding four options of recombination/non-recombination for the two parental meiosis:

mm

mmf

mm

mmf

mm

mmf

mm

mmf

fmfmfm ssssP

1

11

1

1

1

1

1

11

),,,|,( 13132323

)()( fm MM (The Kronecker product) fmhen :w

)()()(... n times ...)()( 2 MMMMM nfmfm

For n non-founders, the transition matrix is the n-fold Kronecker product:

19

Probability of data in one locus given an inheritance vector

S23m

L21fL21m

L23m

X21 S23f

L22fL22m

L23f

X22

X23

Model for locus 2

P(x21, x22 , x23 |s23m,s23f) =

= P(l21m) P(l21f) P(l22m) P(l22f) P(x21 | l21m, l21f)

P(x22 | l22m, l22f) P(x23 | l23m, l23f) P(l23m | l21m, l21f, S23m) P(l23f | l22m, l22f, S23f)

l21m,l21f,l22m,l22f l22m,l22f

The five last terms are always zero-or-one, namely, indicator functions.

20

Posterior decoding

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

The standard query for HMM is belief update (also called posterior decoding).

1. Compute the posteriori belief in Hi (specific i) given the evidence {x1,…,xL} for each of Hi’s values hi, namely, compute p(hi | x1,…,xL).

2. Do the same computation for every Hi but without repeating the first task L times.

The solution is called the forward-backward algorithm.

21

Likelihood of evidence

To compute the likelihood of evidence P(x1,…,xL), which depends on the recombination fractions in our case, we will use either the forward or the backward algorithm to be described now.

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

22

Decomposing the computation

P(x1,…,xL,hi) = P(x1,…,xi,hi) P(xi+1,…,xL | x1,…,xi,hi)

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Equality due to Ind({xi+1,…,xL}, {x1,…,xi} | Hi}

= P(x1,…,xi,hi) P(xi+1,…,xL | hi) f(hi) b(hi)

Answer: P(hi | x1,…,xL) = (1/K) P(x1,…,xL,hi)

where K= hi P(x1,…,xL,hi).

23

The forward algorithm

P(x1,x2,h2) = P(x1,h1,h2,x2) {Second step}

= P(x1,h1) P(h2 | x1,h1) P(x2 | x1,h1,h2)

h1

h1Last equality due to conditional independence = P(x1,h1) P(h2 | h1) P(x2 | h2)h1

H1 H2

X1 X2

Hi

Xi

The task: Compute f(hi) = P(x1,…,xi,hi) for i=1,…,L (namely, considering evidence up to time slot i).

P(x1, h1) = P(h1) P(x1|h1) {Basis step}

P(x1,…,xi,hi) = P(x1,…,xi-1, hi-1) P(hi | hi-1 ) P(xi | hi)hi-1

{step i}

24

The backward algorithm

The task: Compute b(hi) = P(xi+1,…,xL|hi) for i=L-1,…,1 (namely, considering evidence after time slot i).

HL-1 HL

XL-1 XL

Hi Hi+1

Xi+1

P(xL| hL-1) = P(xL ,hL |hL-1) = P(hL |hL-1) P(xL |hL-1 ,hL )= hL hL

Last equality due to conditional independence

= P(hL |hL-1) P(xL |hL ) {first step}

hL

P(xi+1,…,xL|hi) = P(hi+1 | hi) P(xi+1 | hi+1) P(xi+2,…,xL| hi+1)

hi+1

{step i}

=b(hi)= =b(hi+1)=

25

The combined answer

1. To Compute the posteriori belief in Hi (specific i) given the evidence {x1,…,xL} run the forward algorithm and compute f(hi) = P(x1,…,xi,hi), run the backward algorithm to compute b(hi) = P(xi+1,…,xL|hi), the product f(hi)b(hi) is the answer (for every possible value hi).

2. To Compute the posteriori belief for every Hi

simply run the forward and backward algorithms once, storing f(hi) and b(hi) for every i (and value hi). Compute f(hi)b(hi) for every i.

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

26

Likelihood of evidence revisited

1. To compute the likelihood of evidence P(x1,…,xL), do one more step in the forward algorithm, namely,

f(hL) = P(x1,…,xL,hL)

2. Alternatively, do one more step in the backward algorithm, namely,

b(h1) P(h1) P(x1|h1) = P(x2,…,xL|h1) P(h1) P(x1|h1)

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

hL

h1

hL

h1

27

Time and Space Complexity of the forward/backward algorithms

Time complexity is linear in the length of the chain, provided the number of states of each variable is a constant. More precisely, time complexity is O(k2n) where k is the maximum domain size of each variable.

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

Space complexity is also O(k2n).In our case: O(k2n) is really O(24nL). Next class we will see how to save on these computations using the special matrices we have. The savings have been implemented in GeneHunter – a leading software of linkage.

28

The Maximum APosteriori queryH1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

1. Recall that the query asking likelihood of

evidence is to compute P(x1,…,xL) = P(x1,…,xL, h1,…,hL)

2. Now we wish to compute a similar quantity:

P*(x1,…,xL) = MAX P(x1,…,xL, h1,…,hL)(h1,…,hL)

(h1,…,hL)

And, of course, we wish to find a MAP assignment (h1

*,…,hL*) that brought about this maximum.

29

Example: Revisiting likelihood of evidence

H1 H2

X1 X2

H3

X3

P(x1,x2,x3) = P(h1)P(x1|h1) P(h2|h1)P(x2|h2) P(h3 |h2)P(x3|h3)h3h2h1

= P(h1)P(x1|h1) b(h2) P(h1|h2)P(x2|h2)h1 h2

= b(h1) P(h1)P(x1|h1)h1

30

Example: Computing the MAP assignment

H1 H2

X1 X2

H3

X3

maximum = max P(h1)P(x1|h1) max P(h2|h1)P(x2|h2) max P(h3 |h2)P(x3|h3)

h3h2h1

= max P(h1)P(x1|h1) max b (h2) P(h1|h2)P(x2|h2)h1 h2h3

Replace sums with taking maximum:

= max b (h1) P(h1)P(x1|h1)

h1 h2{Finding the maximum}

h1* = arg max b (h1) P(h1)P(x1|

h1)h1

h2{Finding the map assignment}

h2* = x* (h1

*); h2

x* (h2)h3

x* (h1)h2

h3* = x* (h2

*)h3

31

Viterbi’s algorithm

For i=1 to L-1 do

h1* = ARG MAX P(h1) P(x1|h1) b (h1)

h2h2

hi+1* = x* (hi *)hi+1

Forward phase (Tracing the MAP assignment) :

H1 H2 HL-1 HL

X1 X2 XL-1 XL

Hi

Xi

x* (hi) = ARGMAX P(hi+1 | hi) P(xi+1 | hi+1) b (hi+1)

For i=L-1 downto 1 dob (hi) = MAX P(hi+1 | hi) P(xi+1 | hi+1) b (hi+1)

hi+1hi+1 hi+2

b (hL) = 1hL+1

hi+1hi+1 hi+2

Backward phase:

(Storing the best value as a function of the parent’s values)

basic model for genetic linkage analysis lecture #5 prepared by dan geiger

Documents