The Questions Why study haplotypes? How can haplotypes be inferred? What are haplotype blocks? How can haplotype information be used to test associations with disease phenotypes? How shall we select a subset of informative SNPs for large-scale typing? How can haplotype information be visualized

Upload: derrick-price

Post on 19-Dec-2015


The Questions

• Why study haplotypes?
• How can haplotypes be inferred?
• What are haplotype blocks?
• How can haplotype information be used to test associations with disease phenotypes?
• How shall we select a subset of informative SNPs for large-scale typing?
• How can haplotype information be visualized?


Methods for inferring haplotype blocks and informative SNP selection

Detecting haplotype blocks on Chromosomes 6,21,22


Hypothesis – Haplotype Blocks?

• The genome consists largely of blocks of common SNPs, with relatively little recombination shuffling within the blocks
– Patil et al., Science, 2001; Jeffreys et al., Nature Genetics; Daly et al., Nature Genetics, 2001
• Compare block detection methods.
– How well can we detect haplotype blocks?
– Are the detection methods consistent?


Block detection methods

• Four-gamete test, Hudson and Kaplan, Genetics, 1985, 111, 147-164.
– A segment of SNPs is a block if, for every pair of SNPs (with alleles a/A and b/B), at most 3 of the 4 possible gametes (ab, aB, Ab, AB) are observed.
• P-value test
– A segment of SNPs is a block if for 95% of the pairs of SNPs we can reject the hypothesis (with P-value 0.05 or 0.001) that they are in linkage equilibrium.
• LD-based, Gabriel et al., Science, 2002, 296:2225-9
– Next slide
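The four-gamete criterion above can be sketched in a few lines. This is a minimal illustration, assuming phased haplotypes are given as equal-length strings (one character per bi-allelic SNP); the function names and the toy sample are hypothetical, not taken from the cited paper.

```python
from itertools import combinations


def gametes(haplotypes, i, j):
    """Distinct allele pairs (gametes) observed at SNP columns i and j."""
    return {(h[i], h[j]) for h in haplotypes}


def is_four_gamete_block(haplotypes, start, end):
    """True if every pair of SNPs in columns start..end-1 shows at most 3 gametes."""
    return all(
        len(gametes(haplotypes, i, j)) <= 3
        for i, j in combinations(range(start, end), 2)
    )


# Hypothetical toy sample: rows are haplotypes, columns are SNPs coded 0/1.
sample = ["000", "011", "110", "111"]
print(is_four_gamete_block(sample, 0, 2))  # SNPs 1-2 only
print(is_four_gamete_block(sample, 0, 3))  # SNPs 1-3
```

In the toy sample, SNPs 1 and 3 show all four gametes, so the segment covering all three SNPs fails the test while each adjacent pair passes.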


Gabriel et al. method

• For every pair of SNPs we calculate an upper and a lower confidence bound on D’ (call these D’u and D’l).
• We then split the pairs of SNPs into 3 classes:
– Class I: Two SNPs are in ‘strong LD’ if D’u > 0.98 and D’l > 0.7.
– Class II: Two SNPs show ‘strong evidence for recombination’ if D’u < 0.9.
– Class III: The remaining SNP pairs; these are “uninformative”.
• A contiguous set of SNPs is a block if (Class II)/(Class I + Class II) < 5%.
• Special rules determine whether 2, 3 or 4 SNPs are a block.
• Furthermore, there are distance requirements on the chromosome to determine if the SNPs are a block.
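The classification and the 5% rule can be sketched as follows. This is a simplified illustration only: it assumes the confidence bounds on D’ have already been computed elsewhere, and it leaves out the special rules for 2 to 4 SNPs and the distance requirements mentioned above. The function names are hypothetical.

```python
def classify_pair(d_upper, d_lower):
    """Assign a SNP pair to one of the three classes from its D' confidence bounds."""
    if d_upper > 0.98 and d_lower > 0.7:
        return "strong_ld"          # Class I
    if d_upper < 0.9:
        return "recombination"      # Class II
    return "uninformative"          # Class III


def is_block(pair_bounds, threshold=0.05):
    """pair_bounds: (d_upper, d_lower) for every SNP pair in a contiguous segment."""
    classes = [classify_pair(u, l) for u, l in pair_bounds]
    strong = classes.count("strong_ld")
    recomb = classes.count("recombination")
    if strong + recomb == 0:
        # Assumption: with no informative pairs the ratio is undefined,
        # so the segment is not called a block.
        return False
    return recomb / (strong + recomb) < threshold


# Hypothetical bounds: 20 strong-LD pairs and 1 recombinant pair.
print(is_block([(0.99, 0.8)] * 20 + [(0.85, 0.3)]))
```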


Block View


Block comparison


Conclusions

• Clear evidence of “blocky” structure in Chromosomes

• Different block detection methods are highly concordant.

• However, boundaries defined by these methods are not sharp and we believe there is no single “true” block partition.


Block free SNP selection


What does it mean to tag SNPs?

• SNP = Single Nucleotide Polymorphism
– Caused by a mutation at a single position in the human genome, passed along through heredity
– Accounts for much of the genetic difference between humans
– Most SNPs are bi-allelic
– An estimated several million common SNPs (minor allele frequency > 10%)
• To tag = select a subset of SNPs to work with


Why do we tag SNPs?

• Disease association studies
– Goal: Find genetic factors correlated with disease
– Look for discrepancies in haplotype structure
– Statistical power: determined by sample size
– Cost: determined by overall number of SNPs typed
• This means that, to keep cost down, we reduce the number of SNPs typed
• Choose a subset of SNPs (tag SNPs) that can predict other SNPs in the region with small probability of error
– Remove redundant information


What do we know?

• SNPs physically close to one another tend to be inherited together
– This means that long stretches of the genome (sans mutational events) should be perfectly correlated if not for…
• Recombination breaks apart haplotypes and slowly erodes correlation between neighboring alleles
– Tends to blur the boundaries of LD blocks
• Since SNPs are bi-allelic, each SNP defines a partition on the population sample.
– If you are able to reconstruct this partition by using other SNPs, there would be no need to type this SNP
– For any single SNP, this reconstruction is not difficult…


Complications:

• But finding the globally minimum number of tag SNPs necessary is NP-hard
• The predictions made will not be perfect
– Correlation between neighboring tag SNPs is not as strong as correlation between neighboring (not necessarily tagged) SNPs
• Haplotype information is usually not available for technical reasons
– Need for phasing


• Tagging SNPs can be partitioned into the following three steps:
– Determining neighborhoods of LD: which SNPs can infer each other
– Tagging quality assessment: defining a quality measure that specifies how well a set of tag SNPs captures the variance observed
– Optimization: minimizing the number of tag SNPs


Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies

Halldorsson et al. (2004)


The Definition of Perfect Prediction of a SNP from a Set of SNPs


“Predict a SNP” (cont)

Site # (or SNP #): 1 2 3 4
Hap1: A G T A
Hap2: A C A C

[Figure annotations: "Nothing to Predict"; "Predicts SNP 3"; "Predicts SNP 4"; "Predicts each of SNPs 2 and 4"; "Predicts each of SNPs 2 and 3"]


A graphical notation

A G T A

A C A C

“The blue box predicts the green SNP”


Three SNPs Predicting Each Other

G T A

C A C

Only one of the three needs to be typed

Any one of them will do


A Pair of SNPs Predicting Another SNP

1 2 3 4 (SNP #)
G T A G
C T A T
G G T T

SNPs 1 and 3 together predict SNP 4

No single SNP (other than SNP 4 itself) can predict SNP 4
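The example above can be checked mechanically. The sketch below uses one reading of "predicts" that matches the edge-set definition of informativeness given later in these slides: a set of SNPs predicts a target SNP if every pair of haplotypes that differs at the target also differs at one of the chosen SNPs. Function names are hypothetical and SNP indices in the code are 0-based.

```python
from itertools import combinations

# Rows are haplotypes, columns are SNPs 1..4 (indices 0..3 in the code).
HAPS = ["GTAG", "CTAT", "GGTT"]


def distinguished_pairs(haps, snp):
    """Pairs of haplotype indices carrying different alleles at column snp."""
    return {
        (i, j)
        for i, j in combinations(range(len(haps)), 2)
        if haps[i][snp] != haps[j][snp]
    }


def predicts(haps, tag_snps, target):
    """True if every pair distinguished by target is distinguished by some tag SNP."""
    covered = set()
    for s in tag_snps:
        covered |= distinguished_pairs(haps, s)
    return distinguished_pairs(haps, target) <= covered


print(predicts(HAPS, [0, 2], 3))                        # SNPs 1 and 3 vs. SNP 4
print([predicts(HAPS, [s], 3) for s in (0, 1, 2)])      # each single SNP vs. SNP 4
```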



Finding Neighborhoods:

• Goal is to select SNPs in the sample that characterize regions of common recent ancestry that will contain conserved haplotypes

• Recent common ancestry means that there has been little time for recombination to break apart haplotypes

• Constructing fixed size neighborhoods in which to look for SNPs is not desirable because of the variability of recombination rates and historical LD across the genome

• In fact, the size of informative neighborhoods is highly variable precisely because of variable recombination rates and SNP density

• The authors avoid block building by recursively creating neighborhoods with the help of an ‘informativeness’ measure


Defining Informativeness:

• A measure of tagging quality assessment
• Assume all SNPs are bi-allelic
• Notation:
• I(s,t) = Informativeness of a SNP s with respect to a SNP t
– i, j are two haplotypes drawn at random from the uniform distribution on the set of distinct haplotype pairs.
– Note: I(s,t) = 1 implies complete predictability; I(s,t) = 0 when t is monomorphic in the population.
• I(s,t) is easily estimated through the use of the bipartite clique that defines each SNP
– We can write I(s,t) in terms of an edge set
• The definition of I is easily extended to a set of SNPs S by taking the union of edge sets
• Assumes the availability of haplotype phases
• The new measure avoids some of the difficulties traditional LD measures have experienced when applied to tagging SNP selection
– The concept of pairwise LD fails to reliably capture the higher-order dependencies implied by haplotype structure


Bounded-Width Algorithm: k Most Informative SNPs (k-MIS)

• Input: A set of n SNPs S
• Output: A subset S’ of at most k SNPs such that I(S’,S) is maximal
• In its most general form, k-MIS is NP-hard by reduction of the set cover problem to MIS
• The algorithm optimizes informativeness, although it is easily adapted for other measures
• Define the distance between two SNPs as the number of SNPs in between them
• k-MIS can be solved as long as the distance between adjacent tag SNPs is not too large


• Define
– Assignment As[i]
– S(As)
– Recursion function Iw(s, l, S(A)) = score of the most informative subset of l SNPs chosen from SNPs 1 through s such that As describes the assignment for SNP s.
• Pseudocode
• Complexity: O(nk2^w) in time and O(k2^w) in space, assuming maximal window w


Evaluation
• Algorithm evaluated by leave-one-out cross-validation
– Accumulated accuracy over all haplotypes gives a global measure of the accuracy for the given data set.
• SNPs not typed were predicted by a majority vote among all haplotypes in the training set that were identical, at the typed SNPs, to the one being inferred
– If no such haplotypes existed, the majority vote is taken among all training haplotypes that have the same allele call on all but one of the typed SNPs
– etc.
• When compared to the block-based method of Zhang:
– Presumably, the advantage is due to the cost imposed by artificially restricting the range of influence of the few SNPs chosen by block boundaries
• ‘Informativeness’ was shown to be a “good” measure
– Aligned well with the leave-one-out cross-validation results
– Extremely close to the results of optimizing for haplotype r²
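The majority-vote imputation and the leave-one-out loop described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: function names are hypothetical, haplotypes are equal-length strings, and ties in the vote are broken by whichever allele was seen first (an assumption; the slides do not say).

```python
from collections import Counter


def impute(train, typed, hap, snp):
    """Predict the allele of hap at column snp from training haplotypes.

    Votes come from training haplotypes matching hap on all typed columns;
    if there are none, one mismatch is allowed, then two, and so on.
    """
    for allowed in range(len(typed) + 1):
        votes = Counter(
            h[snp] for h in train
            if sum(h[c] != hap[c] for c in typed) <= allowed
        )
        if votes:
            return votes.most_common(1)[0][0]
    return None  # empty training set


def leave_one_out_accuracy(haps, typed):
    """Fraction of untyped alleles imputed correctly, leaving out one haplotype at a time."""
    untyped = [c for c in range(len(haps[0])) if c not in typed]
    correct = total = 0
    for k, hap in enumerate(haps):
        train = haps[:k] + haps[k + 1:]
        for c in untyped:
            total += 1
            correct += impute(train, typed, hap, c) == hap[c]
    return correct / total if total else 1.0


# Hypothetical toy sample in which SNP 1 fully determines SNP 2.
print(leave_one_out_accuracy(["00", "00", "11", "11"], typed=[0]))
```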


Premise: Informative SNP selection

• Select SNPs to use in an association study
– Would like to associate single nucleotide polymorphisms (SNPs) with disease.
• Very large number of SNPs
– Chromosome-wide studies, whole-genome scans.
– For cost effectiveness, select only a subset.
• Closely spaced SNPs are highly correlated
– It is less likely that there has been a recombination between two SNPs if they are close to each other.


SNP selection within blocks

• Zhang et al., PNAS, 2002.
• Partition chromosome into haplotype blocks.
• Zhang et al., RECOMB, 2003
• H. I. Avi-Itzhak, X. Su, F. M. De La Vega, PSB, 2003
• Sebastiani et al., PNAS, 2003
• Patil et al., PNAS, 2002.
• Within blocks one can select the SNPs that maximize entropy or diversity.
• Zhang et al., AJHG, 2003.
• Select a minimal number of SNPs with limited resources.


Block free SNP selection

• For each SNP define a neighborhood of predictive SNPs.

• Define a measure of informativeness, how well a set of SNPs predicts a target SNP.

• Maximize informativeness over all SNPs.


LD Graph Theory

The Definition of Perfect Prediction of a SNP from a Set of SNPs

Combinatorial interpretations of intermediate values of D’ and r²


Distinguishing SNPs

Haplotypes (rows) by SNPs (columns):
G T A A
G T A C
A C G G
A C A T

Columns 1 and 3 only (the first two haplotypes are not distinguished):
G A
G A
A G
A A

Columns 1 and 4: SNPs distinguishing every pair of haplotypes
G A
G C
A G
A T


Perfect Distinguishability

G T T C G A C T A T T A

G T T C G A C A A C A T

A C G T A T C T A T T A

A C G C G A C A A T T A


Predictive SNPs

Haplotypes (rows) by SNPs (columns):
G T A A
G T T C
A C G G
A C A T

[Figure labels: "Set of SNPs", "Predicts", "SNP s"]

Column subsets shown in the figure:
G A A
G T C
A G G
A A T

G T
G T
A C
A C


Perfect Prediction

G T T C G A C T A T T A

G T T C G A C A A C A T

A C G T A T C T A T T A

A C G C G A C A A T T A


The Informativeness Duality Lemma

Let M be the SNPs/haplotypes matrix, S the set of SNPs (columns), H the set of haplotypes (rows), and T a subset of S.

The following are equivalent:
(1) T perfectly predicts every SNP in S
(2) T perfectly distinguishes every pair of distinct haplotypes in H



Informativeness

• Each SNP defines a partition on the set of chromosomes
– Infer the value of each SNP in the population.
• Our goal is to infer the partitions defined by each one of the SNPs.
• Inferring the partition of every SNP allows us to infer any possible haplotype.

Haplotypes:
1 GGGAT
2 GCTGA
3 ACGAT
4 ACGAT
5 ACTGA

[Figure: SNP s, the first column, is encoded as 00111 and splits the haplotypes into {1, 2} and {3, 4, 5}.]


Informativeness

– For a SNP s and haplotypes I, J, $D_s(I,J)$ is the event that SNP s has different alleles for haplotypes I and J
– Define $I(s,t) = \Pr(D_s(I,J) \mid D_t(I,J))$
– I(s,t) can be estimated from a population sample
• For each SNP s, define a bipartite graph on the haplotypes
• Let E(s) denote the edge set

$I(s,t) = |E(s) \cap E(t)| \,/\, |E(t)|$
$I(S,t) = \left|\left(\bigcup_{s \in S} E(s)\right) \cap E(t)\right| \,/\, |E(t)|$
$I(S,T) = \sum_{t \in T} I(S,t)$

[Figure: example with s = 01111 and t = 00111, illustrating I(s,t).]
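The edge-set formulas on this slide translate directly into code. The sketch below is an illustration with hypothetical function names; each SNP is given as a string of 0/1 alleles, one character per haplotype, and I(S,t) is taken to be 0 when t is monomorphic, as noted earlier in the slides.

```python
from itertools import combinations


def edge_set(column):
    """E(s): pairs of haplotype indices with different alleles at the SNP."""
    return {
        (i, j)
        for i, j in combinations(range(len(column)), 2)
        if column[i] != column[j]
    }


def informativeness(S, t):
    """I(S,t) = |(union of E(s) for s in S) & E(t)| / |E(t)|; S is a list of SNP columns."""
    e_t = edge_set(t)
    if not e_t:
        return 0.0  # t is monomorphic
    covered = set().union(*(edge_set(s) for s in S))
    return len(covered & e_t) / len(e_t)


# The two SNPs shown on the slide.
s, t = "01111", "00111"
print(informativeness([s], t))
print(informativeness([t], s))
```

For these two columns E(t) has 6 edges, of which 3 are also in E(s), so I(s,t) is 3/6; in the other direction E(s) has 4 edges, 3 of them shared, so I(t,s) is 3/4. The measure is not symmetric.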


The Minimum Informative SNPs problem

• Given a set S of SNPs, compute
$\arg\max_{S' \subseteq S,\ |S'| \le k} I(S', S \setminus S')$
• The problem is NP-complete in general
– Reduction from set cover
• Tractable in practice
– When only nearby SNPs are used as candidates
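For very small inputs the objective can simply be maximized by enumeration, which makes the problem statement concrete. The sketch below is an exhaustive search, exponential in the number of SNPs, and is not the bounded-width algorithm discussed in these slides; function names are hypothetical, SNP indices are 0-based, and I(S', S \ S') is taken as the sum of I(S', t) over the untyped SNPs t.

```python
from itertools import combinations


def edge_set(haps, snp):
    """Pairs of haplotype indices that differ at column snp."""
    return {
        (i, j)
        for i, j in combinations(range(len(haps)), 2)
        if haps[i][snp] != haps[j][snp]
    }


def informativeness(haps, chosen, t):
    """I(chosen, t) from edge sets; 0 if t is monomorphic."""
    e_t = edge_set(haps, t)
    if not e_t:
        return 0.0
    covered = set()
    for s in chosen:
        covered |= edge_set(haps, s)
    return len(covered & e_t) / len(e_t)


def k_mis_brute_force(haps, k):
    """Best subset of at most k SNPs by exhaustive search (first best found wins ties)."""
    n = len(haps[0])
    best_subset, best_score = (), float("-inf")
    for size in range(1, k + 1):
        for subset in combinations(range(n), size):
            targets = [t for t in range(n) if t not in subset]
            score = sum(informativeness(haps, subset, t) for t in targets)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score


# The five haplotypes from the Informativeness slide.
haps = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]
print(k_mis_brute_force(haps, 1))
```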


Bounded Width MIS

• Only neighboring SNPs inform meaningfully
– SNP i can only be used to infer SNP j if there is little evidence of recombination between i and j
• I(w,S,t) = informativeness of S w.r.t. t when restricted to SNPs in S that are within the w/2-neighborhood of t.
$I(w,S,T) = \sum_{t \in T} I(w,S,t)$
• (k,w)-MIS problem:
– Given a set T, compute the k most informative SNPs S, i.e. those that maximize I(w,S,T)
• (k,w)-MIS can be computed in time O(nk2^w) and space O(k2^w)


Correct imputation: block vs. block free

[Figure: Perlegen dataset; axes "# SNPs typed" and "# correct imputations"; curves labeled "Zhang et al." and "Block Free".]


Correlation of informativeness with imputation in leave-one-out studies

[Figure: Perlegen dataset; axes "# SNPs" and "Informativeness"; labels "Leave one out" and "Block free".]


Haplotype blocks


Haplotype Blocks


Union of possible haplotype blocks


Block free – SNPs selected


Haplotype block tagging SNPs


Haplotype block tagging SNPs



Homework

G T A G

C T A T

G G T T

Find the minimum subset of SNPs that needs to be typed, i.e., from which the rest of the SNPs can be predicted.


Answer: Solution 1 = Type SNPs 1 and 3

G T A G

C T A T

G G T T

From SNPs 1 and 3 we can predict SNP 4
From SNP 3 we can predict SNP 2

Another solution (maybe better for Mercury SNPs : )

Solution 2 = Type SNPs 1 and 2.
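The homework can also be checked by exhaustive search. This is a small sketch with hypothetical function names; it treats a set of typed SNPs as predicting an untyped SNP when every haplotype pair that differs at the untyped SNP also differs at some typed SNP. Indices in the code are 0-based, so SNPs 1 and 3 appear as (0, 2).

```python
from itertools import combinations

HAPS = ["GTAG", "CTAT", "GGTT"]  # rows = haplotypes, columns = SNPs 1..4


def pairs(haps, snp):
    """Pairs of haplotype indices that differ at column snp."""
    return {
        (i, j)
        for i, j in combinations(range(len(haps)), 2)
        if haps[i][snp] != haps[j][snp]
    }


def predicts_rest(haps, typed):
    """True if the typed SNPs predict every untyped SNP."""
    covered = set()
    for s in typed:
        covered |= pairs(haps, s)
    untyped = [t for t in range(len(haps[0])) if t not in typed]
    return all(pairs(haps, t) <= covered for t in untyped)


def minimum_typed_subsets(haps):
    """All smallest subsets of SNP columns from which the rest can be predicted."""
    n = len(haps[0])
    for size in range(n + 1):
        found = [c for c in combinations(range(n), size) if predicts_rest(haps, c)]
        if found:
            return found
    return []


print(minimum_typed_subsets(HAPS))
```

On this matrix no single SNP suffices, and the two solutions named on the slide, SNPs 1 and 3 and SNPs 1 and 2, are among the size-2 subsets the search returns.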


Informativeness of a SNP

The informativeness of a SNP s with respect to a SNP t quantifies the confidence with which we can predict t from s.

Let s be a SNP and i, j be haplotypes. Let D(s, i, j) be the event that haplotypes i and j have different alleles at s.

The informativeness of s w.r.t. t is given by

I(s,t) = Prob [ D(s,i,j) | D(t,i,j) ]

where i and j are haplotypes drawn uniformly at random from the set of all distinct haplotype pairs.


The Min Informative Subset Problems

Observe that:
I(s,t) = 1 implies perfect prediction
I(s,t) = 0 implies no predictability

The Minimum Perfectly Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0 < k <= n
Output: Does there exist a subset S’ of S-T such that I(S’,T) = 1 and the size of S’ is <= k?

The k-Most Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0 < k <= n
Output: Find a subset S’ of S-T of size <= k such that I(S’,T) = MAX {I(S”,T)} over all subsets S” of S-T of size <= k.


Basic Insight: The Set Cover Problem

The Minimum Perfectly Informative Subset of SNPs Problem is NP-complete

The k-Most Informative Subset of SNPs Problem is NP-complete


Graph Theory – Min Set Cover

[Figure: bipartite graph linking sets (GIRLS) to the elements (BOYS) they cover.]

Want: Min number of sets that cover all elements

Or: Min number of GIRLS that know all the BOYS


Our Boys and Girls …

The elements: For a SNP t, the elements are the pairs of haplotypes that are distinguished by t.

The sets: Each SNP s defines a set consisting of all pairs of haplotypes that are distinguished by both s and t.

The Minimum Set Cover is the minimum subset of SNPs that perfectly predicts the entire sample.
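The set cover view suggests the standard greedy heuristic: repeatedly pick the SNP that covers the most still-uncovered haplotype pairs. The sketch below applies it to a single target SNP t. It is the textbook greedy approximation for set cover, not the exact algorithms described in these slides, and it does not guarantee a minimum cover; function names are hypothetical and indices are 0-based.

```python
from itertools import combinations


def pairs(haps, snp):
    """Pairs of haplotype indices that differ at column snp."""
    return {
        (i, j)
        for i, j in combinations(range(len(haps)), 2)
        if haps[i][snp] != haps[j][snp]
    }


def greedy_tag_snps(haps, target):
    """Greedily cover the pairs distinguished by target using the other SNPs.

    Returns the chosen SNP columns, or None if the pairs cannot all be covered.
    """
    uncovered = set(pairs(haps, target))          # the elements
    candidates = [s for s in range(len(haps[0])) if s != target]
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda s: len(pairs(haps, s) & uncovered),
                   default=None)
        gain = pairs(haps, best) & uncovered if best is not None else set()
        if not gain:
            return None                            # no candidate helps any further
        chosen.append(best)
        uncovered -= gain
    return chosen


HAPS = ["GTAG", "CTAT", "GGTT"]
print(greedy_tag_snps(HAPS, 3))  # tag SNPs for SNP 4
```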


Algorithms

ALGORITHM 1: When S is a set of SNPs in perfect LD with each other (i.e., all in a no-4-gamete block), the k-Most Informative Subset of SNPs problem can be solved exactly in O(nm) time, where n is the number of SNPs and m is the number of haplotypes.

ALGORITHM 2: When the distance in SNPs between the predicting SNP(s) and the target SNP is at most w, the (k,w)-Most Informative Subset of SNPs problem can be solved exactly in time O(nk2^w) and space O(k2^w).


Block free SNP selection
