2002wang_haplotype inference by maximum parsimony
TRANSCRIPT
-
8/3/2019 2002Wang_Haplotype Inference by Maximum Parsimony
1/17
Haplotype Inference by Maximum Parsimony
Lusheng Wang
Department of Computer Science
City University of Hong Kong
Kowloon, Hong Kong, China
E-mail: [email protected]
Ying Xu
Department of Computer Science
Peking University
Beijing, 100871, China
E-mail: [email protected]
Abstract
Motivation: Haplotypes have been attracting increasing attention because of their importance in the analysis of fine-scale molecular-genetics data. Since direct experimental sequencing of haplotypes is both time-consuming and expensive, haplotype inference methods that infer haplotypes from genotype samples are attractive alternatives.
Results: We propose a new model for haplotype inference that finds a minimum-size set of haplotypes explaining the genotype samples. Experiments on both real and simulated data support the new model. Our approach outperforms existing methods in many cases.
Availability: The software HAPAR is free for non-commercial use and is available upon request.
1 Introduction
Single nucleotide polymorphisms (SNPs) are the most frequent form of human genetic variation. The SNP sequence information from each of the two copies of a given chromosome in a diploid genome is a haplotype. Haplotype information has attracted great attention in recent years because of its importance in the analysis of fine-scale molecular-genetics data, such as mapping complex disease genes, inferring population histories, and drug design. However, current routine sequencing methods typically provide genotype information (the pair of haplotype values at each position of the two chromosome copies in a diploid organism, without the phase linking adjacent positions) rather than haplotype information. Since direct experimental sequencing of haplotypes is both time-consuming and expensive, in silico haplotyping methods are attractive alternatives.
The haplotype inference problem is as follows. We are given an n × k genotype matrix in which each cell has value 0, 1, or 2. Each of the n rows is a vector associated with the sites of interest on the two copies of a chromosome of a diploid organism. The state of any site on a single chromosome copy is either 0 or 1. A cell (i, j) in the i-th row has value 0 or 1 if the chromosome site has that state (0 or 1) on both copies, and value 2 if both states are present at this site (one on each copy). A cell is resolved if it has value 0 or 1, and ambiguous if it has value 2. The goal is to determine, at each site with value 2, which chromosome copy has value 1 and which has value 0, based on some mathematical model.
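To make the phasing step concrete, here is a small Python sketch (illustrative only, not the authors' code) that enumerates every way to phase a single genotype vector: each ambiguous site splits into a 0/1 pair, and symmetric phasings are kept only once.

```python
from itertools import product

def resolutions(genotype):
    """Enumerate all haplotype pairs (h1, h2) that explain a genotype.

    Each ambiguous site (value 2) can be phased independently, so a
    genotype with h ambiguous sites has 2**h ordered phasings; keeping
    each unordered pair once leaves 2**(h-1) distinct resolutions
    (or exactly 1 when h == 0).
    """
    amb = [j for j, v in enumerate(genotype) if v == 2]
    seen = set()
    out = []
    for bits in product((0, 1), repeat=len(amb)):
        h1, h2 = list(genotype), list(genotype)
        for j, b in zip(amb, bits):
            h1[j], h2[j] = b, 1 - b
        key = frozenset((tuple(h1), tuple(h2)))
        if key not in seen:
            seen.add(key)
            out.append((tuple(h1), tuple(h2)))
    return out

# A genotype with two ambiguous sites has two distinct resolutions.
print(resolutions((0, 2, 1, 2)))
```

This enumeration is the building block of the inference problem: every method described below amounts to choosing one such resolution per genotype under some criterion.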
In 1990, Clark first observed that genotypes from population samples are useful in reconstructing haplotypes and proposed an inference method. Since then, many algorithms and programs have been developed to solve the haplotype inference problem. The existing algorithms can be divided into four primary categories. The first category is Clark's inference-rule approach, exemplified in Clark (1990) and extended in Gusfield (2001) by trying to maximize the number of resolved vectors. The second category is the expectation-maximization (EM) method, which looks for
the set of haplotypes maximizing the posterior probability of given genotypes (Excoffier and Slatkin,
1995; Hawley and Kidd, 1995; Long et al., 1995; Chiano and Clayton, 1998). Recently, several sta-
tistical methods based on Bayesian estimators and Gibbs sampling were proposed (Stephens et al.,
2001; Niu et al., 2002; Zhang et al., 2001). Finally, adopting the no-recombination assumption,
Gusfield proposed a model that finds a set of haplotypes forming a perfect phylogeny (Gusfield,
2002; Bafna et al., 2002).
In this paper, we propose a new model for the haplotype inference problem that finds a minimum-size set of haplotypes explaining the given genotypes. Based on the new model, an
exact algorithm is designed and implemented. Experiments on both real data and simulation data
confirm the new model. Simulation results show that our approach outperforms existing methods
in most cases.
2 The new model
Given an n × k genotype matrix M, where each cell has value 0, 1, or 2, a 2n × k haplotype matrix Q that explains M is obtained as follows: (1) duplicate each row i in M to create a pair of rows i′ and i″ (for i = 1, 2, . . . , n) in Q, and (2) re-set each pair of cells Q(i′, j) = 2 and Q(i″, j) = 2 to either Q(i′, j) = 0 and Q(i″, j) = 1 or Q(i′, j) = 1 and Q(i″, j) = 0 in the resulting 2n × k matrix Q. Each row in M is called a genotype. For a genotype mi (the i-th row in M), the resulting pair of rows Qi′ and Qi″ forms a resolution of mi. We also say that Qi′ and Qi″ resolve the genotype mi. For a given genotype matrix M, if there are h 2's in M, then any of the 2^h possible haplotype matrices can explain M. Thus, without further assumptions, it is hard to infer the haplotypes.
Here we propose a new model that finds a minimum-size set of haplotypes explaining the genotype samples: given an n × k genotype matrix M, find a 2n × k haplotype matrix Q such that the number of distinct rows in Q is minimized. Here n is often referred to as the sample size. The computational problem is called the minimum number of origins problem.
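The objective can be illustrated with a tiny brute-force sketch (illustrative only; the exact algorithm of Section 3 is far more economical): try every combination of one resolution per genotype and count the distinct haplotypes used.

```python
from itertools import product

def min_origins(genotypes):
    """Minimum number of distinct haplotypes that can explain all
    genotypes, by exhaustive search over one resolution per genotype.
    Exponential; meant only to illustrate the model on tiny inputs."""
    def resolutions(g):
        # All unordered haplotype pairs explaining genotype g.
        amb = [j for j, v in enumerate(g) if v == 2]
        pairs = set()
        for bits in product((0, 1), repeat=len(amb)):
            h1, h2 = list(g), list(g)
            for j, b in zip(amb, bits):
                h1[j], h2[j] = b, 1 - b
            pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
        return sorted(pairs)

    best = None
    for choice in product(*(resolutions(g) for g in genotypes)):
        used = {h for pair in choice for h in pair}
        if best is None or len(used) < best:
            best = len(used)
    return best

# Two genotypes that can share haplotype (0, 1): the optimum needs 3 haplotypes.
print(min_origins([(2, 2), (0, 2)]))  # → 3
```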
2.1 Supports of the new model
Our new method is based on the parsimony principle, which attempts to minimize the total number of haplotypes observed in the sample. The parsimony principle is one of the most basic principles in nature and has been applied to numerous biological problems. In fact, Clark's inference algorithm, which has been used extensively in practice and shown to be useful (Clark et al., 1998; Rieder et al., 1999; Drysdale et al., 2000), can also be viewed as a kind of parsimony approach. However, to apply Clark's algorithm, there must be homozygotes or single-site heterozygotes in the sample. Our method overcomes this obstacle by adopting a global optimization goal.
The characteristics of real biological data also justify our method. The number of haplotypes existing in a large population is actually very small, whereas the genotypes derived from this limited number of haplotypes exhibit great diversity. Theoretically, given m haplotypes, there are m(m + 1)/2 possible pairs that can form genotypes. (Even if Hardy-Weinberg equilibrium is violated and some combinations are rare, the number is still quite large.) When a population is studied, the haplotype number can be taken as a fixed constant, while the number of distinct genotypes is determined by the sample size sequenced, which is relatively large. Intuitively, when the genotype sample size n is large enough, the corresponding number of haplotypes m is relatively small. Thus, our approach has a good chance of correctly recovering the whole haplotype set.
A real example strongly supports the above arguments.
Nucleotide -1023 -709 -654 -468 -367 -47 -20 46 79 252 491 523
Alleles     G/A  C/A  G/A  C/G  T/C  T/C T/C G/A C/G G/A C/T C/A
h1   A C G C T T T A C G C C   100000010000
h2   A C G G C C C G G G C C   100111101000
h3   G A A C T T T A C G C C   011000010000
h4   G C A C T T T A C G C C   001000010000
h5   G C A C T T T G C G C C   001000000000
h6   G C G C T T T G C A C A   000000000101
h7   G C G C T T T G C A T A   000000000111
h8   G C A C T T T A C A C A   001000010101
h9   G C G C T T T G C A C C   000000000100
h10  G C G C T T T G C G C C   000000000000
Table 1: 10 haplotypes of the β2AR gene. The nucleotide number is the position of the site, with the first nucleotide of the start codon being +1. The alleles are the two nucleotide possibilities at each SNP site. These data are from Drysdale et al. (2000). The original paper gave 12 haplotypes of 13 SNP sites. In this table, we list only the 10 haplotypes found in the asthmatic cohort, and one rare SNP site that showed no ambiguity in the sample is excluded. The last column of each haplotype is the representation of that haplotype. Each haplotype is a vector of SNP values. For each SNP, we take the first nucleotide in Alleles to be the wild type (represented by 0) and the second to be the mutant (represented by 1).
2.2 A real example exactly fitting the new model
β2-adrenergic receptors (β2ARs) are G protein-coupled receptors that mediate the actions of catecholamines in multiple tissues. In Drysdale et al. (2000), 13 variable sites within a span of 1.6 kb were reported in the human β2AR gene. Only 10 haplotypes were found to exist in the studied asthmatic cohort, far fewer than the 2^13 = 8192 theoretically possible combinations. 18 distinct genotypes were identified in the sample of 121 individuals. The 10 haplotypes and 18 genotypes are shown in Table 1 and Table 2, respectively. In this data set, the genotype number (18) is relatively large with respect to the haplotype number (10). Computation shows that the minimum number of haplotypes needed to generate the 18 genotypes is 10, and given the 18 genotypes as input, the set of haplotypes inferred by our algorithm (see the next section) is exactly the original set.
3 An exact algorithm
To verify our model with more real data and simulation data, we design an exact algorithm to find
the optimal solution for the minimum number of origins problem. The basic idea is very simple.
Given a set of genotypes (rows in the genotype matrix) M = {m1, m2, . . . , mn}, the algorithm works
genotype resolution value
g1 (h2,h4) 202222222000
g2 (h2,h2) 100111101000
g3 (h2,h6) 200222202202
g4 (h4,h4) 001000010000
g5 (h4,h6) 002000020202
g6 (h2,h5) 202222202000
g7 (h4,h9) 002000020200
g8 (h1,h4) 202000010000
g9 (h1,h6) 200000020202
g10 (h2,h10) 200222202000
g11 (h2,h3) 222222222000
g12 (h2,h7) 200222202222
g13 (h2,h8) 202222222202
g14 (h3,h4) 021000010000
g15 (h4,h5) 001000020000
g16 (h4,h7) 002000020222
g17 (h4,h8) 001000010202
g18 (h6,h7) 000000000121
Table 2: 18 genotypes of the β2AR gene. The second column of each genotype is the (true)
resolution to that genotype. For example, genotype g1 is resolved by haplotypes h2 and h4. The
third column is the representation of that genotype. Each genotype is a vector of SNP values and
the value of each SNP is 0, 1 or 2, standing for homozygous wild type, homozygous mutant, or
heterozygous.
Figure 1: An example of a search tree containing three levels, i.e., there are three rows
(genotypes) in the given genotype matrix. s1 = 4, s2 = 2, and s3 = 3.
as follows: (1) list all possible resolutions for each genotype; (2) construct a haplotype matrix by choosing one resolution for each genotype; and (3) examine all such haplotype matrices by branch-and-bound search and output a matrix with the minimum number of distinct haplotypes.
Consider a genotype mi (the i-th row in M). Let si be the number of possible resolutions of mi. All possible resolutions of mi are stored in a list {r_{i,1} = (h_{i,1,1}, h_{i,1,2}), . . . , r_{i,si} = (h_{i,si,1}, h_{i,si,2})}. A haplotype matrix (r_{1,j1}, r_{2,j2}, . . . , r_{n,jn}) consists of n resolutions, one for each genotype mi in M. A partial solution consists of i resolutions (r_{1,j1}, . . . , r_{i,ji}), where i ≤ n. The size of a solution is defined as the number of distinct haplotypes contained in these n resolutions. We are looking for a solution of minimum size.
We use depth-first search to exhaustively enumerate all s1 × s2 × ⋯ × sn possible solutions and find an optimal one. Figure 1 gives an example of a search tree containing s1 × s2 × ⋯ × sn possible solutions (leaves). The search tree is too big, so we use a branch-and-bound approach to reduce the search space. Assume that we have already obtained a solution of size x. If, during the search, we obtain a partial solution of size at least x, then we do not extend that partial solution further. This saves a great deal of time.
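The pruning rule can be sketched as follows (a minimal illustration, not the HAPAR implementation): a depth-first search over one resolution per genotype that abandons any partial solution whose set of distinct haplotypes already reaches the size of the best complete solution found so far.

```python
def branch_and_bound(resolution_lists, initial_bound=float("inf")):
    """Depth-first search over one resolution per genotype, pruning any
    partial solution whose distinct-haplotype count already reaches the
    best size found so far. resolution_lists[i] holds the candidate
    (h1, h2) pairs for genotype i."""
    best_size = initial_bound
    best_choice = None

    def dfs(i, used, chosen):
        nonlocal best_size, best_choice
        if len(used) >= best_size:          # bound: cannot improve
            return
        if i == len(resolution_lists):      # all genotypes resolved
            best_size, best_choice = len(used), list(chosen)
            return
        for pair in resolution_lists[i]:    # branch on each resolution
            dfs(i + 1, used | set(pair), chosen + [pair])

    dfs(0, set(), [])
    return best_size, best_choice

# Genotype 22 has two resolutions; genotype 02 has one.
lists = [[((0, 0), (1, 1)), ((0, 1), (1, 0))],
         [((0, 0), (0, 1))]]
size, choice = branch_and_bound(lists)
print(size)  # minimum number of distinct haplotypes → 3
```

Passing a good initial bound (e.g. from the greedy heuristic described below as Improvement 1) makes the pruning test fire earlier and shrinks the explored tree.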
Theoretically, the running time of the above described algorithm could still be exponential in
terms of the input size. In order to make our program efficient enough for practical use, we made
some further improvements to reduce the running time.
Improvement 1: Choosing a tight initial bound
If we have a good (small-size) solution, then we can prune many unnecessary search paths. Before the branch-and-bound search, we run a greedy algorithm to obtain a solution and use it as the initial bound. The coverage of a haplotype is the number of genotypes that the haplotype can resolve. The coverage of a resolution is the sum of the coverages of its two haplotypes. The greedy algorithm simply chooses, for each genotype, a resolution with maximum coverage to form a solution. This heuristic
algorithm can often give a solution with size close to the optimum.
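A sketch of this greedy bound (illustrative, not the HAPAR code): count, for every candidate haplotype, how many genotypes it can resolve, then pick for each genotype the resolution with the largest summed coverage.

```python
from collections import Counter

def greedy_bound(resolution_lists):
    """Greedy upper bound on the minimum number of origins: for each
    genotype pick the resolution whose two haplotypes together cover
    the most genotypes. The coverage of a haplotype is the number of
    genotypes it can resolve."""
    coverage = Counter()
    for pairs in resolution_lists:
        # Each distinct haplotype appearing in this genotype's
        # resolutions can resolve this genotype: count it once.
        for h in {h for pair in pairs for h in pair}:
            coverage[h] += 1
    chosen = [max(pairs, key=lambda p: coverage[p[0]] + coverage[p[1]])
              for pairs in resolution_lists]
    return len({h for pair in chosen for h in pair})

lists = [[((0, 0), (1, 1)), ((0, 1), (1, 0))],
         [((0, 0), (0, 1))]]
print(greedy_bound(lists))  # an upper bound on the optimum
```

The returned size is then fed to the branch-and-bound search as its initial bound.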
Improvement 2: Reducing the search space
Since we only report one optimal solution, if some possible resolutions are equally good, we can keep only one representative and discard the others. We consider only two cases in our program.
Case 1: Two resolutions of the same genotype mi both have coverage 2. In this case, none of the four haplotypes contained in the two resolutions appears in any other genotype. Thus, we keep just one of the resolutions in the resolution list of mi.
Case 2: Consider two genotypes mi and mj. Suppose mi has two resolutions (h1, h2) and (h4, h5), and mj has two resolutions (h2, h3) and (h5, h6). If h1, h3, h4 and h6 have coverage 1, and h2 and h5 have coverage 2, then we only have to keep the combination (h1, h2) and (h2, h3) and can delete the combination (h4, h5) and (h5, h6).
This idea can be extended to more sophisticated cases. However, those cases do not occur very often and may not help much in practice. After applying this trick, the number of possible resolutions and the number of candidate haplotypes (haplotypes contained in resolutions) are dramatically reduced. For example, when we run our program on the ACE data containing 11 individuals and 52 SNPs (see Section 4 for details), the number of candidate haplotypes is only 483, far fewer than the total of 2^52 possible haplotypes.
We implemented the above algorithm in C++. The program, HAPAR, is available upon request. It takes a file containing genotype data as input and outputs the resolved haplotypes. With all these improvements, our program is fairly efficient. For example, HAPAR takes only 2.25 minutes on our computer to process the ACE data. In contrast, PHASE, a program based on Gibbs sampling, takes 12 minutes to process the ACE data in the same environment. Programs based on the EM method cannot handle this data set at all.
4 Results
We ran our program HAPAR on a large amount of real biological data as well as simulated data to demonstrate its performance. We also compared our program with four well-known existing programs: HAPINFERX, Emdecoder, PHASE, and Haplotyper. HAPINFERX is an implementation of Clark's algorithm (Clark, 1990) and was kindly provided by A. G. Clark. Emdecoder uses an EM algorithm and was downloaded from J. Liu's homepage (http://www.people.fas.harvard.edu/junliu/). PHASE is a Gibbs sampler for haplotyping (Stephens et al., 2001) and was downloaded from M. Stephens' homepage. Haplotyper is a Bayesian method and was downloaded from J. Liu's homepage.
We will discuss different sets of data in the following subsections.
program     result haplotype number  correct haplotype number  recovery rate  accuracy rate
HAPAR       7        7        0.54          1
Haplotyper  13       7 or 9   0.54 or 0.69  0.54 or 0.69
HAPINFERX   7        7        0.54          1
PHASE       13       7        0.54          0.54
Table 3: Comparison of the performance of four programs on the ACE data set. Haplotyper has two different accuracy/recovery rates because it gives different results in multiple runs.
4.1 Angiotensin converting enzyme
Angiotensin converting enzyme (encoded by the gene DCP1, also known as ACE) catalyses the conversion of angiotensin I to the physiologically active peptide angiotensin II. Because of its key function in the renin-angiotensin system, many association studies have been performed with DCP1. Rieder et al. (1999) completed the genomic sequencing of DCP1 from 11 individuals and identified 78 varying sites in 22 chromosomes. 52 of the 78 varying sites are non-unique polymorphic sites, and complete data on these 52 biallelic markers are available. 13 distinct haplotypes were resolved from the sample.
We ran the four programs HAPAR, Haplotyper, HAPINFERX and PHASE on the ACE data set (11 individuals / 10 genotypes, 52 SNPs, 13 haplotypes). (Emdecoder is limited in the number of SNPs it can handle and thus is excluded.) The results are summarized in Table 3. The recovery rate is defined as

    recovery rate = (no. of correctly detected haplotypes) / (total no. of true haplotypes).

The accuracy rate is defined as

    accuracy rate = (no. of correctly detected haplotypes) / (total no. of inferred haplotypes).
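In code, the two rates amount to a set intersection between the true and inferred haplotype sets (the haplotype labels below are hypothetical placeholders):

```python
def rates(true_haplotypes, inferred_haplotypes):
    """Recovery rate and accuracy rate for a haplotyping result."""
    true_set = set(true_haplotypes)
    inferred_set = set(inferred_haplotypes)
    correct = len(true_set & inferred_set)   # correctly detected haplotypes
    recovery = correct / len(true_set)       # fraction of true set found
    accuracy = correct / len(inferred_set)   # fraction of output that is right
    return recovery, accuracy

# 7 of 13 true haplotypes found, 7 reported: recovery 7/13 ≈ 0.54, accuracy 1.0.
true = [f"h{i}" for i in range(13)]
print(rates(true, true[:7]))
```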
Most programs recover 7 of the 13 original haplotypes. The low performance is due to the relatively small number of distinct genotypes. In fact, 3 genotypes are resolved by 6 haplotypes, each of which appears only once, so there is not enough information for any of the 4 programs to resolve those 3 genotypes successfully. In some runs, Haplotyper guesses one resolution correctly (so that 9 haplotypes are correctly recovered), but it cannot produce a consistent result. Note that unlike the statistical methods (e.g. Haplotyper and PHASE), HAPAR and HAPINFERX do not try to guess resolutions for genotypes when there is not enough information. Therefore, both of them report only 7 haplotypes, with a 100% accuracy rate. In contrast, Haplotyper and PHASE report 13 haplotypes, some of which are inaccurate.
4.2 Simulations on random data sets
In this subsection, we use simulated data to evaluate the different programs. First, m haplotypes, each containing k SNP sites, were randomly generated. Then a sample of n genotypes was generated, each formed by randomly picking two haplotypes and conflating them. (Here n is the sample size, which may be larger than the number of distinct genotypes.) A haplotyping program resolved those genotypes and output the inferred haplotypes, which were then compared with the original haplotypes to evaluate the performance of the program.
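The sampling scheme just described can be sketched in a few lines (the fixed seed is our addition, for reproducibility; it is not part of the original protocol):

```python
import random

def simulate(m, k, n, seed=0):
    """Generate m random haplotypes of k SNPs, then a sample of n
    genotypes, each formed by conflating two randomly chosen haplotypes:
    matching sites keep their value, mismatching sites become 2."""
    rng = random.Random(seed)
    haplotypes = [tuple(rng.randint(0, 1) for _ in range(k))
                  for _ in range(m)]
    genotypes = []
    for _ in range(n):
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        genotypes.append(tuple(a if a == b else 2 for a, b in zip(h1, h2)))
    return haplotypes, genotypes

haps, gens = simulate(m=10, k=10, n=24)
```

The inferred haplotypes returned by a program are then compared against `haps` to compute the recovery and accuracy rates.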
An important question is how to generate random haplotypes. Since some algorithms impose particular assumptions on the evolutionary model, simulation data sets generated under a given model will favor some programs while impairing others. As pointed out in Niu et al. (2002), "the performance of different in silico haplotyping methods is subtler than it appears to be - that is, the model underlying the simulation study can greatly affect the conclusion." We decided not to adopt any particular evolutionary model when generating haplotypes. We simply set every bit in every haplotype randomly and independently to 0 or 1 with some probability. (To test the applicability of our method under different conditions, we also ran simulations on data sets favoring particular evolutionary models, which are discussed later in this paper.)
The performance of a haplotype inference program has two aspects: (1) the accuracy rate and (2) the recovery rate. Figure 2 (a) and Figure 2 (b) show the accuracy rate and recovery rate of the different programs under the parameter setting m = 10 and k = 10. The figures show that for HAPAR, Haplotyper, Emdecoder and PHASE, the two measurements do not differ much, whereas HAPINFERX has an accuracy rate better than its recovery rate (this is because HAPINFERX leaves some orphan genotypes when they are not resolvable by existing haplotypes). This is also supported by our other simulation results. Therefore, in the rest of this section, we use the arithmetic mean of the two values, (accuracy rate + recovery rate) × 0.5, as the measure of performance.
Also, in the rest of the section, 100 data sets were generated for each parameter setting, and
performance was calculated by taking the average of the performance values in the 100 runs.
We conducted simulations with different parameter settings and compared the performance of the five programs HAPAR, HAPINFERX, Haplotyper, Emdecoder and PHASE. Two sets of parameters were used: (1) m = 10, k = 10, n ranging from 5 to 24 (see Figure 2 (a) to Figure 2 (c)), and (2) m = 18, k = 8, n ranging from 9 to 40 (see Figure 2 (d)). As the figures show, HAPAR outperforms the other four programs in almost all cases. When n is as small as m/2 (every haplotype appears in only one genotype), any resolution combination is an optimal solution for the minimum number of origins problem, so HAPAR cannot identify the correct resolutions. In this case, the other programs also perform poorly due to the lack of information. When n becomes larger, all five programs show an increase in performance, and HAPAR shows an obvious advantage over the others. When n is large enough (n = 12 for m = 10, k = 10; and n = 30
for m = 18, k = 8), HAPAR, Haplotyper, Emdecoder and PHASE can all recover the original haplotypes with high probability (both the accuracy rate and the recovery rate are greater than 0.9).
[Figure 2 plots omitted: (a) accuracy rate (m = 10, k = 10); (b) recovery rate (m = 10, k = 10); (c) performance (m = 10, k = 10); (d) performance (m = 18, k = 8). Each panel plots the measure against sample size for HAPAR, HAPINFERX, Haplotyper, Emdecoder and PHASE.]
Figure 2: Comparison of the performance of five programs on random data. In Figure 2 (a) to (c) there are originally 10 haplotypes of 10 SNPs; in Figure 2 (d) there are originally 18 haplotypes of 8 SNPs. Performance in Figure 2 (c) and (d) is the arithmetic mean of the accuracy rate and the recovery rate. For each parameter setting, 100 data sets were generated, and performance, accuracy rate and recovery rate were calculated by averaging over the 100 runs.
4.3 Simulations on maize data set
The maize data were used in Wang and Miao (2002) as one of the benchmarks to evaluate haplotyping programs. At locus 14 of the maize profile, containing 17 SNP sites, 4 haplotypes (with
sample size  HAPAR  Haplotyper  Emdecoder  HAPINFERX  PHASE
3    0.60  0.55  0.38  0.38   0.56
4    1     1     1     0.625  1
7    1     1     1     0.625  1
10   1     1     1     0.9    1
Table 4: Comparison of performance of five programs on Maize data set (m = 4, k = 17).
frequency 9, 17, 8 and 1) were identified. We randomly generated a sample of n genotypes from these haplotypes, each formed by randomly picking 2 haplotypes according to their frequencies and conflating them. The results are summarized in Table 4. In our experiments, HAPAR, Haplotyper, Emdecoder and PHASE all recover the haplotypes correctly when the sample size n is greater than or equal to 4, while HAPINFERX does not produce a satisfying result until the sample size reaches 10. When the sample size is 3, HAPAR behaves slightly better than the other programs, but none of them produces satisfying results.
4.4 Simulations on haplotypes forming a perfect phylogeny
Coalescence is one of the evolutionary models most commonly used in population genetics. The coalescent model of haplotype evolution says that without recombination, haplotypes fit into a perfect phylogeny (Gusfield, 2001). Jin et al. (1999) found a 565 bp chromosome 21 region near the MX1 gene, containing 12 polymorphic sites. This region is unaffected by recombination and recurrent mutation. The genotypes determined from the sequence data of 354 human individuals were resolved into 10 haplotypes, whose evolutionary history can be modeled by a perfect phylogeny. These 10 haplotypes were used to generate genotype samples of different sizes for evaluation.
The performance of the five programs on the MX1 data is compared in Figure 3. PHASE performs better than the other programs because it incorporates the coalescent model, which fits this data set. Among the remaining four programs, which do not adopt the coalescent assumption, the performance of HAPAR is the best when the sample size is relatively large (n > 20). Note that the performance of the other four programs levels off once the sample size exceeds 20, whereas HAPAR continues to improve as the sample size increases. When the sample size is as large as 40, HAPAR even beats PHASE. This supports our hypothesis that when the sample size is large enough, our algorithm is likely to give accurate results.
4.5 Simulations on haplotypes with recombination hotspots
To verify the robustness of our method in the presence of recombination events, we conducted simulations using the data on chromosome 5q31 studied in Daly et al. (2001). They reported a
[Figure 3 plot omitted: performance against sample size for HAPAR, HAPINFERX, Haplotyper, Emdecoder and PHASE.]
Figure 3: Comparison of performance of five programs on MX1 data set (m = 10, k = 12).
high-resolution analysis of the haplotype structure across 500kb on chromosome 5q31 using 103
SNPs in a European-derived population. The result showed a picture of discrete haplotype blocks
(of tens to thousands of kilobases), each with limited diversity. The discrete blocks are separated
by intervals in which several independent historical recombination events seem to have occurred,
giving rise to greater haplotype diversity for regions spanning the blocks.
We use the haplotypes from block 9 (6 sites, 4 haplotypes) and block 10 (7 sites, 6 of which are complete, 3 haplotypes). There is a recombination hotspot between the two blocks, estimated to have a haplotype exchange rate of 27%. 9 new haplotypes with 12 sites were generated by connecting haplotypes from block 9 and block 10 that were observed to share common recombination events, and their frequencies were normalized. Genotype samples of different sizes were randomly generated and used for evaluation.
According to the experimental results shown in Figure 4, Emdecoder, HAPAR, Haplotyper and PHASE have similar performance on these data sets. A curious phenomenon is that their performance increases very slowly after it reaches approximately 0.9. This is due to the confusion caused by recombination. Take a sequence of two SNP sites as an example. There are 4 haplotypes: 00, 01, 10 and 11. Both the combinations (00, 11) and (01, 10) result in the genotype 22. The haplotype 11 is rare, since both of its sites are mutant, while haplotype 00 is the most common. Given their frequencies, 11 is difficult to observe, and even if it appears, it probably appears in genotype 22, which will likely be resolved as (01, 10). Unless the sample size is large enough that a genotype formed by (11, 01) or (11, 10) is sequenced, the existence of haplotype 11 will not be detected by haplotype inference programs. That is why the performance curves in Figure 4 increase slowly around 0.9. As mentioned above, if the sample size is large enough, the chance that such rare haplotypes are recovered is good. In this data set, when the sample size is raised to 50, the performance of HAPAR reaches 0.96.
[Figure 4 plot omitted: performance against sample size for HAPAR, HAPINFERX, Haplotyper, Emdecoder and PHASE.]
Figure 4: Comparison of performance of five programs on 5q31 data set (m = 9, k = 12).
4.6 Influence of the number of haplotypes and the number of SNP sites on
performance
We also examined the influence of m, the number of haplotypes, and k, the number of SNP sites,
on the performance of our algorithm. Original haplotype sets and then genotypes were generated
randomly.
Figure 5 (a) to Figure 5 (c) illustrate the influence of the number of original haplotypes m on performance when k is fixed. In the three figures, we fix k = 6, 10 and 12, respectively; in each figure, performance under three different numbers of haplotypes, 8, 10 and 12, is compared. All three figures show that, for the same sample size n and the same number of SNP sites k, performance improves as the number of haplotypes m decreases, which is consistent with intuition. In each figure, the three performance curves have similar shapes. At first, performance increases quickly as the sample size n increases; once the sample size is large enough and performance is fairly high (above 0.9), the pace of increase slows down. (This also holds for all the data sets above.) Moreover, comparing the three figures, we can see that the curves in Figure 5 (c) are steeper than those in Figure 5 (a). That is, performance increases faster when the number of SNP sites is larger. This relationship is illustrated more clearly in Figure 5 (d) to Figure 5 (f).
Figure 5 (d) to Figure 5 (f) show the influence of different numbers of SNP sites on performance when m is fixed. In the three figures, we fix m = 8, 10 and 12, respectively; in each figure, performance under four different numbers of SNP sites, 6, 10, 12 and 15, is compared. The larger the number of SNP sites, the faster performance increases with the sample size. When the sample size is reasonably large, with the same sample size n and the same original haplotype number m, HAPAR is more likely to detect the correct haplotypes when the number of
SNP sites k is larger. This is because when k is small, it is more likely that the four patterns 00, 01,
10 and 11 exist simultaneously in the original haplotype set, which causes confusion to HAPAR.
5 Discussion
Haplotypes are the raw material of many genetic analyses, but the rapid growth of high-throughput genotyping techniques has not been matched by similar advances in cheap experimental haplotype determination. In this paper, we have introduced a new method for this problem. Experiments on both real and simulated data support the new method. Our method imposes no assumption on the population's evolutionary history. While integrating a particular evolutionary model into a haplotyping algorithm may favor a special class of data, it will fail to infer haplotypes that do not fit the model. In contrast, our method has the widest applicability. Experiments show that HAPAR is comparable to the best haplotyping programs on almost all data sets, and when the simulation data are generated fully at random, it outperforms the other programs.
Our new method raises some interesting topics for further investigation. Does a polynomial-time algorithm exist for the minimum number of origins problem? (We conjecture that the problem is NP-hard.) If no polynomial-time algorithm exists, an efficient approximation algorithm with a reasonable performance guarantee is also desirable. Moreover, it would be interesting to design an efficient algorithm to replace the branch-and-bound search procedure: given a set of genotypes and a set of candidate haplotypes, find a minimum-size subset of the candidate haplotypes that resolves all genotypes. Although the size of the candidate haplotype set can be exponential in the original problem size in the worst case, it is acceptable in many practical examples. Therefore, we are interested in accelerating the search procedure.
Another interesting problem is under what conditions the parsimony algorithm can infer
all haplotypes correctly with high probability. All simulations show that the performance of HAPAR increases monotonically as the sample size grows, and we conjecture that when the sample size is large enough, our algorithm recovers exactly the original haplotype set. With the number of original haplotypes m and the number of SNP sites k fixed, we consider a sample size able to recover the haplotypes with high probability if the average performance over 100 runs under that parameter setting is greater than 0.9; the minimum such sample size is referred to as the threshold sample size. Simulations have been conducted to examine the relationship between the threshold sample size and the number of haplotypes m. The result for k = 10 is summarized in Table 5. When m increases, the threshold sample size increases. Figure 5 shows that with the same m value, the threshold sample size decreases when k increases. Further statistical analysis may help to explain this relationship.
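The branch-and-bound subproblem mentioned earlier, choosing a minimum-size subset of candidate haplotypes that resolves every genotype, can be sketched as a brute-force search over subsets of increasing size. The 0/1/2 genotype encoding and all names below are illustrative assumptions; this exhaustive search is far slower than HAPAR's actual procedure.

```python
from itertools import combinations

def explains(genotype, h1, h2):
    """True if the haplotype pair (h1, h2) is consistent with the genotype.

    Encoding (assumed): 0/1 = homozygous site, 2 = heterozygous site.
    """
    for g, a, b in zip(genotype, h1, h2):
        if g == 2:
            if a == b:           # heterozygous site needs differing alleles
                return False
        elif not (a == b == g):  # homozygous site needs two matching alleles
            return False
    return True

def min_haplotype_subset(genotypes, candidates):
    """Smallest subset of candidates resolving all genotypes (brute force)."""
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(any(explains(g, a, b) for a in subset for b in subset)
                   for g in genotypes):
                return list(subset)
    return None  # no subset of the candidates resolves every genotype

genotypes = [(2, 2, 0), (0, 2, 0), (1, 1, 0)]
candidates = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)]
print(min_haplotype_subset(genotypes, candidates))  # a 3-haplotype subset
```

Trying subsets in order of increasing size guarantees the first hit is minimum-size, which is exactly the parsimony objective; the exponential number of subsets is what makes a faster replacement for the branch-and-bound step desirable.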
[Figure 5 here: six panels plotting performance (0 to 1) against sample size. Panels (a)-(c) compare 8, 10 and 12 haplotypes at k = 6, 10 and 12, respectively; panels (d)-(f) compare 6, 10, 12 and 15 SNP sites at m = 8, 10 and 12, respectively.]

Figure 5: Comparison of performance under different numbers of haplotypes and numbers of SNP sites. (a) to (c) illustrate the influence of different m on performance. (d) to (f) illustrate the influence of different k on performance.

original haplotype number m   5   6   8   10   12   14   15
threshold sample size         7   7   9   11   14   17   18

Table 5: Relationship between the threshold sample size and the original haplotype number (k = 10).

Acknowledgements

We thank A. Clark for kindly giving us HAPINFERX. We also thank J. Liu and M. Stephens for providing software on their web pages. The work is fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 1047/01E].
References
Bafna,V., Gusfield,D., Lancia,G. and Yooseph,S. (2002) Haplotyping as perfect phylogeny: A direct approach. Technical Report UCDavis CSE-2002-21.
Chiano,M. and Clayton,D. (1998) Fine genetic mapping using haplotype analysis and the missing data problem. Am. J. Hum. Genet., 62, 55-60.
Clark,A. (1990) Inference of haplotypes from PCR-amplified samples of diploid populations. Mol.
Biol. Evol., 7, 111-122.
Clark,A., Weiss,K., Nickerson,D., Taylor,S., Buchanan,A., Stengard,J., Salomaa,V., Vartiainen,E.,
Perola,M., Boerwinkle,E. and Sing,C. (1998) Haplotype structure and population genetic inferences
from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595-612.
Drysdale,C., McGraw,D., Stack,C., Stephens,J., Judson,R., Nandabalan,K., Arnold,K., Ruano,G. and Liggett,S. (2000) Complex promoter and coding region β2-adrenergic receptor haplotypes alter receptor expression and predict in vivo responsiveness. Proc. Natl. Acad. Sci. USA, 97, 10483-10488.
Daly,M., Rioux,J., Schaffner,S., Hudson,T. and Lander,E. (2001) High-resolution haplotype structure in the human genome. Nature Genetics, 29, 229-232.
Excoffier,L. and Slatkin,M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol., 12, 921-927.
Gusfield,D. (2001) Inference of haplotypes from samples of diploid populations: complexity and
algorithms. J. Comput. Biol., 8, 305-323.
Gusfield,D. (2002) Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. Proceedings of RECOMB 2002.
Hawley,M. and Kidd,K. (1995) HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered., 86, 409-411.
Jin,L., Underhill,P., Doctor,V., Davis,R., Shen,P., Cavalli-Sforza,L. and Oefner,P. (1999) Distribution of haplotypes from a chromosome 21 region distinguishes multiple prehistoric human migrations. Proc. Natl. Acad. Sci. USA, 96, 3796-3800.
Long,J., Williams,R. and Urbanek,M. (1995) An E-M algorithm and testing strategy for multi-locus
haplotypes. Am. J. Hum. Genet., 56, 799-810.
Niu,T., Qin,Z., Xu,X. and Liu,J. (2002) Bayesian haplotype inference for multiple linked single-
nucleotide polymorphisms. Am. J. Hum. Genet., 70, 157-169.
Rieder,M., Taylor,S., Clark,A. and Nickerson,D. (1999) Sequence variation in the human angiotensin converting enzyme. Nature Genetics, 22, 59-62.
Stephens,M., Smith,N. and Donnelly,P. (2001) A new statistical method for haplotype reconstruction. Am. J. Hum. Genet., 68, 978-989.
Wang,X. and Miao,J. (2002) In-silico haplotyping: state-of-the-art. UDEL technical reports.
Zhang,J., Vingron,M. and Hoehe,M. (2001) On haplotype reconstruction for diploid populations.
EURANDOM report 2001-026.