[ieee 2008 canadian conference on electrical and computer engineering - ccece - niagara falls, on,...



[IEEE 2008 Canadian Conference on Electrical and Computer Engineering - CCECE - Niagara Falls, ON, Canada (2008.05.4-2008.05.7)]

VARIABILITY OF HAPLOTYPE PHASE AND ITS EFFECT ON GENETIC ANALYSIS

*Mohammed Uddin1,2, Mitch Sturge1,2, Courtenay Griffin2, Steve Benteau2, Proton Rahman2

1Department of Computer Science, Memorial University of Newfoundland

2Population Therapeutic Research Group (PTRG), Faculty of Medicine, Memorial University of Newfoundland

*Author Correspondence: [email protected]

ABSTRACT

One of the challenging problems in understanding the human genome is phasing haplotype information from genotype data. Among the many approaches proposed for the genotype phasing problem, the PHASE and Expectation-Maximization (EM) algorithms are the most commonly used methods to obtain haplotype phase. This study applied both methods to two important genetic regions, Interleukin-1 (IL-1) and Cystic Fibrosis (CF), with 166 single nucleotide polymorphisms (SNPs) spanning 51 kb and 238 SNPs spanning 188 kb respectively, genotyped in 90 unrelated individuals. The phased haplotypes from the PHASE and EM algorithms were compared to observe the level of difference between them. The analyses found a significant difference in allele frequency (p < 0.001 for IL-1 and p < 0.05 for CF). In contrast to other comparison studies, the analysis was extended to the variability in common haplotype blocks. Of the two algorithms, PHASE was more robust than the EM algorithm.

1. INTRODUCTION

In understanding the human genome it is important to know the underlying structure of the haplotype map. The identification of a large number of SNPs increases the need for haplotype analysis [8]. Constructing a haplotype map of the human genome is challenging because accurately phasing haplotype information from genotype data is difficult. Phased haplotype data is an important factor in identifying disease-associated genetic regions. In disease association studies, haplotypes reveal more significant genetic variation than single-marker associations. However, the power of any haplotype association study depends on the accuracy of the phased haplotypes [9]. In terms of computational complexity, phasing haplotypes from genotype data is an NP-hard problem (no polynomial-time algorithm is known) [7]. To overcome this problem, many statistical methods have been developed that give suboptimal results [9].

Different methods have been proposed to infer haplotype phase from genotype data, and each has its pros and cons with respect to accuracy. The two leading phasing algorithms are PHASE and the EM algorithm. Fallin and Schork reported on the performance of the EM algorithm and its estimation accuracy for biallelic loci [5]. Their comparison suggests that one of the most influential effects on estimation accuracy is the distribution of alleles. A more comprehensive performance comparison of four phasing algorithms was done with a data set of 15 SNPs [1]. The author noted that performance on this small set of SNPs may not represent the true performance of the algorithms. The comparison was limited to linkage disequilibrium (LD) differences, haplotype accuracy and error estimation. In another study, five phasing algorithms were compared to quantify their accuracy [9]. That study focused on phasing performance for genome-scale data and gave an in-depth analysis of the phasing algorithms and their accuracy. The five methods were evaluated by estimating the correlation coefficient LD measure r². The PHASE algorithm was found to be the most robust of the methods compared [9]. This algorithm is a Bayesian approach that applies coalescent-based models to improve phasing accuracy [12], and it models the decay of LD with distance. Its running time is higher than that of the EM-based methods. The EM algorithm has multiple variants [1], all of which share two parts: the expectation step and the maximization step. One of the most commonly used and reliable EM algorithms was proposed by David Clayton and implemented in the SNPHAP application [3].

Previous works compared phasing accuracy with an in-depth investigation of minor allele frequency and LD differences. In this study, we not only analyze the performance of the PHASE and SNPHAP algorithms through minor allele frequency distribution and LD differences, but also quantify the allele distribution difference and the common haplotype block variation in order to see the haplotype phasing difference between these two methods.

978-1-4244-1643-1/08/$25.00 ©2008 IEEE
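The expectation and maximization steps shared by the EM variants mentioned in the Introduction can be illustrated on the smallest ambiguous case, two biallelic SNPs. The sketch below is not SNPHAP's implementation; the function name `em_two_snp`, the 0/1/2 genotype coding and the fixed iteration count are assumptions made for illustration. Only individuals heterozygous at both SNPs have ambiguous phase, so the E-step splits each of them between the two possible haplotype pairs in proportion to the current frequency estimates, and the M-step re-estimates the frequencies from the resulting expected counts.

```python
HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def em_two_snp(genotypes, n_iter=50):
    """Estimate two-SNP haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), each the number of copies (0, 1 or 2)
    of allele 1 at SNP 1 and SNP 2. Hypothetical helper for illustration.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    freqs = {h: 0.25 for h in HAPS}  # uniform starting point
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPS}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: double heterozygote is either 00/11 or 01/10.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # At most one heterozygous SNP: the phase is unambiguous.
                counts[(g1 // 2, g2 // 2)] += 1.0
                counts[((g1 + 1) // 2, (g2 + 1) // 2)] += 1.0
        # M-step: new frequencies are the expected haplotype proportions.
        freqs = {h: counts[h] / n_chrom for h in HAPS}
    return freqs


# Toy data: homozygotes support haplotypes 00 and 11, so the two
# double heterozygotes are resolved almost entirely as 00/11.
example = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_two_snp(example))
```

With many SNPs the number of candidate haplotypes grows exponentially, which is why practical implementations add further strategies on top of this basic iteration.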

2. MATERIAL AND METHODS

2.1. Data

In this study, we chose two important genetic regions for analysis. The association of the Interleukin-1 and Cystic Fibrosis genes with susceptibility to various diseases is well established [4, 11]. Genotype data for the IL-1 and CF regions was taken from the International HapMap Consortium [8]. Both data sets were genotyped in a set of 90 individuals from Utah with ancestry from northern and western Europe. The Interleukin-1 data consists of 166 SNPs on Chromosome 2 spanning a 51 kb region. The Cystic Fibrosis data consists of 238 SNPs on Chromosome 7 spanning a 188 kb region.

2.2. Discrepancy Measures of Phased Haplotypes

The mean squared error (MSE) and mean absolute error (MAE) were computed for the SNP minor allele frequencies. MSE is a classical index for comparing frequency estimates. The MSE was calculated using

$$\mathrm{MSE} = \frac{1}{S} \sum_{i=1}^{S} \left( fr_i^{\mathrm{PHASE}} - fr_i^{\mathrm{SNPHAP}} \right)^2$$

The MAE was calculated to capture the absolute difference by using

$$\mathrm{MAE} = \frac{1}{S} \sum_{i=1}^{S} \left| fr_i^{\mathrm{PHASE}} - fr_i^{\mathrm{SNPHAP}} \right|$$

In both equations fr_i is the minor allele frequency of SNP i, with the superscript indicating the algorithm whose haplotypes it was computed from, and S is the number of SNPs in the dataset. The mean and standard deviation were calculated using SPSS. The significance of differences in the number of minor alleles across frequency ranges was computed with chi-square tests. The two prominent measures of LD are D' and r². The Haploview application was used to calculate the LD measures and to compute the common haplotype blocks [2]. The details of the linkage and haplotype block computation are given in the results section.
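The two discrepancy measures can be sketched in a few lines, assuming the usual definitions: the mean of the squared, and of the absolute, per-SNP differences between the minor allele frequencies obtained from the two sets of phased haplotypes. The function name `mse_mae` and the toy frequencies are hypothetical; they are not values from the IL-1 or CF data.

```python
def mse_mae(freq_a, freq_b):
    """Return (MSE, MAE) between two equal-length lists of allele frequencies."""
    if len(freq_a) != len(freq_b) or not freq_a:
        raise ValueError("need two non-empty lists of equal length")
    s = len(freq_a)  # number of SNPs, S in the text
    diffs = [a - b for a, b in zip(freq_a, freq_b)]
    mse = sum(d * d for d in diffs) / s
    mae = sum(abs(d) for d in diffs) / s
    return mse, mae


# Toy minor allele frequencies for four SNPs under two phasing methods.
phase_maf = [0.10, 0.25, 0.00, 0.40]
snphap_maf = [0.12, 0.25, 0.02, 0.37]
print(mse_mae(phase_maf, snphap_maf))
```

Because the MSE squares each difference, a few SNPs with large disagreements raise it more than they raise the MAE, so the two measures need not rank datasets in the same order.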

3. RESULTS AND DISCUSSION

The PHASE and SNPHAP applications were used to obtain the haplotypes for each dataset. Parameter settings for PHASE (v2.1) were 100 burn-in and 100 runs, while default settings were used for SNPHAP. The analysis carried out in this study used only phased haplotypes, so each dataset (IL-1 and CF) produced two haplotype sets, one per algorithm. For each individual, the most probable pair of haplotypes was used in the analysis. Prior to the analysis we checked each SNP for deviation from Hardy-Weinberg Equilibrium (HWE) with a cut-off of p < 0.001.

3.1. Minor Allele Frequency Deviation

To assess the inconsistency of haplotype phases between the two algorithms we first observed the variability of allele frequency differences. The ratio of minor alleles is higher in the IL-1 data than in the CF data. The mean and the standard deviation of minor allele frequency for both methods did not show any significant difference between IL-1 and CF data (Table 1). The MSE captures the overall difference in minor allele frequencies between PHASE and SNPHAP for a particular dataset. The additional error measure is the MAE, which captures the absolute difference or bias of each minor allele frequency within a dataset [5]. In Table 1, the IL-1 data, which has the higher ratio of minor alleles, shows the lower MAE (0.011 versus 0.017), while the MSE values of the two datasets are close (9.25E-04 versus 7.88E-04).

Data             Algorithm  Mean   SD     MSE       MAE
------------------------------------------------------------
Interleukin-1    PHASE      0.193  0.155  9.25E-04  0.011
Interleukin-1    SNPHAP     0.202  0.155
Cystic Fibrosis  PHASE      0.139  0.165  7.88E-04  0.017
Cystic Fibrosis  SNPHAP     0.137  0.159

Table 1: Mean, Standard Deviation (SD), Mean Squared Error (MSE) and Mean Absolute Error (MAE) of two algorithms in IL-1 and CF data. MSE and MAE are computed between the two algorithms, so there is one value per dataset.

The ratio of minor alleles for each data set was computed in three different ranges of allele frequencies (Table 2). The proportion of markers with minor allele frequency <= 0.05 is 28% and 26% in the IL-1 data, and 52% and 44% in the CF data, for PHASE and SNPHAP haplotypes respectively. The distribution of minor allele frequency for the two methods showed significant differences in both data sets (p < 0.001 in IL-1, p < 0.05 in CF).

No. (proportion) of SNPs with minor allele frequency q

Data (algorithm)           q = 0       0 < q <= 0.05  q > 0.05    p-value
---------------------------------------------------------------------------
Interleukin-1 (PHASE)      46 (0.28)   0 (0.0)        120 (0.72)
Interleukin-1 (SNPHAP)     27 (0.16)   17 (0.10)      122 (0.74)  < 0.001
Cystic Fibrosis (PHASE)    123 (0.52)  0 (0.0)        115 (0.48)
Cystic Fibrosis (SNPHAP)   100 (0.42)  5 (0.02)       133 (0.56)  < 0.05

Table 2: SNPs (and ratio) for each range of minor allele frequencies for both data sets.
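The p-values in Table 2 can be approximately reproduced with a Pearson chi-square test on each 2 x 3 table of counts (algorithm by frequency range). This is a sketch under the assumption that this is the test that was applied; the paper only says chi-square tests were used. The function name `chi_square_2x3` is hypothetical. For 2 degrees of freedom the chi-square survival function is exactly exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_2x3(table):
    """Pearson chi-square test for a 2 x 3 contingency table.

    Returns (statistic, p_value) with (2-1)*(3-1) = 2 degrees of freedom.
    """
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2 x 3 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # Chi-square with 2 df is exponential with mean 2: P(X > x) = exp(-x/2).
    return stat, math.exp(-stat / 2.0)


# Counts from Table 2: rows are PHASE and SNPHAP,
# columns are q = 0, 0 < q <= 0.05, q > 0.05.
il1 = [[46, 0, 120], [27, 17, 122]]
cf = [[123, 0, 115], [100, 5, 133]]
print(chi_square_2x3(il1))
print(chi_square_2x3(cf))
```

On these counts the statistic is about 21.96 for IL-1 and about 8.68 for CF, which is consistent with the reported p < 0.001 and p < 0.05. The middle column of the CF table has expected counts of 2.5, below the common rule-of-thumb minimum of 5, so that p-value should be read with some caution.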


Fig 1. Physical distance (X-axis) and estimated pairwise r² (Y-axis) for each data set. The decay of linkage disequilibrium between all pairs of SNP loci in the two regions is shown as a function of the distance between the loci.

Fig 2. Proportion of SNP pairs that are in strong LD.

3.2. Linkage Disequilibrium

The deviation in LD patterns is one of the key elements in evaluating the differences between these two haplotype phasing methods. The pattern of LD for a data set depends on the allele distribution and the accuracy of the haplotype phase [5]. LD information is essential for tag SNP selection and for modeling haplotype association tests, so association studies based on LD will be affected by inaccurately phased haplotypes. The pairwise LD analysis of the data sets was performed on the phased haplotypes. SNPs with zero minor allele frequency were excluded from the linkage disequilibrium analysis. The density of SNPs in both data sets is high.

No significant difference was observed in the average D' value between marker pairs of the two data sets. The average D' in the IL-1 data is 0.96 and 0.97, calculated from the phased haplotypes of the two methods. The average D' in the CF data is 0.91 and 0.92. To quantify the ratio of SNP pairs that are in strong LD, we adopted the definition of "strong LD" in which the one-sided upper 95% CI bound on D' between a pair of SNPs is > 0.98 [6]. The ratio of strong LD in both data sets is significantly different in the PHASE haplotype results than in the SNPHAP results (Fig 2). This difference in LD will dictate a different number of tag SNPs for a data set depending on which of the two methods is used to infer haplotypes.

To gain a sense of the difference in the LD patterns and in the actual estimation of haplotypes by the different methods, we computed the pairwise correlation coefficient r². This measure cannot be computed directly from unphased genotypes; it can only be computed from haplotypes [9]. In the IL-1 data the average r² is 0.42 and 0.34 for PHASE and SNPHAP inferred haplotypes respectively. The CF data also shows a difference in average r², 0.37 and 0.29 (Fig 1). The variation in minor allele frequencies described in the previous section is also a cause of the observed deviation in LD. This analysis did not assess the accuracy of the two methods; instead it reveals the differences between them.
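Both LD measures discussed in this section are simple functions of two-SNP haplotype counts, which is why they depend directly on the phased output. The sketch below uses the standard textbook definitions of D, Lewontin's D' and r²; it is not Haploview's code, and the function name `ld_measures` and the 0/1 allele coding are assumptions for illustration. Monomorphic SNPs are rejected, mirroring the exclusion of SNPs with zero minor allele frequency described above.

```python
def ld_measures(haplotypes):
    """Return (D, |D'|, r2) for two biallelic SNPs.

    haplotypes: list of (a, b) tuples, one per chromosome, alleles coded 0/1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(a for a, _ in haplotypes) / n          # freq of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # freq of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("monomorphic SNP: LD is undefined")
    d = p_ab - p_a * p_b
    # D' scales D by its largest possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Only three of the four possible haplotypes occur, so |D'| is 1
# even though r2 is well below 1.
sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0)] * 2
print(ld_measures(sample))
```

The toy sample shows why the two measures can disagree: D' reaches 1 whenever one of the four haplotypes is absent, whereas r² reaches 1 only when the two SNPs carry the same information. A phasing method that assigns even a few individuals to a rare fourth haplotype can therefore shift both values.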


Fig 3. Haplotype blocks. A, 2 haplotype blocks in IL-1 data with haplotype phases from the PHASE algorithm; B, 1 haplotype block in IL-1 data with haplotype phases from the SNPHAP algorithm; C, 2 haplotype blocks in CF data with haplotype phases from the PHASE algorithm; D, 2 haplotype blocks in CF data with haplotype phases from the SNPHAP algorithm.

3.3. Haplotype Blocks

The PHASE algorithm performed consistently and inferred haplotypes for each of the 90 individuals in the two data sets. However, the SNPHAP algorithm inferred haplotypes for only 73 and 67 individuals in the IL-1 and CF data respectively. This is a substantial loss of power for the analysis of linkage disequilibrium and haplotype blocks. It is important to know the underlying haplotype structure of the human genome, as haplotype methods have revealed genes for common and complex diseases [6]. Evidence suggests that haplotypes have a block-like structure in the human genome, and only a small number of common haplotypes are needed to describe an entire dataset [10]. The definition of a common haplotype block was adapted from Gabriel et al., where a block is generated if 95% of informative comparisons are in "strong LD" [6].

With PHASE haplotypes, the IL-1 data yielded 2 blocks with 15 common haplotypes. With SNPHAP haplotypes, the IL-1 data yielded 1 block with 12 common haplotypes. Similarly, the CF data yielded 2 blocks with 19 common haplotypes using PHASE haplotypes, and 2 blocks with 20 common haplotypes using SNPHAP haplotypes. In both data sets, the haplotypes phased by the two algorithms produced different blocks with a different number of common haplotypes. These differences in haplotype blocks will lead to different haplotype maps of these two regions, and differences in mapping make it problematic to identify genes for various diseases [6].
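The 95% rule for calling a block can be sketched as a check over precomputed confidence bounds on D' for every SNP pair in a candidate region. The text above gives only the 0.98 upper-bound threshold and the 95% fraction. The 0.7 lower-bound requirement for strong LD and the 0.9 upper-bound cut-off for treating a pair as evidence of historical recombination are assumptions taken from the commonly cited form of the Gabriel et al. criteria, and the function names are hypothetical.

```python
def classify_pair(ci_low, ci_high):
    """Label a SNP pair from the confidence bounds on its D' estimate."""
    # Assumed thresholds: 0.98 is from the text; 0.7 and 0.9 are assumptions.
    if ci_high > 0.98 and ci_low > 0.7:
        return "strong_ld"
    if ci_high < 0.9:
        return "recombination"
    return "uninformative"


def is_block(pair_bounds, min_fraction=0.95):
    """True if at least min_fraction of informative pairs are in strong LD.

    pair_bounds: list of (ci_low, ci_high) tuples, one per SNP pair.
    """
    labels = [classify_pair(low, high) for low, high in pair_bounds]
    informative = [label for label in labels if label != "uninformative"]
    if not informative:
        return False  # nothing to base a block call on
    strong = informative.count("strong_ld")
    return strong / len(informative) >= min_fraction


# Ten strong-LD pairs plus two uninformative ones still form a block;
# adding two recombinant pairs drops the fraction to 10/12 and breaks it.
tight = [(0.85, 1.0)] * 10 + [(0.3, 0.95)] * 2
print(is_block(tight))
print(is_block(tight + [(0.1, 0.6)] * 2))
```

Because the confidence bounds are estimated from the phased haplotypes, and narrow as more individuals are successfully phased, a method that phases fewer individuals can move pairs between these three categories and so change where block boundaries fall.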

4. CONCLUSIONS

This study demonstrates the use of haplotype phase for population genetic analysis. It shows the dramatic effect that the choice of phasing algorithm has on analyses based on haplotypes. Different algorithms produce different results with varying accuracy. In this study, we analyzed two data sets with different LD structure. The IL-1 region is highly linked whereas the Cystic Fibrosis region is relatively discrete. SNPHAP failed to infer haplotypes for all the individuals in both data sets, and this incomplete inference of haplotypes will significantly influence the power of statistical analysis. PHASE, on the other hand, is more robust and infers haplotypes for all the individuals in both data sets. Future work is to investigate the robustness and accuracy of the two algorithms by performing a genome-wide comparison in support of this study.

5. ACKNOWLEDGEMENT

We would like to thank Dr. Michael Nothnagel at the Christian Albrechts University, IMIS, Germany, for valuable suggestions. The authors would also like to thank Matthew Stephens and David Clayton for providing the applications used for haplotype phasing. This work was funded by the Atlantic Canada Opportunities Agency – Atlantic Innovation Fund, Canada.

6. REFERENCES

[1] R.M. Adkins, “Comparison of the accuracy of methods of computational haplotype inference using a large empirical dataset,” BMC Genetics, vol. 5:22, 2001.

[2] J.C. Barrett, B. Fry, J. Maller, and M.J. Daly, “Haploview: analysis and visualization of LD and haplotype maps,” Bioinformatics, vol. 21, pp. 263 – 265, 2004.

[3] D. Clayton, “SNPHAP: A program for estimating frequencies of large haplotypes of SNPs,” 1.0 ed. Cambridge: Department of Medical Genetics, Cambridge Institute for Medical Research, 2003. http://portal.litbio.org/Registered/Help/snphap/

[4] J. Davies, E. Alton, and U. Griesenbach, “Cystic fibrosis modifier genes,” Bioinformatics, vol. 21, pp. 263 – 265, 2004.

[5] D. Fallin and N.J. Schork, “Accuracy of haplotype estimation for biallelic loci, via the Expectation-Maximization algorithm for unphased diploid genotype data,” Am. J. Hum. Genet., vol. 67, pp. 947 – 959, 2000.

[6] S.B. Gabriel et al., “The structure of haplotype blocks in the human genome,” Science, vol. 296, pp. 2225 – 2229, 2002.

[7] D. Gusfield, “An overview of combinatorial methods for haplotype inference,” in Computational Methods for SNPs and Haplotype Inference, vol. 2983, pp. 9 – 25, 2004.

[8] The International HapMap Consortium, “A second generation human haplotype map of over 3.1 million SNPs,” Nature Rev. Genet., vol. 449, pp. 851 – 862, 2007.

[9] J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, Z.S. Qin, H.M. Munro, G.R. Abecasis, and P. Donnelly, “A comparison of phasing algorithms for trios and unrelated individuals,” Am. J. Hum. Genet., vol. 78, pp. 437 – 450, 2006.

[10] N. Patil et al., “Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21,” Science, vol. 294, pp. 1719 – 1723, 2001.

[11] P. Rahman, S. Sun, L. Peddle, T. Snelgrove, W. Melay, C. Greenwood, and D. Greenwood, “Association between the Interleukin-1 family gene cluster and Psoriatic Arthritis,” Arthritis & Rheumatism, vol. 54, pp. 2321 – 2325, 2006.

[12] M. Stephens, N. Smith, and P. Donnelly, “A new statistical method for haplotype reconstruction from population data,” Am. J. Hum. Genet., vol. 68, pp. 978 – 989, 2001.
