Chapter 12 Statistical Genetics
Darlene R. Goldstein
Terry Speed has produced many interesting and important contributions to the field of statistical genetics, with work encompassing both experimental crosses and human pedigrees. He has been instrumental in uncovering and elucidating algebraic structure underlying a diverse range of statistical problems, providing new and unifying insights.
Here, I provide a brief commentary and introduction to some of the key building blocks for an understanding of the papers. Some readers may also find useful a refresher on group action (see e.g. Fraleigh [6]) and hidden Markov models [13].
Linkage mapping
Linkage analysis studies inheritance of traits in families, with the aim of determining the chromosomal location of genes influencing the trait. The analysis proceeds by tracking patterns of coinheritance of the trait of interest and other traits or genetic markers, relying on the varying degree of recombination between trait and marker loci to map the loci relative to one another.
A measure of the degree of linkage is the recombination fraction θ, the chance of recombination occurring between two loci. For unlinked genes, θ = 1/2; for linked genes, 0 ≤ θ < 1/2.
D.R. Goldstein
Institut de mathematiques d'analyse et applications, Ecole Polytechnique Federale de Lausanne, and Swiss Institute of Bioinformatics, Switzerland
e-mail: [email protected]
S. Dudoit (ed.), Selected Works of Terry Speed, Selected Works in Probability and Statistics, DOI 10.1007/978-1-4614-1347-9_12, © Springer Science+Business Media, LLC 2012
Human pedigrees
S. Dudoit and T. P. Speed (1999). A score test for linkage using identity by descent data from sibships. Annals of Statistics 27:943–986.
This paper offers a novel and comprehensive algebraic view of sib-pair methods, fundamentally unifying a large collection of apparently ad hoc procedures and providing powerful insights into the methods.
Identical by descent allele sharing
Data for linkage analysis consist of sets of related individuals (pedigrees) and information on the genetic marker and/or trait genotypes or phenotypes. The recombination fraction is most commonly estimated by maximum likelihood for an appropriate genetic model for the coinheritance of the loci.
However, likelihood-based linkage analysis can be difficult to carry out because the mode of inheritance may be complex and in any case is usually unknown. Nonparametric approaches are thus appealing, since they do not require a genetic inheritance model to be specified. Such methods typically focus on identical by descent (IBD) allele sharing at a locus between a pair of relatives. DNA at a locus is shared by two relatives identical by descent if it originated from the same ancestral chromosome. In families of individuals possessing the trait of interest, there is association between the trait and allele sharing at loci linked to trait susceptibility loci, which can be used to localize trait susceptibility genes.
Testing for linkage with IBD data has developed differently, depending on the type of trait. For qualitative traits, tests are based on IBD sharing conditional on phenotypes. Affected sib-pair methods are a popular choice; these are often described as nonparametric since the mode of inheritance does not need to be specified (see Hauser and Boehnke [10] for a review). On the other hand, for quantitative trait loci (QTL), linkage analysis is based on examination of phenotypes conditional on sharing (for example, the method of Haseman and Elston [9] or one of its many extensions).
Inheritance vector
The pattern of IBD sharing at a locus within a pedigree is summarized by an inheritance vector, which completely specifies the ancestral source of DNA. For sibships of size k, label the paternally derived alleles at the locus (1, 2) and the maternally derived alleles (3, 4). The inheritance vector at a given locus is the vector $x = (x_1, x_2, \ldots, x_{2k-1}, x_{2k})$, where for sib i, $x_{2i-1}$ is the label of the paternally inherited allele (1 or 2) and $x_{2i}$ is that of the maternally inherited allele (3 or 4) at the locus.
For a pair of sibs, when paternal and maternal allele sharing are not distinguished, the 16 possible inheritance vectors give rise to three IBD configurations $C_j$: the sibs may share 0, 1, or 2 alleles IBD at the locus (Table 12.1). The IBD configurations can be thought of as orbits of groups acting on the set of possible inheritance vectors X [2].
Table 12.1 Sib-pair IBD configurations
Alleles IBD        Inheritance vectors                                       |C_j|
0 IBD              (1, 3, 2, 4), (1, 4, 2, 3), (2, 3, 1, 4), (2, 4, 1, 3)    4
1 IBD (paternal)   (1, 3, 1, 4), (1, 4, 1, 3), (2, 3, 2, 4), (2, 4, 2, 3)    8
      (maternal)   (1, 3, 2, 3), (1, 4, 2, 4), (2, 3, 1, 3), (2, 4, 1, 4)
2 IBD              (1, 3, 1, 3), (1, 4, 1, 4), (2, 3, 2, 3), (2, 4, 2, 4)    4
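The grouping of the 16 sib-pair inheritance vectors into three IBD configurations can be checked by direct enumeration. The following is a minimal sketch, assuming the labelling described above (paternal alleles 1 and 2, maternal alleles 3 and 4); the function names are illustrative only and do not come from the papers discussed.

```python
from collections import defaultdict
from itertools import product

PATERNAL = (1, 2)  # labels of the father's two alleles
MATERNAL = (3, 4)  # labels of the mother's two alleles


def sib_pair_inheritance_vectors():
    """All vectors (x1, x2, x3, x4): sib 1 is (x1, x2), sib 2 is (x3, x4)."""
    return list(product(PATERNAL, MATERNAL, PATERNAL, MATERNAL))


def ibd_count(x):
    """Number of alleles the two sibs share IBD (0, 1 or 2)."""
    x1, x2, x3, x4 = x
    return int(x1 == x3) + int(x2 == x4)


def ibd_configurations():
    """Group the inheritance vectors by the number of alleles shared IBD."""
    configs = defaultdict(list)
    for x in sib_pair_inheritance_vectors():
        configs[ibd_count(x)].append(x)
    return dict(configs)


configs = ibd_configurations()
sizes = {j: len(vectors) for j, vectors in sorted(configs.items())}
print(sizes)  # {0: 4, 1: 8, 2: 4}
```

The sizes 4, 8 and 4 agree with the last column of Table 12.1.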
Score test for linkage
The literature contains several proposed tests of the null hypothesis of no linkage (H : θ = 1/2) based on score functions of IBD configurations for sibships and other pedigrees, with scores chosen to yield good power against a particular alternative. The score test of Dudoit and Speed to detect linkage represents a major breakthrough in that it creates a coherent, unified approach to the linkage analysis of qualitative and quantitative traits using IBD data. The likelihood for the recombination fraction θ, conditional on the phenotypes of the relatives, is used to form a score test of the null hypothesis of no linkage (θ = 1/2).
The probability vector of IBD configurations, conditional on pedigree phenotypes, at a marker locus linked to a trait susceptibility locus at recombination fraction θ can be written as $\rho(\theta, \pi)_{1 \times m} = \pi_{1 \times m}\, T(\theta)_{m \times m}$, where π represents the conditional probability vector for IBD configurations at the trait locus and the number of IBD configurations is m. T(θ) denotes the transition matrix between IBD configurations at loci separated by recombination fraction θ.
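For sib pairs (m = 3 configurations: 0, 1 or 2 alleles IBD), the product ρ = πT(θ) can be sketched numerically. The closed form used for T(θ) below is an assumption not stated in this commentary: it follows from treating paternal and maternal sharing as independent, each sharing status being preserved between the two loci with probability ψ = θ² + (1 − θ)². The vector `pi_example` is a made-up illustration, not data.

```python
from fractions import Fraction


def psi(theta):
    """Chance that two meioses from one parent agree in recombination status."""
    return theta**2 + (1 - theta) ** 2


def transition_matrix(theta):
    """3x3 matrix T(theta) over sib-pair IBD counts 0, 1, 2 (assumed form)."""
    p = psi(theta)
    q = 1 - p
    return [
        [p * p, 2 * p * q, q * q],
        [p * q, p * p + q * q, p * q],
        [q * q, 2 * p * q, p * p],
    ]


def marker_ibd_distribution(pi, theta):
    """Row vector rho(theta, pi) = pi T(theta)."""
    T = transition_matrix(theta)
    return [sum(pi[i] * T[i][j] for i in range(3)) for j in range(3)]


# Hypothetical IBD distribution at the trait locus, for illustration only.
pi_example = [Fraction(1, 10), Fraction(4, 10), Fraction(5, 10)]

# Under no linkage (theta = 1/2) the marker carries no information about pi.
print(marker_ibd_distribution(pi_example, Fraction(1, 2)))
# [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
```

At θ = 1/2 every row of T equals (1/4, 1/2, 1/4), so ρ does not depend on π; at θ = 0 the matrix is the identity and ρ = π.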
In general, the probability vector π depends on unknown genetic parameters. However, using their formulation of the problem, Dudoit and Speed [4] show rigorously that for affected sibships of a given size, the second-order Taylor series expansion of the log likelihood around the null of no linkage is independent of the genetic inheritance model. They thus provide a mathematically justified basis for affected sib-pair methods, which do not require an assumed mode of inheritance.
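To see why the second-order term can be free of the genetic model, it may help to work through the sib-pair case by hand. The derivation below is a sketch under an assumed closed form for the sib-pair transition matrix (IBD states ordered 0, 1, 2, with ψ = θ² + (1 − θ)²); it is not a reproduction of the general argument in [4], and the counts N_0, N_1, N_2 of pairs sharing 0, 1, 2 alleles IBD are notation introduced here.

```latex
% Assumed sib-pair transition matrix, with \psi = \theta^2 + (1-\theta)^2:
\[
T(\theta) =
\begin{pmatrix}
\psi^2 & 2\psi(1-\psi) & (1-\psi)^2 \\
\psi(1-\psi) & \psi^2 + (1-\psi)^2 & \psi(1-\psi) \\
(1-\psi)^2 & 2\psi(1-\psi) & \psi^2
\end{pmatrix}
\]
% At \theta = 1/2: \psi = 1/2, d\psi/d\theta = 2(2\theta - 1) = 0, d^2\psi/d\theta^2 = 4.
\[
\left.\frac{dT}{d\psi}\right|_{\psi = 1/2} =
\begin{pmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{pmatrix},
\qquad
\rho(1/2, \pi) = \left(\tfrac14, \tfrac12, \tfrac14\right),
\qquad
\rho'(1/2, \pi) = 0,
\]
\[
\rho''(1/2, \pi) = 4\,\pi \left.\frac{dT}{d\psi}\right|_{\psi = 1/2}
= 4(\pi_2 - \pi_0)\,(-1,\ 0,\ 1).
\]
% Log likelihood for n sib pairs with IBD counts N_0, N_1, N_2:
\[
\ell(\theta) = \sum_{j=0}^{2} N_j \log \rho_j(\theta, \pi),
\qquad
\ell'(1/2) = 0,
\qquad
\ell''(1/2) = \sum_{j=0}^{2} N_j\, \frac{\rho_j''(1/2, \pi)}{\rho_j(1/2, \pi)}
= 16(\pi_2 - \pi_0)(N_2 - N_0).
\]
```

In this sketch the unknown model enters only through the constant factor π_2 − π_0, so the data-dependent part of the second-order term, N_2 − N_0, can be computed without specifying a mode of inheritance.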
Practical advantages of the score test include: it is locally most powerful for alternatives close to the null; any genotype distribution can be used (i.e., Hardy-Weinberg equilibrium is not required); conditioning on phenotypes eliminates selection bias introduced by nonrandom ascertainment; and combining differently ascertained pairs is straightforward, providing the important benefit of allowing us to avoid discarding any data. For many realistic simulation scenarios [7, 8], the score test proves to be robust and shows large power gains over commonly used nonparametric tests.
Although the paper focuses on pairs of sibs, the same score test approach is also applicable to any set of relatives [3].
Experimental crosses
N. J. Armstrong, M. S. McPeek and T. P. Speed (2006). Incorporating interference into linkage analysis for experimental crosses. Biostatistics 7:374–386.
This paper improves multilocus linkage analysis of experimental crosses by incorporating a realistic model of crossover interference, and implementing it by extending the Lander-Green algorithm for genetic reconstruction. It represents the culmination of a series of studies of the modeling of crossover interference.
χ² model of crossover interference
During the (four-strand) process of crossing over in meiosis, two types of interference (nonindependence) are distinguished: chromatid interference, a situation in which the occurrence of a crossover between any pair of nonsister chromatids affects the probability of those chromatids being involved in other crossovers in the same meiosis; and crossover interference, which refers to nonrandom location of chiasmata along a chromosome.
Most genetic mapping is carried out assuming independence; that is, no chromatid interference and no crossover interference. This assumption simplifies likelihood calculations. Although there is little empirical evidence for chromatid interference, there is substantial evidence of crossover interference. Thus, a more realistic model incorporating crossover interference should be able to provide more accurately estimated genetic maps.
The χ² model of crossover interference [5] provides a dramatically improved fit over a wide range of models [12, 14]. This model assumes that recombination intermediates (structures formed after initiation of recombination) are resolved in one of two ways: either with or without crossing over. Recombination initiation events are assumed to occur according to a Poisson process, but constraints on the resolution of intermediates create interference. The χ² model assumes m unobserved intermediates between each crossover. For m = 0, every intermediate is resolved as a crossover and the model reduces to the no (crossover) interference model. This model is a special case of the more general gamma model, but has the advantage of being computationally more feasible.
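The mechanism can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the fitting procedure of [5], [12] or [14]: positions are in arbitrary units on the four-strand bundle, the parametrization is the one in which each crossover is followed by m non-crossover intermediates (so m = 0 makes every intermediate a crossover), and the thinning of crossovers to a single gamete is omitted. The function name and the `rate` parameter are hypothetical.

```python
import random


def simulate_chi_square_crossovers(length, m, rate=1.0, rng=None):
    """Simulate intermediates and crossovers on the interval [0, length].

    Intermediates form a Poisson process with intensity rate * (m + 1).
    Every (m + 1)-th intermediate is resolved as a crossover, so crossovers
    occur with average intensity `rate`. The index of the first crossover
    is drawn uniformly from 0..m so the process starts in a stationary state.
    """
    if rng is None:
        rng = random.Random()
    intensity = rate * (m + 1)
    intermediates = []
    position = rng.expovariate(intensity)
    while position <= length:
        intermediates.append(position)
        position += rng.expovariate(intensity)
    phase = rng.randrange(m + 1)  # which intermediate is the first crossover
    crossovers = intermediates[phase :: m + 1]
    return intermediates, crossovers


# Usage: compare no interference (m = 0) with m = 4 on the same interval.
for m in (0, 4):
    inter, cross = simulate_chi_square_crossovers(100.0, m, rng=random.Random(2012))
    print(m, len(inter), len(cross))
```

With m = 0 the crossovers coincide with the Poisson initiation events, so they fall independently; with larger m the gaps between crossovers are sums of m + 1 exponential spacings and are therefore more regular, which is the interference effect.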
Genetic reconstruction and the Lander-Green algorithm
Genetic mapping in humans can be viewed as a missing data problem, since we are typically unable to observe the complete data (the number of recombinant and nonrecombinant meioses for each interval). If complete data were available, maximum likelihood estimates of a set of recombination fractions $\theta_i$, $i = 1, \ldots, T-1$, for adjacent markers $M_1, \ldots, M_T$ would just be the observed proportion of recombinants in an interval.
The genetic reconstruction problem is to determine the expected number of recombinations that occurred in intervals of adjacent markers, given genotypes at multiple marker loci in a pedigree and the recombination fraction for each interval. Reconstruction is straightforward when there is complete genotype information, including the ancestral origin (paternal or maternal).
More commonly this information is not known, so a different strategy for likelihood calculation is needed to obtain recombination fraction estimates. Lander and Green [11] proposed an approach based on the use of inheritance vectors. They showed that the probability of the observed data can be calculated for any particular inheritance vector and that under no crossover interference, the inheritance vectors form a Markov chain along the chromosome. They model the pedigree and data as a hidden Markov model, where the hidden states are the (unobserved) inheritance vectors. The complexity of their algorithm for calculating likelihoods increases linearly with the number of markers but exponentially in pedigree size, making it appropriate for analysis of many markers on small to moderately sized pedigrees.
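The hidden Markov structure can be sketched with a forward recursion over inheritance vectors. This is an illustrative sketch rather than the implementation of [11]. It encodes each inheritance vector as a tuple of 0/1 meiosis indicators (an equivalent relabelling of the allele labels used earlier), assumes a uniform distribution over vectors at the first marker, and takes the per-marker probabilities of the data given each vector as a hypothetical input `emissions`, since computing them from genotypes is a separate step not shown here.

```python
from itertools import product


def transition_prob(v, w, theta):
    """P(vector w at the next marker | vector v here), no interference."""
    d = sum(a != b for a, b in zip(v, w))  # number of meioses that recombined
    return theta**d * (1 - theta) ** (len(v) - d)


def lander_green_likelihood(emissions, thetas, n_meioses):
    """Likelihood of the marker data by the HMM forward algorithm.

    emissions[t][v] = P(data at marker t | inheritance vector v) (assumed given)
    thetas[t]       = recombination fraction between markers t and t + 1
    """
    states = list(product((0, 1), repeat=n_meioses))
    prior = 1.0 / len(states)
    alpha = {v: prior * emissions[0][v] for v in states}
    for t, theta in enumerate(thetas, start=1):
        alpha = {
            w: emissions[t][w]
            * sum(alpha[v] * transition_prob(v, w, theta) for v in states)
            for w in states
        }
    return sum(alpha.values())


# Usage with made-up emission probabilities for 2 meioses and 2 markers.
states = list(product((0, 1), repeat=2))
first = {v: 1.0 if v == (0, 0) else 0.0 for v in states}
second = {v: 1.0 if v == (1, 1) else 0.0 for v in states}
print(lander_green_likelihood([first, second], [0.5], 2))  # 0.0625
```

The loop runs once per marker interval, while the number of states is 2 raised to the number of meioses, which mirrors the linear cost in markers and exponential cost in pedigree size noted above.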
In experimental crosses, mapping is generally more straightforward since investigators can arrange crosses to produce complete data. However, the presence of unobserved recombination initiation points creates a new kind of missing data when the no interference model is not assumed. The creative insight of Armstrong et al. [1] is to model the crossover interference process as a hidden Markov model. This step works because even though crossovers resulting from initiation events do not occur independently (in the presence of crossover interference), the initiation events themselves are assumed to be independent. Armstrong et al. [1] are thus able to extend the Lander-Green algorithm to incorporate interference according to the χ² model, thereby providing more accurately estimated genetic maps.
Conclusion
Terry's work in statistical genetics has identified underlying commonalities across seemingly disparate procedures, contributing meaningful theoretical and practical improvements. An impressive aspect of these works is the fresh perspective offered by viewing the problems at a stripped-down, fundamental level. Applying an exceptional combination of extensive mathematical expertise and pragmatic sensibility, Terry provides inventive solutions and a richer structural understanding of significant questions in statistical genetics.
References
[1] N. J. Armstrong, M. S. McPeek, and T. P. Speed. Incorporating interference into linkage analysis for experimental crosses. Biostatistics, 7(3):374–386, 2006.
[2] K. P. Donnelly. The probability that related individuals share some section of genome identical by descent. Theor. Popul. Biol., 23:34–63, 1983.
[3] S. Dudoit. Linkage Analysis of Complex Human Traits Using Identity by Descent Data. PhD thesis, Department of Statistics, University of California, Berkeley, 1999.
[4] S. Dudoit and T. P. Speed. A score test for linkage using identity by descent data from sibships. Ann. Stat., 27(3):943–986, 1999.
[5] E. Foss, R. Lande, F. Stahl, and C. Steinberg. Chiasma interference as a function of genetic distance. Genetics, 133:681–691, 1993.
[6] J. B. Fraleigh. A First Course in Abstract Algebra. Addison-Wesley Pub. Co., Reading, Mass., 7th edition, 2002.
[7] D. R. Goldstein, S. Dudoit, and T. P. Speed. Power of a score test for quantitative trait linkage analysis of relative pairs. Genet. Epidemiol., 19(Suppl. 1):S85–S91, 2000.
[8] D. R. Goldstein, S. Dudoit, and T. P. Speed. Power and robustness of a score test for linkage analysis of quantitative trait based on identity by descent data on sib pairs. Genet. Epidemiol., 20(4):415–431, 2001.
[9] J. K. Haseman and R. C. Elston. The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet., 2:3–19, 1972.
[10] E. R. Hauser and M. Boehnke. Genetic linkage analysis of complex genetic traits by using affected sibling pairs. Biometrics, 54:1238–1246, 1998.
[11] E. S. Lander and P. Green. Construction of multilocus genetic maps in humans. Proc. Natl. Acad. Sci. USA, 84:2363–2367, 1987.
[12] M. S. McPeek and T. P. Speed. Modeling interference in genetic recombination. Genetics, 139:1031–1044, 1995.
[13] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[14] H. Zhao, T. P. Speed, and M. S. McPeek. Statistical analysis of crossover interference using the chi-square model. Genetics, 139:1045–1056, 1995.