a new admixture model for inference of population

Upload: goldspotter9841

Post on 03-Apr-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 7/28/2019 A New Admixture Model for Inference of Population

    1/8

    mStruct: A New Admixture Model for Inference of Population

    Structure in Light of Both Genetic Admixing and Allele Mutations

    Suyash Shringarpure [email protected]

    Eric P. Xing [email protected]

    School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

    Abstract

    Traditional methods for analyzing popula-tion structure, such as the Structure pro-gram, ignore the influence of mutational ef-fects. We propose mStruct, an admixtureof population-specific mixtures of inheritancemodels, that addresses the task of structure

    inference and mutation estimation jointlythrough a hierarchical Bayesian framework,and a variational algorithm for inference. Wevalidated our method on synthetic data, andused it to analyze the HGDP-CEPH cell linepanel of microsatellites used in (Rosenberget al., 2002) and the HGDP SNP data usedin (Conrad et al., 2006). A comparison ofthe structural maps of world populations esti-mated by mStructand Structureis presented,and we also report potentially interestingmutation patterns in world populations es-timated by mStruct, which is not possible byStructure.

    1. Introduction

    The deluge of genomic polymorphism data, such asthe genome-wide multilocus genotype profiles of vari-able number of tandem repeats (i.e., microsatellites)and single nucleotide polymorphisms (i.e., SNPs), hasfueled the long-standing interest in analyzing patternsof genetic variations.to reconstruct the ancestral struc-tures of modern human populations, because such ge-netic ancestral information can shed light on the evo-

    lutionary history of modern populations and provideguidelines for more accurate association studies andother population genetics problems.

    One of the state-of-the-art methods for populationstructure analysis based on multilocus genotype data

    Appearing in Proceedings of the 25th International Confer-ence on Machine Learning, Helsinki, Finland, 2008. Copy-right 2008 by the author(s)/owner(s).

    Figure 1. Population structural map inferred by Structureon HapMap data consisting of 4 populations. The colorsrepresent different populations

    is the program Structure, whose basic form is basedon a statistical formalism known as the admixturemodel (Pritchard et al., 2000). Admixtures are in-stances of a more general class of hierarchical Bayesianmodels known as mixed membership models(Eroshevaet al., 2004), which postulate that genetic markers ofeach individual are iid (Pritchard et al., 2000) or spa-tially coupled (Falush et al., 2003) samples from multi-ple population-specific fixed-dimensional multinomialdistributions (which we will call allele frequency pro-

    files (Falush et al., 2003), or AP) of marker alleles.Under this assumption, the admixture model identi-fies each ancestral population by a specific AP (thatdefines a unique allele frequency distribution for each

    ancestral population for each marker) and displays thefraction of contributions from each AP in a modernindividual chromosome as a structural map. Figure 1shows an example of a structural map of four modernpopulations inferred from a portion of the HapMapmulti-population dataset by Structure. In this popu-lation structural map, each individual is representedas a thin vertical line which shows the fraction of theindividuals chromosome which originated from eachancestral population, as given by a unique AP. Thismethod has been successfully applied to human ge-netic data in (Rosenberg et al., 2002) and has unrav-eled impressive patterns in the genetic structures of

    world population.

    However, since an AP merely represents the frequencyof alleles in an ancestral population, rather than theactual allelic content or haplotypes of the alleles them-selves, the admixture models developed so far basedon AP do not model genetic changes due to muta-tions from the ancestral alleles. Indeed, a serious pit-fall of the model underlying Structure, as pointed outin (Excoffier & Hamilton, 2003), is that there is no

  • 7/28/2019 A New Admixture Model for Inference of Population

    2/8

    mStruct: Structure with mutations

    mutation model for modern individual alleles with re-spect to hypothetical common prototypes in the ances-tral populations, i.e, every unique allele in the mod-ern population is assumed to have a distinct ances-tral frequency, rather than allowing the possibility ofit just being a descendent of some common ancestralallele. Thus, while Structure aims to provide ancestryinformation for each individual and each locus, thereis no explicit representation of ancestors as a phys-ical set of founding alleles. Therefore, the inferredpopulation structural map emphasizes revealing thecontributions of abstract population-specific allele fre-quency profiles, which does not directly reflect indi-vidual diversity or the extent of genetic changes withrespect to the founders. Therefore, Structure does notenable inference of the founding genetic patterns, theage of the founding alleles, or the population diver-gence time (Excoffier & Hamilton, 2003).

    Another important issue in determining population

    structure is to look for the presence of admixture, abasic assumption of the Structure model. However,as we shall see later, on the HGDP data, it producesresults that cluster individuals cleanly into one allelefrequency profile or the other, thus leading us to con-clude that there was little or no admixture betweenthe human populations. While such a partitioning ofindividuals would be desirable for clustering them intogroups, it does not offer us any biological insight intothe intermixing of the populations.

    In this paper, we present mStruct (for structure un-der mutations), based on an admixture of population-

    specific mixtures of inheritance model (AdMim). Ad-Mim is an admixture of mixtures model, which rep-resents each ancestral population as a mixture of an-cestral alleles each with its own inheritance process,and each modern individual as an ancestry propor-tion vector (ancestry vector or map vector) that in-dicates membership proportions among the ancestralpopulations. By a simple but important extensionto the LDA-like (Blei et al., 2003) admixture modelused by Structure, mStruct facilitates estimation ofboth the structural map of populations (incorporat-ing mutations) and the mutation rates of either SNPor microsatellite alleles. A new variational inference

    algorithm was developed for inference and learning.We compare our method with Structureon both syn-thetic genotype data, and on the microsatellite andSNP genotype data of world populations (Rosenberget al., 2002; Conrad et al., 2006). Our results show thepresence of significant levels of admixture among thefounding populations. We also report interesting ge-netic divergence in world populations revealed by themutation patterns we estimated.

    2. The statistical model

    2.1. Representation of Populations

    To reveal the genetic composition of each modern in-dividual in terms of contributions from hypotheticalancestral populations via statistical inference on mul-tilocus genotype data, one must first choose an appro-

    priate representation of ancestral populations. Below,we begin with a brief description of a commonly usedmethod, followed by a new method that we propose.

    2.1.1. Population-Specific Allele FrequencyProfiles

    Due to the polymorphic nature of genetic markers, anintuitive statistic to characterize a population is thefrequencies of all observed alleles at all loci. For ex-ample, we can represent an ancestral population k bya unique set of population-specific multinomial dis-tributions, k {ki ; i = 1 : I}, where

    ki =[ki,1, . . . ,

    ki,L

    i] is the vector of multinomial parame-

    ters, also known as the allele frequency profile (Falushet al., 2003), or AP, of the allele distribution at locusi in ancestral population k; Li denotes the total num-ber of observed marker alleles at locus i, and Idenotesthe total number of marker loci. This representation,known as population-specific ancestry proportion pro-

    file, is used by the program Structure.

    2.1.2. Population-Specific Mixtures of AncestralAlleles

    A problem with the population-specific AP profile rep-resentation is that it ignores the possibility of muta-tions underlying the alleles observed in modern popu-

    lations with respect to their ancestral alleles. To cap-ture this, we propose to represent a population bya genetically more realistic statistical model knownas the population-specific mixtures of ancestral alle-les (MAA). For each locus i, an MAA for ancestral

    population k is a triple {ki , ki ,ki } consisting of a set

    of ancestral (or founder) alleles ki = (ki,1, . . . ,

    ki,Li

    ),which can differ from their descendent alleles in themodern population; a mutation rate ki associated withthe locus, which can be further generalized to be allele-specific if necessary; and an AP ki which now repre-sents the frequencies of the ancestral alleles. Here Lidenotes the total number of ancestral alleles at loci i.

    An MAA is strictly more expressive than an AP, be-cause the incorporation of a mutation model helps tocapture details about the population structure whichan AP cannot; and the MAA reduces to the AP whenthe mutation rates become zero and the founders areidentical to their descendents. As we show shortly,with an MAA, one can examine the mutation param-eters corresponding to each ancestral population via

  • 7/28/2019 A New Admixture Model for Inference of Population

    3/8

    mStruct: Structure with mutations

    Bayesian inference from genotype data; this might en-able us to infer the age of alleles, and also estimatepopulation divergence times.

    Let i {1, . . . , I } index the position of a locus in thestudy genome, n {1, . . . , N } index an individual inthe study population, and e {0, 1} index the two

    possible parental origin of an allele (in this study wedo not require strict phase information of the two al-leles, so the index e is merely used to indicate diploiddata). Under an MAA specific to an ancestral popu-lation k, the correspondence between a marker alleleXi,ne and a founder

    ki,l

    ki is not directly observ-

    able. For each allele founder ki,l, we associate with

    it an inheritance model p(|ki,l, ki,l) from which de-

    scendants can be sampled. Then, given specificationsof the ancestral population from which Xi,ne is de-rived (denoted by hidden indicator variable Zi,ne), theconditional distribution of Xi,ne under MAA followsa mixture of population-specific inheritance model:p(xi,ne = l

    | Zi,ne = k) =

    Ll=1

    ki,lp(xi,ne |

    ki,l,

    ki,l).

    Comparing to the counterpart of this function underAP: p(xi,ne = l

    | Zi,ne = k) = ki,l , we can see that

    the latter cannot explicitly model allele diversities interms of molecular evolution from the founders.

    2.2. A New Admixture Model for Population

    Structure

    The concept of admixture arises when modeling ob-jects (e.g., human beings) each comprising multipleinstances of some attributes (e.g., marker alleles), eachof which comes from a (possibly different) source dis-

    tribution Pk(|k), according to an individual-specificadmixing coefficient vector (a.k.a. map vector) .The map vectorrepresents the normalized contributionfrom each of the source distributions {Pk ; k = 1 : K}to the study object. For example, for every individ-ual, the alleles at all marker loci may be inherited fromfounders in different ancestral populations, each repre-sented by a unique distribution of founding alleles andthe way they can be inherited. Formally, this scenariocan be captured in the following generative process:

    1. For each individual n, draw the admixing vector:n P(|), where P(|) is a pre-chosen map prior.

    2. For each marker allele xi,ne xn

    2.1: draw the latent ancestral-population-origin

    indicator zi,ne Multinomial(| n);

    2.2: draw the allele xi,ne |zi,ne = k Pk(|k).

    As discussed in the previous section, an ancestral pop-ulation can be either represented as an AP or as anMAA. These two different representations lead to twodifferent probability distributions for Pk(|k) in thelast sampling step above, and thereby two differentadmixtures of very different characteristics.

    2.2.1. The existing model

    In Structure, the ancestral populations are representedby a set of population-specific APs. Thus the distri-bution Pk(|k) from which an observed allele can besampled is a multinomial distribution defined by therates of all observed alleles in the ancestral popula-tion, i.e., xi,n

    e

    |zi,ne

    = k Multinomial(|ki

    ). Us-ing this probability distribution in the general admix-ture scheme outlined above, we can see that Structureessentially implements an admixture of population-specific al lele ratesmodel. But a serious pitfall of usingsuch a model, as pointed out in (Excoffier & Hamilton,2003), is that there is no error model for individual al-leles with respect to the common prototypes, i.e, everyunique measurement at a particular allele is assumedto be a new allele, rather than allowing the possibilityof it just being the mutation of some common ancestralallele at that marker.

    2.2.2. The proposed model

    We propose to represent each ancestral population bya set of population-specific MAAs. Under this repre-sentation, now the distribution Pk(|k) from which anobserved allele can be sampled becomes a mixture ofinheritance models, each defined on a specific founder.The ensuing sampling module to be plugged into thegeneral admixture scheme outlined above (to replacestep 2.2) becomes a two-step generative process:

    2.2a: draw the latent founder indicator ci,ne |zi,ne =

    k Multinomial(|ki ); 2.2b: draw the allele xi,ne |ci,ne = l, zi,ne = k

    Pm(|ki,l,

    ki,l),

    where Pm() is a mutation model that can be flexiblydefined based on whether the genetic markers are mi-crosatellites or single nucleotide polymorphisms. Wecall this model an admixture of population-specific in-heritance models (AdMim), while the previous modelis technically only an admixture of population specificallele frequency profiles. Figure 2(a) shows a graphi-cal model the overall generative scheme for AdMim, incomparison with the admixture of population-specificallele rates discussed earlier. From the figure, we canclearly see that Structure is virtually identical to anLDA model, while mStructis an extended LDA modelwhich allows noisy observations.

    For simplicity of presentation, in the model describedabove we assume that for a particular individual,the genetic markers at each locus are conditionallyiid samples from a set of population-specific fixed-dimensional mixture of inheritance models, and thatthe set of founder alleles at a particular locus isthe same for all ancestral populations( ki = i).Also our model assumes Hardy-Weinberg equilibriumwithin populations. The simplifying assumptions of

  • 7/28/2019 A New Admixture Model for Inference of Population

    4/8

    mStruct: Structure with mutations

    N

    I

    n

    i,n

    K

    K

    k

    i=1..I

    i=1..I

    k

    c

    i,nx

    i,nz

    (a) mStruct

    N

    I

    n

    K

    k

    i=1..I

    i,nx

    i,nz

    (b) Structure 2.1

    Figure 2. Graphical Models: the circles represent randomvariables and diamonds represent hyperparameters.

    unlinked loci, no linkage disequilibrium between lociwithin populations can be easily removed by incorpo-rating Markovian dependencies over ancestral indica-tors Zi,ne and Zi+1,ne of adjacent loci, and over other

    parameters such as the allele frequencies ki in exactlythe same way as in Structure. We can also introduceMarkovian dependencies over mutation rates at adja-cent loci, which might be desirable to better reflectthe dynamics of molecular evolution in the genome.We defer such extensions to a later paper.

    2.3. Mutation model

    As described above, our model is applicable to almostall kinds of genetic markers by plugging in an appro-priate allele mutation model (i.e., inheritance model)Pm(). We now discuss two mutation models, for mi-crosatellites and SNPs, respectively.

    2.3.1. Microsatellite mutation model

    Microsatellites are the repeats of a small sequences inDNA about 1-4 base pairs in length which are usually

    represented as integer counts. The choice of a suitablemicrosatellite mutation model is important, for bothcomputational and interpretation purposes. Below wediscuss the mutation model that we use and the biolog-ical interpretation of the parameters of the mutationmodel. We begin with a stepwise mutation model formicrosatellites widely used in forensic analysis (Valdeset al., 1993; Lin et al., 2006).

    This model defines a conditional distribution of aprogeny allele b given its progenitor allele a, both ofwhich take continuous values:

    p(b|a) =1

    2(1 )|ba|1

    , (1)

    where is the mutation rate (probability of any mu-tation), and is the factor by which mutation de-creases as distance between the two alleles increases.Although this mutation distribution is not stationary(i.e. it does not ensure allele frequencies to be con-stant over the generations), it is simple and commonlyused in forensic inference. To some degree can beregarded as a parameter that controls the probabil-

    ity of unit-distance mutation, as can be seen from thefollowing identity: p(b + 1|a)/p(b|a) = .

    In practice, the two-parameter stepwise continuousmutation model described above complicates the in-ference process. We propose a discrete microsatellitemutation model that is a simplification of Eq. 1, but

    captures its main idea. We posit that: P(b|a) |ba|.

    It is not hard to show that normalizing this probabilitymass function gives us the mutation model as:

    P(b|a) =1

    1 a + |ba|. (2)

    We can interpret as a variance parameter, the factorby which probability drops as a fuction of the distancebetween the mutated version b of the allele a.

    Determination of founder set at each locus:

    According to our model assumptions, there can be adifferent number of founder alleles at each locus. Thisnumber is typically smaller than the number of alle-

    les observed at each marker since the founder allelesare ancestral. To estimate the appropriate numberand allele states of founders, we fit finite mixtures ofmicrosatellite mutation models, and use the BayesianInformation Criterion (BIC) to determine the cardi-nality of the mixture.

    Choice of mutation prior: In our model, the pa-rameter, as explained above, is a population-specificparameter that controls the probability of stepwisemutations. Being a parameter that controls the vari-ance of the mutation distribution, there is a possibilitythat inference on the model will encourage higher val-ues of to improve the log-likelihood, in the absenceof any prior distribution on . To avoid this situation,and to allow more meaningful and realistic results toemerge from the inference process, we impose on abeta prior that will be biased towards smaller valuesof . The beta prior will be a fixed one and will notbe among the parameters we estimate.

    2.3.2. SNP mutation model

    SNPs, or single nucleotide polymorphisms, representthe largest class of individual differences in DNA. Ingeneral, there is a well-defined correlation between theage of the mutation producing a SNP allele and thefrequency of the allele. For SNPs, we use a simplepointwise mutation model, rather than more complexblock models. Thus, the observations in SNP data areonly binary in nature (0/1). So, given the observedallele b, we say that the probability of it being derivedfrom the founder allele a is given by:

    P(b|a) = I[b=a] (1 )I[b=a]; a, b {0, 1}. (3)

    In this case, the mutation parameter is the proba-bility that the observed allele is not identical to the

  • 7/28/2019 A New Admixture Model for Inference of Population

    5/8

    mStruct: Structure with mutations

    founder allele, but derived from it due to a mutation.

    2.4. Inference and Parameter Estimation

    2.4.1. Probability distribution on the model

    For notational convenience, we will ignore the diploidnature of observations in the analysis that follows.

    With the understanding that the analysis is carriedout for the nth individual, we will drop the subscriptn. Also, we overload the indicator variables zi andci to be both, arrays with only one element equal to1 , as well as scalars with a value equal to the indexat which the array forms have 1s. In other words:zi 1, . . . ,K , ci 1, . . . ,L , zi,k = I[zi = k], andci,l = I[ci = l].

    The joint probability distribution of the the data andthe relevant variables under the AdMim model canthen be written as:

    px

    ,z,

    c,|,,,

    = p

    | IYi=1

    pzi|

    pci|zi,

    k=1K

    i.

    The marginal likelihood of the data can be com-puted by summing/integrating out the latent vari-ables. However, a closed-form solution to this sum-mation/integration is not possible, and indeed exactinference on hidden variables such as the map vector, and estimation of model parameters such as the mu-tation rates under AdMim is intractable.(Pritchardet al., 2000) developed an MCMC algorithms for ap-proximate inference for their admixture model under-lying Structure. We choose to apply a computationallymore efficient approximate inference method known asvariational inference (Jordan et al., 1999).

    2.5. Variational Inference

    We use a mean-field approximation for performinginference on the model. This approximation methodapproximates an intractable joint posterior p() of theall hidden variables in the model by a product ofmarginal distributions q() =

    qi(), each over only a

    single hidden variable. The optimal parameterizationofqi() for each variable is obtained by minimizing theKullback-Leibler divergence between the variationalapproximation q and the true joint posterior p. Usingresults from the the Generalised Mean Field theory(Xing et al., 2003), we can write the variationaldistributions of the latent variables as follows:

    q() KYk=1

    k1+

    PIi=1 zi,k

    k

    q(ci) LYl=1

    KYk=1

    ki,lf(xi|i,l,

    ki )zi,k!ci,l

    q(zi) KYk=1

    elog(k)

    LYl=1

    ki,lf(xi|i,l,

    ki )

    ci,l

    !!zi,k.

    In the distributions above, the are used toindicate the expected values of the enclosed ran-dom variables. A close inspection of the aboveformulae reveals that these variational distribu-tions have the form q() Dirichlet(1, . . . , K),q(zi) Multinomial(i,1, . . . , i,K), and q(ci) Multinomial(i,1, . . . , i,L), respectively, where the

    parameters k, i,k and i,l are given by the followingequations:k = k +

    IXi=1

    zi,k

    i,k =elog(k)

    QLl=1

    ki,lf(xi|i,l,

    ki )

    ci,l

    PKk=1

    elog(k)

    QLl=1

    ki,lf(xi|i,l,

    ki )

    ci,l

    i,k =

    QKk=1

    `ki,lf(xi|i,l,

    ki )zi,k

    PKk=1

    QKk=1

    ki,lf(xi|i,l,

    ki )zi,k

    and they have the properties: log(k) = k, zi,k =i,k and ci,l = i,l, which suggest that they can be

    computed via fixed point iterations. It can be shownthat this iteration will converge to a local optimum,similar to what happens in an EM algorithm. Empiri-cally, a near global optimal can be obtained by multi-ple random restarts of the fixed point iteration. Typi-cally, such a mean-field variational inference convergesmuch faster then sampling (Xing et al., 2003).

    3. Hyperparameter Estimation

    The parameters of our model, i.e., {,,}, and theDirichlet hyperparameter , can be estimated by max-imizing the lower bound on the log-likelihood as a func-tion of the current values of the hyperparameters, via a

    variational EM algorithm. Due to space limits, detailsof this empirical Bayes estimation scheme are availablein the extended version.

    4. Experiments and Results

    We validated our model on a synthetic microsatellitedataset where the simulated values of the hidden mapvector n of each individual and the population pa-rameters {k, k, k} of each ancestral population areknown as ground truth. The goal is to assess the per-formance of mStruct in terms of accuracy and con-sistency of the estimated map vectors and populationparameters, and test of the correctness of the infer-ence and estimation algorithms we developed. Wealso conduct empirical analysis using mStruct of tworeal datasets: the HGDP-CEPH cell line panel of mi-crosatellite loci and the HGDP SNP data, in compar-ison with the Structureprogram (version 2.1).

    4.1. Validations on Synthetic Data

    We simulated 20 microsatellite genotype datasets us-ing the AdMim generative process described in sec-tion 2.2, with 100 diploid individuals from 2 ances-

  • 7/28/2019 A New Admixture Model for Inference of Population

    6/8

    mStruct: Structure with mutations

    1 50 1000

    0.5

    1

    1 50 1000

    0.5

    1

    0 50 1000

    0.5

    1

    Figure 3. Ancestry spectra for a 3-population simulateddataset. First panel shows the true ancestry proportionvectors. Middle panel shows the estimate by mStruct.Right panel shows the estimate from Structure .

    tral populations, at 50 genotype loci. Each locus has4 founding alleles, separated by adjustable distances;the mutation parameter at each locus for both pop-ulations had default value 0.1, but can be varied tosimulate different degrees of divergence. The foundingallele frequencies, ki , were drawn from a flat Dirich-let prior with parameter 1. The map vectors n weresampled from a symmetric beta distribution with pa-rameter , allowing different levels of admixing. Weexamine the accuracies of several estimates of interestunder a number of different simulation conditions, andfor each condition we report the statistics of the accu-

    racies across 20 iid synthetic datasets. Due to spacelimitations, we only report two experiments below; theadditional results on accuracy of ancestral alleles re-covery and the ancestral-allele frequency estimationare available in the full paper.

    4.1.1. Accuracy of population Map estimate

    The map vector n reflects the proportions of con-tributions from different ancestral population to themaker-alleles of each individual. The display of themap vectors of all individuals in a study populationgives a Map of population structure (see, e.g., Fig. 1in the introduction), which has been the main output

    of the Structureprogram . We compare the accuracyof the estimated n w.r.t. the ground truth recordedduring the simulation in terms of their L1 distances.

    Figure 3 shows an example of this comparison, and wecan see that mStruct is visually more accurate thanStructure. Figure 4 shows the accuracy of the Mapestimate by mStruct on synthetic datasets simulatedwith different properties, in comparison with that ofStructure. Fig. 4(a) shows that, under different de-grees of biases of population admixing induced by theBeta prior of n, mStruct consistently outperformsStructure. Specifically, as the value of the Beta priorhyperparameter increases, fewer individuals tend tobelong completely to only one population, and moreand more individuals become highly admixed. As thefigure shows, the performance of both methods de-grades as we progress toward this end; however, theseverity of degradation of mStruct is much less thanthat of Structure. mStruct remains robust and per-forms better than Structureas the separations betweenfounding alleles decreases (Fig.4(b)), which tends toincreasingly confound the ancestral origins of modern

    alleles. Finally, Fig. 4(c) shows how the presence ofmutations affects the performance of both methods.At very low values of the mutation parameters, theperformances of both models are comparable; but asthe mutation parameter increases in magnitude, theperformance of Structure degrades significantly. Onthe other hand, the decrease in accuracy for mStructis hardly noticeable. This shows that our model isresistant to the confounding effect of large mutations.

    0.5 1 20

    0.1

    0.2

    0.3

    0.4

    0.5

    Dirichlet Parameter

    ||estimatetrue

    ||1

    MstructDumb Structure

    5 20 500

    0.1

    0.2

    0.3

    0.4

    0.5

    Separation between adjacent founder alleles

    ||estimatetrue

    ||1

    MstructDumb Structure

    0.01 0.1 0.50

    0.1

    0.2

    0.3

    0.4

    0.5

    Mutation parameter

    ||estimatetrue

    ||1

    MstructDumb Structure

    (a) (b) (c)

    Figure 4. Accuracy of est. under different conditions.

    4.1.2. Accuracy of parameter estimation.

    An important aspect of guarantee and utility we desirefor our model and inference algorithm is that it shouldoffer consistent estimates of the population parameters{k, k, k} underlying the composition of the ances-tral population and their inheritance processes. Theseestimates offer important insight of the evolutionaryhistory and dynamics of modern population genotypedata. We have extensively investigated the robustnessand accuracy of all these estimates. Due to space lim-itations, here we only report highlights of mutationrate estimation.

    Mutation parameter estimation: We evaluatethe performance at recovery of ks by a simple dis-tance measure, (L1 distance measure), between the

    true and inferred values. We expect that using thebeta prior described earlier improves the recoveryof the population-specific mutation parameters. Asshown in Figure 5, the estimates of ks are robustand remain low-bias under different degree of admixing(due to changing ) and different ancestor dispersion(due to changing distances among the ks). The accu-racy decreases as the value of the mutation parameteritself increases, but remains respectable, as shown inFigure. 5(c).

    0.5 1 20

    0.05

    0.1

    0.15

    0.2

    Dirichlet hyperparameter

    |e

    stimate

    true

    |

    5 20 500

    0.05

    0.1

    0.15

    0.2

    Separation between adjacent founder alleles

    |e

    stimate

    true

    |

    0.01 0.1 0.50

    0.05

    0.1

    0.15

    0.2

    Mutation Parameter

    |estimate

    true

    |

    (a) (b) (c)

    Figure 5. Accuracy of microsatellite mutation para. est.

    4.2. Empirical Analysis of Real Datasets

    The HGDP-CEPH cell line panel (Cann et al., 2002;Cavalli-Sforza, 2005) used in (Rosenberg et al.,2002)contains genotype information from 1056 indvid-

  • 7/28/2019 A New Admixture Model for Inference of Population

    7/8

    mStruct: Structure with mutations

    uals from 52 populations at 377 autosomal microsatel-lite loci, along with geographical and population la-bels. The HGDP SNP data (Conrad et al., 2006) con-tains the SNPs genotypes at 2834 loci of 927 unrelatedindividuals that overlap with the HGDP-CEPH data.To make results for both types of data comparable, wechose the set of only those individuals present in bothdatasets. As in (Rosenberg et al., 2002), the choice ofthe total number of ancestral populations is left to theuser, and here we only show results ofK = 4 due tospace limitations.

    4.2.1. Structural maps from HGDP data

    We compare the structural maps inferred from boththe microsatellite and the SNP data using mStructand Structure(top panels in Figure 6). The structuralmaps produced by both programs are quite similar inthe case of SNPs, but are very different for microsatel-lites. The most obvious difference between the mapsproduced by both programs is the degree of admix-

    ing that the individuals in the program are assigned.Structure assigns each geographical population to adistinct profile. Thus, it seems to predict very littleadmixing effect in modern human populations. Whileuseful for clustering, this might result in loss of po-tentially useful information about actual evolutionaryhistory of populations. In contrast, the structure mapproduced by mStruct for microsatellites suggests thatall populations share a common ancestral populationwith a unique extra component that characterizes theirparticular genotypes. It is interesting to note that clus-tering individuals by the ancestry proportion vectorsdue to mStruct will produce exactly the same cluster-

    ing partitions as that due to Structure. The structuralmaps produced in the case of SNP data are quite simi-lar for both softwares, with results from mStructagainpredicting more admixture than Structure. It is alsointeresting to see that the ancestry proportions for Eu-ropean and Middle Eastern regions are more distinctfrom each other in mStruct than in Structure, allow-ing for better separation of the two geographical re-gions. A possible cause for the inconsistency betweenthe results produced by mStructfor SNP data and mi-crosatellite data could be the large difference betweentheir mutation rates, or due to the choice of a simplis-tic SNP mutation model.This issue will be explored inmore detail in the full version of the paper.4.2.2. Analysis of the mutation spectrums

    Now we report a preliminary analysis of the evolu-tionary dynamics reflected by the estimated mutationspectrums of different ancestral populations (denotedam-spectrum), and of different modern geograph-ical populations (denoted gm-spectrum), which isnot possible by Structure. For the am-spectrum, we

    compute the mean mutation rates over all loci andfounding alleles for each ancestral population as esti-mated by mStruct. We estimate the gm-spectrum asfollows: for every individual, a mutation rate is com-puted as the per-locus number of observed alleles thatare attributed to mutations, weighted by the muta-tion rate corresponding to the ancestral allele chosenfor that locus. This can be computed by observingthe population-indicator (Z) and the allele-indicator(C) for each individual. We then compute the popu-lation mutation rates by averaging mutation rates ofall individuals having the same geographical label.

    As shown in the gm-spectrums in Figure 6 (lower sub-panels on the right), the mutation rates for Africanpopulations are indeed higher than those of other mod-ern populations. This indicates that they diverged ear-lier, a common hypothesis of human migration. Othertrends in the gm-spectrums also reveal interesting in-sights, which we do not have space to discuss. The

    am-spectrums of SNP data in Figure 6 suggest that thefounder ancestral population that dominates modernAfrican populations has a higher mutation rate thanthe other ancestral population, indicating that is theolder of the two ancestral populations. The mutationestimates are largely consistent for both microsatellitesand SNPs in comparative order, but vastly different innumerical values.

    4.3. Model selection

    As with all probabilistic models, we face a tradeoffbetween model complexity and the log-likelihood valuethat the model achieves. In our case, complexity is

    controlled by the number of ancestral populations wepick, K.

    2 3 4 54.2

    4.25

    4.3

    4.35

    4.4x 10

    6

    Number of ancestral populations

    BICv

    alue

    SNP data

    Microsatellite data

    Optimalvalue of K

    Figure 7. BIC scores forK=2 to 5.

    Unlike non-parametric orinfinite dimensional mod-els (e.g., Dirichlet pro-cesses etc.), for modelsof fixed dimension, it isnot clear in general asto what value ofK givesus the best balance be-tween model complexityand log-likelihood. Insuch cases, different infor-mation criteria are often used to determine the optimalmodel complexity. To determine what number of an-cestral populations fit the HGDP SNP and microsatel-lite data best, we computed BIC scores for K=2 toK=5 for both kinds of data separately. The resultsare shown in Figure 7. The BIC curves for both SNPsand microsatellites suggest K=4 as the best fit for thedata.

  • 7/28/2019 A New Admixture Model for Inference of Population

    8/8

    mStruct: Structure with mutations

    Afric

    a

    Europe

    MiddleEa

    st

    CentralS

    outh

    Asia

    East

    Asia

    Oceania

    America

    mStruct

    Structure2.1

    am-spectrum gm-spectrum

    Afric

    a

    Europe

    Mid

    dleEa

    st

    CentralS

    outh

    Asia

    East

    Asia

    Oceania

    America

    mStruct

    Structure2.1

    am-spectrum gm-spectrum

    (a) from Microsatellite data (b) from SNP data

    Figure 6. Structural maps, mutation spectrums, from the HGDP data via mStructand Structure.

    5. Discussions

    We have developed mStruct, which allows estimationof genetic contributions of ancestral populations ineach modern individual in light of both populationadmixture and allele mutation. The variational infer-ence algorithm that we developed allows tractable ap-proximate inference on the model. The ancestral pro-

    portions of each individual enable representing pop-ulation structure in a way that is both visually easyto interpret, as well as amenable to further computa-tional analysis. In conjunction with geographical loca-tion, the inferred ancestry proportions could be usedto detect migrations,sub-populations etc quite easily.Moreover, the ability to estimate population and locusspecific mutation rates also allows us to substantiateevolutionary dynamics claims based on high/low mu-tation rates in certain geographical population, or onhigh/low mutation rates at certain loci in the genome.While the estimates of mutation rates that mStructprovides are not on an absolute scale, the comparison

    of their relative magnitudes is certainly informative.As of now, there remain a number of possible exten-sions to the methodology we presented so far. It wouldbe instructive to see the impact of allowing linked locias in (Falush et al., 2003). We have not yet addressedthe issue of the most suitable choice of mutation pro-cess, but instead have chosen one that is reasonableand computationally tractable. It would be interestingto combine mStruct with the nonparametric Bayesianmodels based on the Dirichlet processes such as (Sohn& Xing, 2007). Our model might be described as anoisy-channel version of the LDA model, where theobservations are modified instances of the original al-leles. It is not hard to imagine applications of thismodel to other tasks such as image modeling or IRtasks involving noisy data, with minor changes in thedistributions from which the observations are sampled.

    Acknowledgements

    This research was supported in part by NSF GrantCCF-0523757 and DBI-0546594.

    References

    Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichletallocation. Journal of Machine Learning Research, 3,9931022.

    Cann, H., de Toma, C., Cazes, L., Legrand, M., Morel, V.,Piouffre, L., Bodmer, J., Bodmer, W., Bonne-Tamir, B.,Cambon-Thomsen, A., et al. (2002). A Human GenomeDiversity Cell Line Panel. Science, 296, 261262.

    Cavalli-Sforza, L. (2005). The Human Genome DiversityProject: past, present and future. Nat Rev Genet, 6,33340.

    Conrad, D., Jakobsson, M., Coop, G., Wen, X., Wall, J.,Rosenberg, N., & Pritchard, J. (2006). A worldwidesurvey of haplotype variation and linkage disequilibriumin the human genome. Nat Genet, 38, 12511260.

    Erosheva, E., Fienberg, S., & Lafferty, J. (2004).Mixed-membership models of scientific publications.Proceedings of the National Academy of Sciences, 101,52205227.

    Excoffier, L., & Hamilton, G. (2003). Comment onGenetic Structure of Human Populations. Science, 300,18771877.

    Falush, D., Stephens, M., & Pritchard, J. (2003). Inferenceof Population Structure Using Multilocus Genotype

    Data Linked Loci and Correlated Allele Frequencies.Genetics, 164, 15671587.

    Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L.(1999). An Introduction to Variational Methods forGraphical Models. Machine Learning, 37, 183233.

    Lin, T., Myers, E., & Xing, E. (2006). Interpreting anony-mous DNA samples from mass disastersprobabilisticforensic inference using genetic markers. Bioinformatics,22, e298.

    Pritchard, J., Stephens, M., & Donnelly, P. (2000).Inference of Population Structure Using MultilocusGenotype Data. Genetics, 155, 945959.

    Rosenberg, N., Pritchard, J., Weber, J., Cann, H., Kidd,K., Zhivotovsky, L., & Feldman, M. (2002). GeneticStructure of Human Populations. Science, 298, 2381

    2385.Sohn, K., & Xing, E. (2007). Spectrum: joint bayesianinference of population structure and recombinationevents. Bioinformatics, 23, i479.

    Valdes, A., Slatkin, M., & Freimer, N. (1993). Allele Fre-quencies at Microsatellite Loci: The Stepwise MutationModel Revisited. Genetics, 133, 737749.

    Xing, E., Jordan, M., & Russell, S. (2003). A generalizedmean field algorithm for variational inference in expo-nential families. Uncertainty in Artificial Intelligence(UAI2003). Morgan Kaufmann Publishers.