
momi Documentation, Release 2.0.1

Jack Kamm, Jonathan Terhorst, Richard Durbin, Yun Song

Dec 09, 2019

Contents

1 Introduction

2 Installation
  2.1 Installing with conda
  2.2 Installing with pip
  2.3 Troubleshooting

3 Parallelization

4 Tutorial
  4.1 Constructing a demographic history
  4.2 Plotting a demography
  4.3 Reading and simulating data
  4.4 Inference
  4.5 Statistics of the SFS
  4.6 Bootstrap confidence intervals
  4.7 Other features

5 API Documentation
  5.1 Demographic models
  5.2 Plotting
  5.3 Data
  5.4 Statistics

6 Indices and tables

Index

CHAPTER 1

Introduction

momi (MOran Models for Inference) is a Python package that computes the expected sample frequency spectrum (SFS), a statistic commonly used in population genetics, and uses it to fit demographic history.

The code is on GitHub (https://github.com/jackkamm/momi2), and a preprint describing the method can be found on bioRxiv (https://www.biorxiv.org/content/early/2018/03/23/287268).

CHAPTER 2

Installation

momi requires Python >= 3.5, and can be installed with conda (https://conda.io/docs/) or pip (https://pip.readthedocs.io/en/stable/).

2.1 Installing with conda

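1. Download Anaconda (https://www.anaconda.com/download/) or Miniconda (https://conda.io/miniconda.html).

2. (Optional) Create a separate conda environment (https://conda.io/docs/user-guide/tasks/manage-environments.html) to install into:

conda create -n my-momi-env
conda activate my-momi-env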

3. Install:

conda install momi -c conda-forge -c bioconda -c jackkamm

Note that the order of the -c flags matters: it determines the priority of each channel when installing dependencies.

2.2 Installing with pip

The momi source distribution is provided on PyPI, and can be downloaded, built, and installed with pip.

First, ensure the following non-Python dependencies are installed with your favorite package manager (e.g. apt-get, yum, brew, conda):

1. hdf5

2. gsl

3. (OSX only) OpenMP-enabled clang

• If using homebrew, do brew install llvm libomp.

• Or if using conda, do conda install llvm-openmp clang.



• You will also need to set the environment variable CC=/path/to/clang during installation.

Then do pip install momi.

On OSX, remember to set the environment variable:

CC=/path/to/clang pip install momi

If you installed the above dependencies using homebrew, this should be:

CC=$(brew --prefix llvm)/bin/clang pip install momi

Depending on your system, pip may have trouble installing some dependencies (such as numpy, msprime, pysam). In this case, you should manually install these dependencies and try again.

See the Python venv tutorial (https://docs.python.org/3/tutorial/venv.html) to install into a virtual environment.

2.3 Troubleshooting

2.3.1 “ModuleNotFoundError: No module named ‘momi.convolution’”

This is usually caused by trying to import momi when in the top-level folder of the momi2 project. In this case, Python will try to import the local, unbuilt copy of the momi subdirectory rather than the installed version.

To fix this, simply cd out of the top-level directory before importing momi.

2.3.2 “clang: error: unsupported option ‘-fopenmp’”

On macOS the system version of clang does not support OpenMP, which causes this error when building momi with pip.

To solve this, make sure you have OpenMP-enabled LLVM/clang installed, and set the environment variable CC as noted in the pip installation instructions above.

Note: it is NOT recommended to replace clang with gcc on macOS, as this can cause strange numerical errors when used with Intel MKL; for example, see https://github.com/ContinuumIO/anaconda-issues/issues/8803

CHAPTER 3

Parallelization

momi will automatically use all available CPUs to perform computations in parallel. You can control the number of threads by setting the environment variable OMP_NUM_THREADS.

To take full advantage of parallelization, it is recommended to make sure numpy is linked against a parallel BLAS implementation such as MKL or OpenBLAS. This is automatically taken care of in most packaged, precompiled versions of numpy, such as in Anaconda Python.
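As a minimal sketch, the thread limit can also be requested from inside Python. This assumes the variable is set before momi (and numpy) are imported for the first time, since OpenMP runtimes typically read OMP_NUM_THREADS once at startup; the value 4 is only an illustration.

```python
import os

# Assumption: this runs before the first `import momi`, because OpenMP
# runtimes usually read OMP_NUM_THREADS only when they are initialized.
os.environ["OMP_NUM_THREADS"] = "4"  # illustrative value

# Read the setting back to confirm what will be requested.
n_threads = int(os.environ["OMP_NUM_THREADS"])
print(f"OpenMP thread limit requested: {n_threads}")
```

Setting the variable in the shell that launches Python has the same effect and avoids any dependence on import order.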


CHAPTER 4

Tutorial

This is a tutorial for the momi package. You can run the ipython notebook that created this tutorial at docs/tutorial.ipynb.

To get started, import the momi package:

[1]: import momi

Some momi operations can take a while to complete, so it is useful to turn on status monitoring messages to check that everything is running normally. Here, we output logging messages to the file tutorial.log.

[2]: import logging
     logging.basicConfig(level=logging.INFO,
                         filename="tutorial.log")

4.1 Constructing a demographic history

Use DemographicModel to construct a demographic history. Below, we set the diploid effective size N_e=1.2e4, the generation time gen_time=29 years per generation, and mutation rate muts_per_gen=1.25e-8 per base per generation.

[3]: model = momi.DemographicModel(N_e=1.2e4, gen_time=29,
                                   muts_per_gen=1.25e-8)

Use DemographicModel.add_leaf to add sampled populations. Below we add 3 populations: YRI, CHB, and NEA. The archaic NEA population is sampled t=5e4 years ago. The YRI population has size N=1e5, while the CHB population is initialized to have size N=1e5 and growth rate g=5e-4 per year (NEA starts at the default size 1.2e4).

[4]: # add YRI leaf at t=0 with size N=1e5
     model.add_leaf("YRI", N=1e5)
     # add CHB leaf at t=0, N=1e5, growing at rate 5e-4 per unit time (year)
     model.add_leaf("CHB", N=1e5, g=5e-4)
     # add NEA leaf at 50kya and default N
     model.add_leaf("NEA", t=5e4)

Demographic events are added to the model by the methods DemographicModel.set_size and DemographicModel.move_lineages. DemographicModel.set_size is used to change population size and growth rate, while DemographicModel.move_lineages is used for population split and admixture events.

[5]: # stop CHB growth at 10kya
     model.set_size("CHB", g=0, t=1e4)

     # at 45kya CHB receive a 3% pulse from GhostNea
     model.move_lineages("CHB", "GhostNea", t=4.5e4, p=.03)
     # at 55kya GhostNea joins onto NEA
     model.move_lineages("GhostNea", "NEA", t=5.5e4)

     # at 80 kya CHB goes thru bottleneck
     model.set_size("CHB", N=100, t=8e4)
     # at 85 kya CHB joins onto YRI; YRI is set to size N=1.2e4
     model.move_lineages("CHB", "YRI", t=8.5e4, N=1.2e4)

     # at 500 kya YRI joins onto NEA
     model.move_lineages("YRI", "NEA", t=5e5)

Note that events can involve other populations aside from the 3 sampled populations YRI, CHB, and NEA. Unsampled populations are also known as “ghost populations”. In this example, CHB receives a small amount of admixture from a population “GhostNea”, which splits off from NEA at an earlier date.

4.2 Plotting a demography

momi relies on matplotlib (https://matplotlib.org/) for plotting. In a notebook, first call %matplotlib inline to enable matplotlib, then you can use DemographyPlot to create a plot of the demographic model.

[6]: %matplotlib inline

     yticks = [1e4, 2.5e4, 5e4, 7.5e4, 1e5, 2.5e5, 5e5, 7.5e5]

     fig = momi.DemographyPlot(
         model, ["YRI", "CHB", "GhostNea", "NEA"],
         figsize=(6,8),
         major_yticks=yticks,
         linthreshy=1e5, pulse_color_bounds=(0,.25))



Note the user needs to specify the order of all populations (including ghost populations) along the x-axis.

The argument linthreshy is useful for visualizing demographic events at different scales. In our example, the split time of NEA is far above the other events. Times below linthreshy are plotted on a linear scale, while times above it are plotted on a log scale.
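To make the mixed scale concrete, the sketch below maps times to axis positions with a simple linear-then-logarithmic transform. It only illustrates the idea: the function name symlog_position is hypothetical, and the exact transform used by momi and matplotlib is not given in this tutorial and may differ in detail.

```python
import math

def symlog_position(t, linthreshy):
    """Toy axis position: linear up to linthreshy, logarithmic above it.

    Hypothetical helper for illustration only, not part of momi.
    """
    if t <= linthreshy:
        # Linear part: the threshold itself lands at position 1.0.
        return t / linthreshy
    # Log part: each further factor of 10 adds one unit of height.
    return 1.0 + math.log10(t / linthreshy)

linthreshy = 1e5
for t in [1e4, 5e4, 1e5, 5e5, 1e6]:
    print(f"t={t:>9.0f} -> position {symlog_position(t, linthreshy):.3f}")
```

With this kind of scale, events before 1e5 years keep their relative spacing, while the much older NEA split is compressed instead of dominating the plot.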

4.3 Reading and simulating data

In this section we demonstrate how to read in data from a VCF file. We start by simulating a dataset so that we can read it in later.

4.3.1 Simulating data

Use DemographicModel.simulate_vcf to simulate data (using msprime, https://msprime.readthedocs.io/) and save the resulting dataset to a VCF file.

Below we simulate a dataset of diploid individuals, with 20 “chromosomes” of length 500 kb (bases_per_locus = int(5e5) in the cell below), with a recombination rate of 1.25e-8.



[7]: recoms_per_gen = 1.25e-8
     bases_per_locus = int(5e5)
     n_loci = 20
     ploidy = 2

     # n_alleles per population (n_individuals = n_alleles / ploidy)
     sampled_n_dict = {"NEA":2, "YRI":4, "CHB":4}

     # create data directory if it doesn't exist
     !mkdir -p tutorial_datasets/

     # simulate 20 "chromosomes", saving each in a separate vcf file
     for chrom in range(1, n_loci+1):
         model.simulate_vcf(
             f"tutorial_datasets/{chrom}",
             recoms_per_gen=recoms_per_gen,
             length=bases_per_locus,
             chrom_name=f"chr{chrom}",
             ploidy=ploidy,
             random_seed=1234+chrom,
             sampled_n_dict=sampled_n_dict,
             force=True)

We saved the datasets in tutorial_datasets/$chrom.vcf.gz. Accompanying tabix and bed files are also created.

[8]: !ls tutorial_datasets/

10.bed         14.bed         18.bed         2.bed         6.bed
10.vcf.gz      14.vcf.gz      18.vcf.gz      2.vcf.gz      6.vcf.gz
10.vcf.gz.tbi  14.vcf.gz.tbi  18.vcf.gz.tbi  2.vcf.gz.tbi  6.vcf.gz.tbi
11.bed         15.bed         19.bed         3.bed         7.bed
11.vcf.gz      15.vcf.gz      19.vcf.gz      3.vcf.gz      7.vcf.gz
11.vcf.gz.tbi  15.vcf.gz.tbi  19.vcf.gz.tbi  3.vcf.gz.tbi  7.vcf.gz.tbi
12.bed         16.bed         1.bed          4.bed         8.bed
12.vcf.gz      16.vcf.gz      1.vcf.gz       4.vcf.gz      8.vcf.gz
12.vcf.gz.tbi  16.vcf.gz.tbi  1.vcf.gz.tbi   4.vcf.gz.tbi  8.vcf.gz.tbi
13.bed         17.bed         20.bed         5.bed         9.bed
13.vcf.gz      17.vcf.gz      20.vcf.gz      5.vcf.gz      9.vcf.gz
13.vcf.gz.tbi  17.vcf.gz.tbi  20.vcf.gz.tbi  5.vcf.gz.tbi  9.vcf.gz.tbi

4.3.2 Read in data from vcf

Now we read in the datasets we just simulated.

The first step is to create a mapping from individuals to populations. We save this mapping to a text file whose first column is for individuals and second column is for populations.

[9]: # a dict mapping samples to populations
     ind2pop = {}
     for pop, n in sampled_n_dict.items():
         for i in range(int(n / ploidy)):
             # in the vcf, samples are named like YRI_0, YRI_1, CHB_0, etc
             ind2pop["{}_{}".format(pop, i)] = pop

     with open("tutorial_datasets/ind2pop.txt", "w") as f:
         for i, p in ind2pop.items():
             print(i, p, sep="\t", file=f)

     !cat tutorial_datasets/ind2pop.txt

NEA_0   NEA
YRI_0   YRI
YRI_1   YRI
CHB_0   CHB
CHB_1   CHB

Compute allele counts

The next step is to compute the allele counts for each VCF separately. To do this, use the shell command python -m momi.read_vcf $VCF $IND2POP $OUTFILE --bed $BED:

[10]: %%sh
      for chrom in `seq 1 20`;
      do
          python -m momi.read_vcf \
                 tutorial_datasets/$chrom.vcf.gz tutorial_datasets/ind2pop.txt \
                 tutorial_datasets/$chrom.snpAlleleCounts.gz \
                 --bed tutorial_datasets/$chrom.bed
      done

The --bed flag specifies a BED accessible regions file; only regions present in the BED file are read from the VCF. The BED file also determines the length of the data in bases. If no BED file is specified, then all SNPs are read, and the length of the data is set to unknown.

You should NOT use the same BED file across multiple VCF files, and should ensure your BED files do not contain overlapping regions. Otherwise, regions will be double-counted when computing the length of the data. You can use tabix to split a single BED file into multiple non-overlapping files.
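The overlap requirement can be checked before running read_vcf. The following is a minimal pure-Python sketch, not part of momi: the helper name find_overlaps and the example intervals are hypothetical, and the intervals are assumed to be already parsed into (chrom, start, end) tuples using the usual half-open BED convention.

```python
def find_overlaps(intervals):
    """Return the (chrom, start, end) intervals that overlap an earlier one.

    Hypothetical helper for illustration. BED intervals are half-open,
    [start, end), so intervals that merely touch do not overlap.
    """
    overlaps = []
    last_chrom, max_end = None, None
    for chrom, start, end in sorted(intervals):
        if chrom != last_chrom:
            # First interval on a new chromosome: nothing to compare against.
            last_chrom, max_end = chrom, end
            continue
        if start < max_end:
            overlaps.append((chrom, start, end))
        # Track the furthest end seen so far on this chromosome.
        max_end = max(max_end, end)
    return overlaps

# Made-up regions, as they might be pooled from several BED files.
regions = [
    ("chr1", 0, 100),
    ("chr1", 100, 200),   # touches the previous interval, no overlap
    ("chr1", 150, 300),   # overlaps [100, 200)
    ("chr2", 0, 50),
]
print(find_overlaps(regions))
```

An empty result means every base is counted at most once when the data length is summed across files.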

By default, ancestral alleles are read from the INFO AA field (SNPs missing this field are skipped), but this behavior can be changed by setting the flags --no_aa or --outgroup.

Use the --help flag to see more command line options, and see also the documentation for SnpAlleleCounts.read_vcf, which provides the same functionality within Python.

Extract combined SFS

Use python -m momi.extract_sfs $OUTFILE $NBLOCKS $COUNTS... from the command line to combine the SFS across multiple files, and split the SFS into a number of equally sized blocks for jackknifing and bootstrapping.

[11]: %%sh
      python -m momi.extract_sfs tutorial_datasets/sfs.gz 100 tutorial_datasets/*.snpAlleleCounts.gz

Use the --help flag to see the command line options, and see also the documentation for SnpAlleleCounts.concatenate and SnpAlleleCounts.extract_sfs, which provide the same functionality within Python.



Read SFS into Python

Finally, read the SFS file into Python with Sfs.load:

[12]: sfs = momi.Sfs.load("tutorial_datasets/sfs.gz")

4.4 Inference

In this section we will infer a demography for the data we simulated. We will start by fitting a sub-demography on CHB and YRI, and then iteratively build on this model by adding the NEA population and additional parameters and events.

4.4.1 An initial model for YRI and CHB

We will start by fitting a simplified model without admixture. Use DemographicModel() to initialize it as before:

[13]: no_pulse_model = momi.DemographicModel(N_e=1.2e4, gen_time=29, muts_per_gen=1.25e-8)

Note that muts_per_gen is optional, and can be omitted if unknown, but specifying it provides extra power to the model.

Use DemographicModel.set_data to add data to the model for inference:

[14]: no_pulse_model.set_data(sfs)

To add parameters to the model, use DemographicModel.add_size_param, DemographicModel.add_time_param, DemographicModel.add_growth_param, and DemographicModel.add_pulse_param.

Below we define parameters for the CHB size, the CHB growth rate, and the CHB-YRI split time:

[15]: # random initial value
      no_pulse_model.add_size_param("n_chb")
      # initial value 0; user-specified lower,upper bounds
      no_pulse_model.add_growth_param("g_chb", 0, lower=-1e-3, upper=1e-3)
      # random initial value; user-specified lower bound
      no_pulse_model.add_time_param("t_chb_yri", lower=1e4)

Demographic events can be added in the same way as before. Parameters are specified by name (string), while constants are specified as numbers (float).

[16]: no_pulse_model.add_leaf("CHB", N="n_chb", g="g_chb")
      no_pulse_model.add_leaf("YRI", N=1e5)
      no_pulse_model.set_size("CHB", t=1e4, g=0)
      no_pulse_model.move_lineages("CHB", "YRI", t="t_chb_yri", N=1.2e4)

Use DemographicModel.optimize to search for the MLE. It is a thin wrapper around scipy.optimize.minimize (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html) and accepts similar arguments.

[17]: no_pulse_model.optimize(method="TNC")




/home/jack/miniconda3/envs/momi2-conda-nomkl/lib/python3.6/site-packages/autograd/numpy/numpy_vjps.py:444: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return lambda g: g[idxs]

[17]: fun: 0.003629860012261981
      jac: array([-6.52603508e-06, -3.58838649e-02, -8.63552150e-10])
      kl_divergence: 0.003629860012261981
      log_likelihood: -30444.557113445273
      message: 'Converged (|f_n-f_(n-1)| ~= 0)'
      nfev: 56
      nit: 24
      parameters: ParamsDict({'n_chb': 14368074.379920935, 'g_chb': 0.000997783661801449, 't_chb_yri': 114562.32318043888})
      status: 1
      success: True
      x: array([1.64805192e+01, 9.97783662e-04, 1.04562323e+05])

The default optimization method is method="TNC" (truncated Newton conjugate). This is very accurate but can be slow for large models; for large models, method="L-BFGS-B" is a good choice.

We can print the inferred parameter values with DemographicModel.get_params:

[18]: no_pulse_model.get_params()

[18]: ParamsDict({'n_chb': 14368074.379920935, 'g_chb': 0.000997783661801449, 't_chb_yri': 114562.32318043888})

and we can plot the inferred demography as before:

[19]: # plot the model
      fig = momi.DemographyPlot(no_pulse_model, ["YRI", "CHB"],
                                figsize=(6,8), linthreshy=1e5,
                                major_yticks=yticks,
                                pulse_color_bounds=(0,.25))


4.4.2 Adding NEA to the existing model

Now we add in the NEA population, along with a parameter for its split time t_anc. We use the keyword lower_constraints to require that t_anc > t_chb_yri.

[20]: no_pulse_model.add_leaf("NEA", t=5e4)
      no_pulse_model.add_time_param("t_anc", lower=5e4, lower_constraints=["t_chb_yri"])
      no_pulse_model.move_lineages("YRI", "NEA", t="t_anc")

We search for the new MLE and plot the inferred demography:

[21]: no_pulse_model.optimize()

      fig = momi.DemographyPlot(
          no_pulse_model, ["YRI", "CHB", "NEA"],
          figsize=(6,8), linthreshy=1e5,
          major_yticks=yticks)



4.4.3 Build a new model adding NEA->CHB

Now we create a new DemographicModel by copying the previous model and adding a NEA->CHB migration arrow.

[22]: add_pulse_model = no_pulse_model.copy()
      add_pulse_model.add_pulse_param("p_pulse", upper=.25)
      add_pulse_model.add_time_param(
          "t_pulse", upper_constraints=["t_chb_yri"])

      add_pulse_model.move_lineages("CHB", "GhostNea", t="t_pulse", p="p_pulse")

      add_pulse_model.add_time_param(
          "t_ghost", lower=5e4,
          lower_constraints=["t_pulse"], upper_constraints=["t_anc"])

      add_pulse_model.move_lineages("GhostNea", "NEA", t="t_ghost")

It turns out this model has local optima, so we demonstrate how to fit a few independent runs with different starting parameters.

Use DemographicModel.set_params to set new parameter values to start the search from. If a parameter is not specified and randomize=True, a new value will be randomly sampled for it.

[23]: results = []
      n_runs = 3
      for i in range(n_runs):
          print(f"Starting run {i+1} out of {n_runs}...")
          add_pulse_model.set_params(
              # parameters inherited from no_pulse_model are set to their previous values
              no_pulse_model.get_params(),
              # other parameters are set to random initial values
              randomize=True)

          results.append(add_pulse_model.optimize(options={"maxiter":200}))

      # sort results according to log likelihood, pick the best (largest) one
      best_result = sorted(results, key=lambda r: r.log_likelihood)[-1]

      add_pulse_model.set_params(best_result.parameters)
      best_result

Starting run 1 out of 3...
Starting run 2 out of 3...
Starting run 3 out of 3...

[23]: fun: 0.006390407169686503
      jac: array([ 3.86428996e-08, -8.26898179e-02,  4.63629649e-13, -7.35845677e-14,
                  -2.69540750e-07, -2.98321219e-07,  1.46475502e-08])
      kl_divergence: 0.006390407169686503
      log_likelihood: -60986.74341175791
      message: 'Converged (|f_n-f_(n-1)| ~= 0)'
      nfev: 78
      nit: 21
      parameters: ParamsDict({'n_chb': 11824540.526927987, 'g_chb': 0.001, 't_chb_yri': 100366.89310119499, 't_anc': 524505.2984953767, 'p_pulse': 0.017396498328507242, 't_pulse': 39601.33887335411, 't_ghost': 50387.43446019594})
      status: 1
      success: True
      x: array([ 1.62856876e+01,  1.00000000e-03,  9.03668931e+04,  4.24138405e+05,
                -4.03393674e+00, -4.28160158e-01, -7.10966453e+00])

[24]: # plot the model
      fig = momi.DemographyPlot(
          add_pulse_model, ["YRI", "CHB", "GhostNea", "NEA"],
          linthreshy=1e5, figsize=(6,8),
          major_yticks=yticks)

      # put legend in upper left corner
      fig.draw_N_legend(loc="upper left")

[24]: <matplotlib.legend.Legend at 0x7fdb6ff738d0>



4.5 Statistics of the SFS

Here we discuss how to compute statistics of the SFS, for evaluating the goodness-of-fit of our models, and for estimating the mutation rate.

4.5.1 Goodness-of-fit

Use SfsModelFitStats to see how well various statistics of the SFS fit a model, via the block-jackknife.

Below we create an SfsModelFitStats to evaluate the goodness-of-fit of the no_pulse_model.

[25]: no_pulse_fit_stats = momi.SfsModelFitStats(no_pulse_model)
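For readers unfamiliar with the block-jackknife, here is a generic delete-one sketch for equally sized blocks. It is not momi's internal code: the function name block_jackknife_sd and the per-block values are made up for illustration, and the statistic is simply the mean.

```python
import math

def block_jackknife_sd(block_values, statistic):
    """Delete-one block-jackknife standard deviation of a statistic.

    Generic textbook estimator for equally sized blocks; a hypothetical
    helper for illustration, not momi's implementation.
    """
    n = len(block_values)
    # Recompute the statistic with each block left out in turn.
    leave_one_out = [
        statistic(block_values[:i] + block_values[i + 1:]) for i in range(n)
    ]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((x - center) ** 2 for x in leave_one_out)
    return math.sqrt(variance)

def mean(values):
    return sum(values) / len(values)

# Made-up per-block values of some summary statistic.
per_block = [1.0, 2.0, 3.0, 4.0, 5.0]
sd = block_jackknife_sd(per_block, mean)
print(f"jackknife SD of the mean: {sd:.4f}")
```

For the mean, this estimator reduces to the familiar standard error, which makes it a convenient sanity check before applying the same recipe to more complex statistics.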

One important statistic is the f4 or “ABBA-BABA” statistic for detecting introgression (Patterson et al. 2012, http://www.genetics.org/content/192/3/1065).

In the absence of admixture, f4(YRI, CHB, NEA, AncestralAllele) should be 0, but for our dataset it will be negative due to the NEA->CHB admixture.

Use SfsModelFitStats.f4 to compute f4 stats. For the no-pulse model, we see that f4(YRI, CHB, NEA, AncestralAllele) is indeed negative.


[26]: print("Computing f4(YRI, CHB, NEA, AncestralAllele)")
      f4 = no_pulse_fit_stats.f4("YRI", "CHB", "NEA", None)

      print("Expected = {}".format(f4.expected))
      print("Observed = {}".format(f4.observed))
      print("SD = {}".format(f4.sd))
      print("Z(Expected-Observed) = {}".format(f4.z_score))

Computing f4(YRI, CHB, NEA, AncestralAllele)
Expected = 6.938893903907228e-18
Observed = -0.003537103263299063
SD = 0.0017511821832936283
Z(Expected-Observed) = -2.0198373972983648
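The relationship between the printed quantities can be checked by hand. The sketch below uses only the numbers shown in the output above; the sign convention (observed minus expected, divided by the jackknife SD) is inferred from those numbers rather than taken from momi's source, so treat it as an assumption.

```python
# Values copied from the printed output above.
expected = 6.938893903907228e-18
observed = -0.003537103263299063
sd = 0.0017511821832936283
reported_z = -2.0198373972983648

# Assumption inferred from the printed numbers: the residual is
# (observed - expected), scaled by the jackknife standard deviation.
z = (observed - expected) / sd

print(f"recomputed Z = {z:.6f}, reported Z = {reported_z:.6f}")
```

A Z-score of about -2 means the observed f4 lies roughly two jackknife standard deviations below the value the no-pulse model predicts.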

The related f2 and f3 statistics are also available via SfsModelFitStats.f2 and SfsModelFitStats.f3.

Another method for evaluating model fit is SfsModelFitStats.all_pairs_ibs, which computes the probability that two random alleles are the same, for every pair of populations:

[27]: no_pulse_fit_stats.all_pairs_ibs()

[27]:   Pop1 Pop2  Expected  Observed         Z
      0  YRI  YRI  0.699637  0.706936  1.924912
      1  NEA  NEA  0.732753  0.720787 -1.388584
      2  CHB  NEA  0.545176  0.548234  0.668636
      3  CHB  YRI  0.694800  0.697731  0.651155
      4  NEA  YRI  0.545176  0.543831 -0.334112
      5  CHB  CHB  0.965634  0.964595 -0.231728

Finally, the method SfsModelFitStats.tensor_prod can be used to compute very general statistics of the SFS (specifically, linear combinations of tensor-products of the SFS). See the documentation for more details.

Limitations of SfsModelFitStats

Note that the SfsModelFitStats class above has some limitations. First, it computes goodness-of-fit for the SFS without any missing data; all entries with missing samples are removed. For datasets with many individuals and pervasive missingness, this can result in most or all of the data being removed.

In such cases you can restrict the SFS to a smaller number of samples; then all SNPs with at least that many non-missing individuals will be used. For example,

[28]: no_pulse_fit_stats = momi.SfsModelFitStats(no_pulse_model, {"YRI": 2, "CHB": 2, "NEA": 2})

will compute statistics for the SFS restricted to 2 samples per population.

The second limitation of SfsModelFitStats is that it ignores the mutation rate; it only fits the SFS normalized to be a probability distribution. However, see the next subsection on how to evaluate the total number of mutations in the data.

4.5.2 Estimating mutation rate

To evaluate the total number of mutations in the data, e.g. to fit the mutation rate, use the method DemographicModel.fit_within_pop_diversity, which computes the within-population nucleotide diversity, i.e. the heterozygosity of a random individual in that population assuming Hardy-Weinberg Equilibrium:

[29]: no_pulse_model.fit_within_pop_diversity()

[29]:   Pop    EstMutRate   JackknifeSD  JackknifeZscore
      0  CHB  1.283114e-08  1.624236e-09         0.203875
      1  YRI  1.215218e-08  1.571356e-10        -2.213517
      2  NEA  1.301250e-08  4.018888e-10         1.275228

This method returns a dataframe giving estimates for the mutation rate. Note that there is an estimate for each population. These are non-independent estimates of the same value, just computed in different ways (by comparing the expected and observed heterozygosity for each population separately). These estimates account for missingness in the data; it is fine to use this method on datasets with large amounts of missingness.

Since we initialized our model with muts_per_gen=1.25e-8, the method also returns a Z-value for the residuals of the estimated mutation rates.
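The Z-values in the table can be reproduced from the other two columns. This sketch copies the rounded numbers from the output above and assumes, based on those numbers only, that each residual is the estimate minus the muts_per_gen the model was initialized with, divided by the jackknife SD.

```python
# Mutation rate the model was initialized with.
muts_per_gen = 1.25e-8

# (population, EstMutRate, JackknifeSD) copied from the table above.
rows = [
    ("CHB", 1.283114e-08, 1.624236e-09),
    ("YRI", 1.215218e-08, 1.571356e-10),
    ("NEA", 1.301250e-08, 4.018888e-10),
]

# Assumption inferred from the table: Z = (estimate - muts_per_gen) / SD.
z_scores = {pop: (est - muts_per_gen) / sd for pop, est, sd in rows}

for pop, z in z_scores.items():
    print(f"{pop}: Z = {z:.4f}")
```

Because the table values are rounded, the recomputed Z-scores agree with the printed column only to about four decimal places.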

4.6 Bootstrap confidence intervals

Use Sfs.resample to create bootstrap datasets by resampling blocks of the SFS.

To generate confidence intervals, we can refit the model on the bootstrap datasets and examine the quantiles of the re-inferred parameters. Below we do this for a very small number of bootstraps and a simplified fitting procedure. In practice you would want to generate hundreds of bootstraps on a cluster computer.

[30]: n_bootstraps = 5
      # make copies of the original models to avoid changing them
      no_pulse_copy = no_pulse_model.copy()
      add_pulse_copy = add_pulse_model.copy()

      bootstrap_results = []
      for i in range(n_bootstraps):
          print(f"Fitting {i+1}-th bootstrap out of {n_bootstraps}")

          # resample the data
          resampled_sfs = sfs.resample()
          # tell models to use the new dataset
          no_pulse_copy.set_data(resampled_sfs)
          add_pulse_copy.set_data(resampled_sfs)

          # choose new random parameters for submodel, optimize
          no_pulse_copy.set_params(randomize=True)
          no_pulse_copy.optimize()
          # initialize parameters from submodel, randomizing the new parameters
          add_pulse_copy.set_params(no_pulse_copy.get_params(),
                                    randomize=True)
          add_pulse_copy.optimize()

          bootstrap_results.append(add_pulse_copy.get_params())

Fitting 1-th bootstrap out of 5
Fitting 2-th bootstrap out of 5
Fitting 3-th bootstrap out of 5
Fitting 4-th bootstrap out of 5
Fitting 5-th bootstrap out of 5
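The quantile step mentioned above is not shown in the tutorial, so here is one possible sketch. It assumes each bootstrap result can be indexed by parameter name like a dict; the helper names percentile and bootstrap_interval are hypothetical, and the t_pulse values are invented stand-ins rather than real fits.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)                      # pos >= 0, so int() floors
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def bootstrap_interval(results, name, level=95):
    """Percentile interval for one parameter across bootstrap fits."""
    values = [r[name] for r in results]
    tail = (100 - level) / 2
    return percentile(values, tail), percentile(values, 100 - tail)

# Invented stand-ins for bootstrap_results; real runs need far more replicates.
example_results = [
    {"t_pulse": 36000.0},
    {"t_pulse": 38500.0},
    {"t_pulse": 39600.0},
    {"t_pulse": 41000.0},
    {"t_pulse": 43500.0},
]
low, high = bootstrap_interval(example_results, "t_pulse")
print(f"95% percentile interval for t_pulse: ({low:.0f}, {high:.0f})")
```

With only five replicates the interval is little more than the observed range, which is why hundreds of bootstraps are recommended in practice.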

We can visualize the bootstrap results by overlaying them onto a single plot.

[31]: # make canvas, but delay plotting the demography (draw=False)
      fig = momi.DemographyPlot(
          add_pulse_model, ["YRI", "CHB", "GhostNea", "NEA"],
          linthreshy=1e5, figsize=(6,8),
          major_yticks=yticks,
          draw=False)

      # plot bootstraps onto the canvas in transparency
      for params in bootstrap_results:
          fig.add_bootstrap(
              params,
              # alpha=0: totally transparent. alpha=1: totally opaque
              alpha=1/n_bootstraps)

      # now draw the inferred demography on top of the bootstraps
      fig.draw()
      fig.draw_N_legend(loc="upper left")

[31]: <matplotlib.legend.Legend at 0x7fdb702e54e0>



4.7 Other features

4.7.1 Stochastic gradient descent

For large models, it can be useful to perform stochastic optimization: instead of computing the full likelihood at every step, we use a random subset of SNPs at each step to estimate the likelihood gradient. This is especially useful for rapidly searching for a reasonable starting point, from which full optimization can be performed.

DemographicModel.stochastic_optimize implements stochastic optimization with the ADAM algorithm (https://arxiv.org/pdf/1412.6980.pdf). Setting svrg_epoch=n makes the optimizer use the full likelihood every n steps, which can lead to better convergence (see SVRG, https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf).

The cell below performs 10 steps of stochastic optimization, using 1000 random SNPs per step, and computing the full likelihood every 3 iterations.

[32]: add_pulse_copy.stochastic_optimize(snps_per_minibatch=1000, num_iters=10, svrg_epoch=3)

[32]: fun: 3.5684750877096234
      jac: array([ 2.16057690e-06, -2.30524330e-03,  2.41528617e-07, -1.54878487e-07,
                   1.07274014e-03, -3.48843607e-10, -2.21603171e-09])
      log_likelihood: 3.5684750877096234
      message: 'Maximum number of iterations reached'
      nit: 9
      parameters: ParamsDict({'n_chb': 17322702.843983896, 'g_chb': -0.001, 't_chb_yri': 94934.93099677152, 't_anc': 515875.5851996526, 'p_pulse': 0.0008410052687267091, 't_pulse': 1081.016605046677, 't_ghost': 50008.16334495788})
      success: False
      x: array([ 1.66675285e+01, -1.00000000e-03,  8.49349310e+04,  4.20940654e+05,
                -7.08007127e+00, -4.46383757e+00, -1.09520024e+01])

4.7.2 Using functions of model parameters as demographic parameters

momi allows you to set a demographic parameter as a function of other model parameters.

For example, suppose we have a parameter t_div for the divergence time between 2 populations, and we would like to add a pulse event at time t_div / 2.

This can be achieved by passing a lambda function to move_lineages as follows:

model.move_lineages(pop1, pop2, t=lambda params: params.t_div / 2)

The lambda function should take a single argument, params, which has all the defined parameters as attributes, e.g. params.t_div if you have defined a t_div parameter.
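The calling convention can be tried out without momi. In the sketch below, SimpleNamespace is only a hypothetical stand-in for the params object that momi passes to the function, and the value of t_div is arbitrary.

```python
from types import SimpleNamespace

# Hypothetical stand-in for momi's params object: parameters as attributes.
params = SimpleNamespace(t_div=8.0e4)

# The same kind of callable that would be passed as t=... to move_lineages.
pulse_time = lambda params: params.t_div / 2

print(pulse_time(params))
```

Testing the function on a stand-in like this is a cheap way to catch attribute typos before running an optimization.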

As another example, for a population that is exponentially growing since time t, we may parametrize the growth rate in terms of the population sizes at the beginning and end of the epoch:

model.set_size(pop, N="N_end", g=lambda params: np.log(params.N_end/params.N_start)/params.t_start)
model.set_size(pop, t="t_start", g=0)

This can be more numerically stable than parametrizing the growth rate directly (since small changes in the growth rate can cause very large changes in population sizes).
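The arithmetic behind this parametrization can be checked with a small sketch. It assumes the epoch ends at the present (t=0) and that, going back t time units, the size is N_end * exp(-g * t); the function names and numbers are illustrative, and plain math is used here whereas a function passed to momi would need autograd-compatible operations.

```python
import math

def growth_rate(N_start, N_end, t_start):
    """Growth rate implied by the sizes at the two ends of the epoch."""
    return math.log(N_end / N_start) / t_start

def size_at(t, N_end, g):
    # Assumption: t is measured backwards from the end of the epoch,
    # so the size shrinks by exp(-g * t) going into the past.
    return N_end * math.exp(-g * t)

N_start, N_end, t_start = 1.2e4, 1e5, 1e4
g = growth_rate(N_start, N_end, t_start)

# The implied rate recovers N_start at the beginning of the epoch.
recovered = size_at(t_start, N_end, g)
print(f"g = {g:.3e}, size at t_start = {recovered:.1f}")

# Sensitivity: a small shift in g changes the starting size substantially.
shifted = size_at(t_start, N_end, g + 1e-4)
print(f"with g + 1e-4, size at t_start = {shifted:.1f}")
```

Here a shift of 1e-4 in g over 1e4 time units rescales the starting size by a factor of e, which illustrates why bounding the two sizes directly can behave better than bounding g.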

Some care is needed: momi internally uses autograd (https://github.com/HIPS/autograd) to compute gradients, so the function must be differentiable by autograd. Lambda functions that do simple arithmetic (as above) are fine; however, more complex functions with variable assignments, or that call external methods, need some care, otherwise gradients will be wrong. It is highly recommended to read the autograd tutorial (https://github.com/HIPS/autograd/blob/master/docs/tutorial.md) before using this feature.

CHAPTER 5

API Documentation

5.1 Demographic models

class momi.DemographicModel(N_e, gen_time=1, muts_per_gen=None)

Object for representing and inferring a demographic history.

Parameters

• N_e (float) – the population size, unless manually changed by DemographicModel.set_size()

• gen_time (float) – units of time per generation. For example, if you wish to specify time in years, and a generation is 29 years, set this to 29. Default value is 1.

• muts_per_gen (float,None) – mutation rate per base per generation. If unknown, set to None (the default). Can be changed with DemographicModel.set_mut_rate()

add_growth_param(name, g0=None, lower=-0.001, upper=0.001, rgen=None)

Add growth rate parameter to the demographic model.

Parameters

• name (str) – Parameter name

• g0 (float) – Starting value. If None, randomly sample with rgen

• lower (float) – Lower bound

• upper (float) – Upper bound

• rgen (function) – Function to sample random value. If None, use uniform distribution

add_leaf(pop_name, t=0, N=None, g=None)

Add a sampled leaf population to the model.

The arguments t, N, g can be floats or parameter names (strings).

If N or g are specified, then the size or growth rate are also set at the sampling time t. Otherwise they remain at their previous (default) values.






Note that this does not affect the population size and growth below time t, which may be an issue if lineages are moved in from other populations below t. If you need to set the size or growth below t, use DemographicModel.set_size().

Parameters

• pop_name (str) – Name of the population

• t (float,str) – Time the population was sampled

• N (float,str) – Population size

• g (float,str) – Population growth rate

add_parameter(name, start_value=None, rgen=None, scale_transform=None, unscale_transform=None, scaled_lower=None, scaled_upper=None)

Low-level method to add a new parameter. Most users should instead call DemographicModel.add_size_param(), DemographicModel.add_pulse_param(), DemographicModel.add_time_param(), or DemographicModel.add_growth_param().

In order for internal gradients to be correct, scale_transform and unscale_transform should be constructed using autograd.

Parameters

• name (str) – Name of the parameter.

• start_value (float) – Starting value. If None, use rgen to sample a random starting value.

• rgen (function) – Function to get a random starting value.

• scale_transform (function) – Function for internally transforming and rescaling the parameter during optimization.

• unscale_transform (function) – Inverse function of scale_transform

• scaled_lower (float) – Lower bound after scaling by scale_transform

• scaled_upper (float) – Upper bound after scaling by scale_transform
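As a rough illustration of what a matched scale_transform / unscale_transform pair looks like, the sketch below uses a log transform. The function names and the bound values are hypothetical. The plain math module is used only to keep the snippet self-contained; as noted above, functions actually passed to add_parameter() should be built with autograd so that gradients are correct.

```python
import math

def log_scale(x):
    """Hypothetical scale_transform: map a positive parameter to log space."""
    return math.log(x)

def log_unscale(y):
    """Hypothetical unscale_transform: the inverse of log_scale."""
    return math.exp(y)

# Bounds are expressed *after* scaling, so natural-scale bounds of 1 and 1e6
# become log(1) and log(1e6).
scaled_lower = log_scale(1.0)
scaled_upper = log_scale(1e6)

# A starting value on the natural scale survives a round trip through both functions.
start_value = 1e4
round_trip = log_unscale(log_scale(start_value))
```

The point of the pairing is that the optimizer works on the scaled value while the model sees the unscaled one, so the two functions must be exact inverses of each other.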

add_pulse_param(name, p0=None, lower=0.0, upper=1.0, rgen=None)

Add a pulse parameter to the demographic model.

Parameters

• name (str) – Parameter name.

• p0 (float) – Starting value. If None, randomly sample with rgen



• rgen (function) – Function to sample random value. If None, use a uniform distribution.

add_size_param(name, N0=None, lower=1, upper=10000000000.0, rgen=None)

Add a size parameter to the demographic model.

Parameters


• N0 (float) – Starting value. If None, use rgen to randomly sample


• rgen (function) – Function to sample a random starting value. If None, use a truncated exponential with rate 1 / N_e.

add_time_param(name, t0=None, lower=0.0, upper=None, lower_constraints=[], upper_constraints=[], rgen=None)

Add a time parameter to the demographic model.

Parameters


• t0 (float) – Starting value. If None, use rgen to randomly sample



• lower_constraints (list) – List of parameter names that are lower bounds

• upper_constraints (list) – List of parameter names that are upper bounds

• rgen – Function to sample a random starting value. If None, use a truncated exponential with rate 1 / (N_e * gen_time), constrained to satisfy the bounds and constraints.
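One way to draw from an exponential distribution conditioned to lie between a lower and an upper bound is inverse-CDF sampling, sketched below. This is not momi's own implementation; the helper name, the values of N_e and gen_time, and the finite upper bound are all assumptions made for illustration.

```python
import math
import random

def truncated_exponential(rate, lower, upper, rng):
    """Draw from an exponential with the given rate, conditioned on [lower, upper]."""
    u = rng.random()                                # uniform on [0, 1)
    mass = 1.0 - math.exp(-rate * (upper - lower))  # probability of the window
    # Invert the truncated CDF; by memorylessness the window can start at `lower`.
    return lower - math.log(1.0 - u * mass) / rate

N_e = 1e4       # hypothetical reference population size
gen_time = 25   # hypothetical units of time per generation
rate = 1.0 / (N_e * gen_time)

rng = random.Random(1)
samples = [truncated_exponential(rate, 0.0, 1e5, rng) for _ in range(1000)]
```

A custom rgen would presumably wrap a sampler like this one; how momi passes arguments to rgen is not specified in this section.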

fit_within_pop_diversity()

Estimates mutation rate using within-population nucleotide diversity.

The within-population nucleotide diversity is the average number of hets per individual in the population, assuming it is at Hardy-Weinberg equilibrium.

This returns an estimated mutation rate for each (ascertained) population. Note these are non-independent estimates of the same value. It also returns standard deviations from the jackknife.

If DemographicModel.muts_per_gen is set, Z-scores of the residuals are also returned.

Return type pandas.DataFrame

get_params(scaled=False)

Return an ordered dictionary with the current parameter values.

If scaled=True, returns the parameters scaled in the internal representation used by momi during optimization (see also: DemographicModel.add_parameter()).

kl_div()

The KL-divergence at the current parameter values

log_likelihood()

The log likelihood at the current parameter values

move_lineages(pop_from, pop_to, t, p=1, N=None, g=None)

Move each lineage in pop_from to pop_to at time t with probability p.

The arguments t, p, N, g can be floats or parameter names (strings).

If N or g is specified, then the size or growth rate of pop_to is also set at this time; otherwise these parameters remain at their previous values.

Parameters

• pop_from (str) – Population lineages are moved from (backwards in time)

• pop_to (str) – Population lineages are moved to (backwards in time)

• t (float,str) – Time of the event

• p (float,str) – Probability that lineage in pop_from moves to pop_to

• N (float,str) – Population size of pop_to

• g (float,str) – Growth rate of pop_to

optimize(method='tnc', jac=True, hess=False, hessp=False, printfreq=1, **kwargs)

Search for the maximum likelihood value of the parameters.

This is a wrapper around scipy.optimize.minimize(), and arguments for that function can be passed in via **kwargs. Note the following arguments are constructed by momi and cannot be passed in via **kwargs: fun, x0, jac, hess, hessp, bounds.

Parameters

• method (str) – Optimization method. Default is “tnc”. For large models “L-BFGS-B” is recommended. See scipy.optimize.minimize().

• jac (bool) – Whether or not to provide the gradient (computed via autograd) to the optimizer. If False, optimizers requiring gradients will typically approximate it via finite differences.

• hess (bool) – Whether or not to provide the hessian (computed via autograd) to the optimizer.

• hessp (bool) – Whether or not to provide the hessian-vector-product (via autograd) to the optimizer.

• printfreq (int) – Log current progress via logging.info() every printfreq iterations.

Return type scipy.optimize.OptimizeResult

set_data(sfs, length=None, mem_chunk_size=1000, non_ascertained_pops=None, use_pairwise_diffs=True)

Set dataset for the model.

Parameters

• sfs (Sfs) – Observed SFS

• length (float) – Length of data in bases. Overrides sfs.length if set. Required if DemographicModel.muts_per_gen is set and sfs.length is not.

• mem_chunk_size – Controls memory usage by computing likelihood in chunks of SNPs. If -1 then no chunking is done.

• non_ascertained_pops – Don’t ascertain SNPs within these populations. That is, ignore all SNPs that are not polymorphic on the other populations. The SFS is adjusted to represent probabilities conditional on this ascertainment scheme.

• use_pairwise_diffs – Only has an effect if DemographicModel.muts_per_gen is set. If False, assumes the total number of mutations is Poisson. If True, models the within-population nucleotide diversity (i.e. the average number of heterozygotes per population) as independent Poissons. If there is missing data this is required to be True.

set_mut_rate(muts_per_gen)

Set the mutation rate.

Parameters muts_per_gen (float,None) – Mutation rate per base per generation. If unknown, set to None.

set_params(new_params=None, randomize=False, scaled=False)

Set the current parameter values

Parameters

• new_params (dict/list) – dict mapping parameter names to new values, or list of new values whose length is the current number of parameters

• randomize (bool) – if True, parameters not in new_params get randomly sampled new values

• scaled (bool) – if True, values in new_params have been pre-scaled according to the internal representation used by momi during optimization (see also: DemographicModel.add_parameter())

set_size(pop_name, t, N=None, g=0)

Set population size and/or growth rate at time t.

The arguments t, N, g can be floats or parameter names (strings).

If N is not specified then only the growth rate is changed.

If N is specified and g is not then the growth rate is reset to 0. Currently it is not possible to change the size without also setting the growth rate.

Parameters

• pop_name (str) – Population name

• t (float,str) – Time of event

• N (float,str) – Population size

• g (float,str) – Growth rate

simulate_data(length, recoms_per_gen, num_replicates, muts_per_gen=None, sampled_n_dict=None, **kwargs)

Simulate data, using msprime as backend

Parameters

• length (int) – Length of each locus in bases

• recoms_per_gen (float) – Recombination rate per generation per base

• muts_per_gen (float) – Mutation rate per generation per base

• num_replicates (int) – Number of loci to simulate

• sampled_n_dict (dict) – Number of haploids per population. If None, use sample sizes from the current dataset as set by DemographicModel.set_data()

Returns Dataset of SNP allele counts

Return type SnpAlleleCounts

simulate_vcf(out_prefix, recoms_per_gen, length, muts_per_gen=None, chrom_name='1', ploidy=1, random_seed=None, sampled_n_dict=None, **kwargs)

Simulate a chromosome using msprime and write it to VCF

Parameters

• outfile (str,file) – Output VCF file. If a string ending in “.gz”, gzip it.

• muts_per_gen (float) – Mutation rate per generation per base

• recoms_per_gen (float) – Recombination rate per generation per base

• length (int) – Length of chromosome in bases

• chrom_name (str) – Name of chromosome

• ploidy (int) – Ploidy

• random_seed (int) – Random seed

• sampled_n_dict (dict) – Number of haploids per population. If None, use sample sizes from the current dataset as set by DemographicModel.set_data()

stochastic_optimize(num_iters, n_minibatches=None, snps_per_minibatch=None, rgen=None, printfreq=1, start_from_checkpoint=None, save_to_checkpoint=None, svrg_epoch=-1, **kwargs)

Use stochastic optimization (ADAM+SVRG) to search for MLE

Exactly one of n_minibatches and snps_per_minibatch should be set, as one determines the other.

Parameters

• num_iters (int) – Number of steps

• n_minibatches (int) – Number of minibatches

• snps_per_minibatch (int) – Number of SNPs per minibatch

• rgen (numpy.RandomState) – Random generator

• printfreq (int) – How often to log progress

• start_from_checkpoint (str) – Name of checkpoint file to start from

• save_to_checkpoint (str) – Name of checkpoint file to save to

• svrg_epoch (int) – How often to compute full likelihood for SVRG. -1=never.

Return type scipy.optimize.OptimizeResult

5.2 Plotting

class momi.DemographyPlot(model, pop_x_positions, ax=None, figsize=None, linthreshy=None, minor_yticks=None, major_yticks=None, color_map='cool', pulse_color_bounds=(0, 1), draw=True)

Object for plotting a demography.

After creating this object, call DemographyPlot.draw() to draw the demography.

Parameters

• model (DemographicModel) – model to plot

• pop_x_positions (list,dict) – list ordering populations along the x-axis, or dict mapping population names to x-axis positions

• ax (matplotlib.axes.Axes) – Canvas to draw the figure on. Defaults to matplotlib.pyplot.gca()

• figsize (tuple) – If non-None, calls matplotlib.pyplot.figure() to create a new figure with this width and height. Ignored if ax is non-None.

• linthreshy (float) – Scale y-axis linearly below this value and logarithmically above it. (Default: use linear scaling only)

• minor_yticks (list) – Minor ticks on y axis

• major_yticks (list) – Major ticks on y axis

• color_map (str,matplotlib.colors.Colormap) – Colormap mapping pulse probabilities to colors

• pulse_color_bounds (tuple) – pair of (lower, upper) bounds for mapping pulse probabilities to colors

add_bootstrap(params, alpha, rad=-0.1, rand_rad=True)

Add an inferred bootstrap demography to the plot.

Parameters

• params (dict) – Inferred bootstrap parameters

• alpha (float) – Transparency

• rad (float) – Arc of pulse arrows in radians

• rand_rad (bool) – Add random jitter to the arc of the pulse arrows

draw(alpha=1.0, tree_color='C0', draw_frame=True, rad=0, pulse_label=True)

Draw the demography.

This method draws the demography by calling DemographyPlot.draw_tree(), DemographyPlot.draw_leafs(), DemographyPlot.draw_pulse(), and DemographyPlot.draw_frame(). Call those methods directly for finer control.

Parameters

• alpha (float) – Level of transparency (0=transparent, 1=opaque)

• tree_color (str) – Color of tree

• draw_frame (bool) – If True, call DemographyPlot.draw_frame() to draw tickmarks and legends

• rad (float) – Arc of pulse arrows in radians.

• pulse_label (bool) – If True, label each pulse with its strength

draw_N_legend(N_legend_values=None, title='N', **kwargs)

Draw legend of population sizes.

**kwargs are passed on to matplotlib.axes.Axes.legend().

Parameters

• N_legend_values – Values of N to plot.

• title – Title of legend

Return type matplotlib.legend.Legend

draw_frame(pops=None, rename_pops=None, rotation=-30)

Draw tickmarks, legend, and colorbar.

Parameters

• pops (list) – Populations to label on x-axis. If None, add all populations

• rename_pops (dict) – Dict to give new names to populations on x-axis.

• rotation (float) – Degrees to rotate x-axis population labels by.

draw_leafs(leafs=None, facecolors='none', edgecolors='black', s=100, zorder=2, **kwargs)

Draw symbols for the leaf populations.

Parameters leafs (list) – Leaf populations to draw symbols for. If None, add all leafs.

Extra parameters facecolors, edgecolors, s, zorder, **kwargs are passed to matplotlib.axes.Axes.scatter().

draw_pulse(pop1, pop2, t, p, rad=-0.1, alpha=1.0, pulse_label=True, adj_label_x=0, adj_label_y=0)

Draw a pulse.

Use DemographyPlot.iter_pulses() to iterate over the pulses, which can then be plotted with this method.

Parameters

• pop1 (str) – Population the arrow is pointing into

• pop2 (str) – Population the arrow is pointing away from

• t (float) – Time of the pulse

• p (float) – Strength of the pulse

• rad (float) – Arc of the pulse arrow in radians.


• pulse_label (bool) – Add a label for pulse strength?

• adj_label_x (float) – Adjust pulse label x position

• adj_label_y (float) – Adjust pulse label y position

draw_tree(tree_color='C0', alpha=1.0)

Draw the demographic tree (without pulse migrations)

Parameters

• tree_color (str) – Color of the tree


iter_pulses()

Iterate over the pulse arrows, to pass to DemographyPlot.draw_pulse()

Returns iterator over tuples (pop1, pop2, t, p)

5.3 Data

momi.snp_allele_counts(chrom_ids, positions, populations, ancestral_counts, derived_counts, length=None, use_folded_sfs=False)

Create a SnpAlleleCounts object.

Parameters

• chrom_ids (iterator) – the CHROM at each SNP

• positions (iterator) – the POS at each SNP

• populations (list) – the population names

• ancestral_counts (iterator) – iterator over tuples of the ancestral counts at each SNP. Tuple length should be the same as the number of populations

• derived_counts (iterator) – iterator over tuples of the derived counts at each SNP. Tuple length should be the same as the number of populations

• use_folded_sfs (bool) – whether the folded SFS should be used when computing likelihoods. Set this to True if there is uncertainty about the ancestral allele.

class momi.SnpAlleleCounts

The allele counts for a list of SNPs.

To create a SnpAlleleCounts object, use SnpAlleleCounts.read_vcf(), SnpAlleleCounts.load(), or snp_allele_counts(). Do NOT call the class constructor directly; it is for internal use only.

classmethod concatenate(to_concatenate)

Combine a list of SnpAlleleCounts into a single object.

Parameters to_concatenate (iterator) – Iterator over SnpAlleleCounts

dump(f)

Write data in JSON format.

Parameters f (str,file) – filename or file object. If a filename, the resulting file is gzipped.

extract_sfs(n_blocks)

Extracts SFS from data.

Parameters n_blocks (int) – Number of blocks to split SFS into, for jackknifing and bootstrapping

Return type Sfs

classmethod load(f)

Load SnpAlleleCounts created from SnpAlleleCounts.dump() or python -m momi.read_vcf ...

Parameters f (str,file) – file object or file name to read in


classmethod read_vcf(vcf_file, ind2pop, bed_file=None, ancestral_alleles=True, info_aa_field='AA')

Read in a VCF file and return the allele counts at biallelic SNPs.

Parameters

• vcf_file (str) – VCF file to read in. “-” reads from stdin.

• ind2pop (dict) – Maps individual samples to populations.

• bed_file (str,None) – BED accessibility regions file. Only regions in the BED file are read from the VCF. The BED file is also used to determine the size of the data in bases, so the same BED file should NOT be used across multiple VCFs (otherwise regions will be double counted towards the data length). If no BED is provided, all SNPs in the VCF are read, and the length of the data is set to be unknown.

• ancestral_alleles (bool,str) – If True, use the AA INFO field to determine ancestral allele, skipping SNPs missing this field. If False, ignore ancestral allele information, and set the SnpAlleleCounts.use_folded_sfs property so that the folded SFS is used by default when computing likelihoods. Finally, if ancestral_alleles is a string that is the name of a population in ind2pop, then treat that population as an outgroup, using its consensus allele as the ancestral allele; SNPs without consensus are skipped.

• info_aa_field (str) – The INFO field to read Ancestral Allele from. Default is “AA”. Only has effect if ancestral_alleles=True.

momi.site_freq_spectrum(sampled_pops, freqs_by_locus, length=None)

Create Sfs from a list of dicts of counts.

freqs_by_locus[l][config] gives the frequency of config on locus l.

A config is a (n_pops,2)-array, where config[p][a] is the count of allele a in population p. a=0 represents the ancestral allele while a=1 represents the derived allele.

Parameters

• sampled_pops (list) – list of population names

• freqs_by_locus (list) – list of dict

• length (float,None) – Number of bases in the dataset. Set to None if unknown.

Return type Sfs
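The layout of freqs_by_locus described above can be sketched with plain Python containers. Representing each config as a nested tuple, so that it can serve as a dict key, is an assumption made for this illustration, as are the population names and counts.

```python
sampled_pops = ["popA", "popB"]  # hypothetical population names

# config[p] = (ancestral count, derived count) in population p,
# with populations ordered as in sampled_pops.
config_1 = ((3, 1), (4, 0))  # one derived copy, seen in popA only
config_2 = ((2, 2), (1, 3))  # derived allele present in both populations

# One dict per locus, mapping each observed config to its count at that locus.
freqs_by_locus = [
    {config_1: 5, config_2: 2},  # locus 0
    {config_1: 1},               # locus 1
]

# Total count of each config across all loci.
totals = {}
for locus in freqs_by_locus:
    for config, count in locus.items():
        totals[config] = totals.get(config, 0) + count

# config[p][a]: the derived (a=1) count in popB (p=1) for config_2.
derived_in_popB = config_2[1][1]
```

With this layout, freqs_by_locus[0][config_1] reads as "config_1 was observed 5 times on locus 0".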

class momi.Sfs

Class for representing observed site frequency spectrum data.

To create a Sfs object, use SnpAlleleCounts.extract_sfs(), Sfs.load(), or site_freq_spectrum(). Do NOT call the class constructor directly; it is for internal use only.

config_array

Array of unique configs.

This is a 3-d array with shape (n_configs, n_pops, 2). config_array[i, p, a] gives the count of allele a in population p in configuration i. Allele a=0 is ancestral, a=1 is derived. Sfs.sampled_pops gives the ordering of the populations.

Returns array of configs

Return type numpy.ndarray

dump(f)

Write Sfs to file

Parameters f (str,file) – Filename or object. If name ends with “.gz”, gzip it.

fold()

Returns A copy of the SFS, but with folded entries.

Return type Sfs
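Folding merges each config with its counterpart in which the ancestral and derived labels are swapped, since the two cannot be told apart when the ancestral allele is unknown. The sketch below shows the idea on a dict-based SFS. The choice of canonical representative (the smaller of the two tuples) and the helper names are assumptions; momi's internal convention may differ.

```python
def fold_config(config):
    """Return a canonical representative of a config and its label-swapped twin."""
    swapped = tuple((derived, ancestral) for ancestral, derived in config)
    return min(config, swapped)

def fold(sfs):
    """Merge the counts of configs that differ only by swapping allele labels."""
    folded = {}
    for config, count in sfs.items():
        key = fold_config(config)
        folded[key] = folded.get(key, 0) + count
    return folded

# Hypothetical unfolded SFS: config -> count, with config[p] = (ancestral, derived).
sfs = {
    ((3, 1), (4, 0)): 5,
    ((1, 3), (0, 4)): 2,  # label-swapped twin of the entry above
    ((2, 2), (1, 3)): 1,
}
folded = fold(sfs)
```

The total count is unchanged by folding; only the number of distinct entries shrinks.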

freqs_matrix

Sparse matrix representing the frequencies at each locus.

The (i,j)-th entry gives the frequency of the i-th config in Sfs.config_array at locus j

Return type scipy.sparse.csr_matrix

length

Length of data in bases. None if unknown.

Return type float

classmethod load(f)

Load Sfs from file created by Sfs.dump() or python -m momi.extract_sfs

Parameters f (str,file) – file object or file name to read in

Return type Sfs

n_loci

The number of loci in the dataset

Return type int

n_snps(vector=False)

Number of SNPs, either per-locus or overall.

Parameters vector (bool) – Whether to return a single total count, or an array of counts per-locus

p_missing

Estimate of probability that a random allele from each population is missing.

Missingness is estimated as follows: from each SNP remove a random allele; if the resulting config is monomorphic, then ignore. If polymorphic, then count whether the removed allele is missing or not.

This avoids bias from the fact that we don’t observe some polymorphic configs that appear monomorphic after removing the missing alleles.

Returns 1-d array of missingness per population


resample()

Create a new SFS by resampling blocks with replacement.

Note the resampled SFS is assumed to have the same length in base pairs as the original SFS, which may be a poor assumption if the blocks are not of equal length.

Returns Resampled SFS

Return type Sfs
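Resampling blocks with replacement is the standard block bootstrap. The sketch below shows the idea on a plain list of per-block values. It does not reproduce momi's internals, and the helper name and block values are hypothetical.

```python
import random

def resample_blocks(blocks, rng):
    """Draw len(blocks) blocks uniformly at random, with replacement."""
    n = len(blocks)
    return [blocks[rng.randrange(n)] for _ in range(n)]

# Hypothetical per-block SNP counts for a single config.
blocks = [12, 7, 9, 15, 11]

rng = random.Random(0)
resampled = resample_blocks(blocks, rng)
```

The resampled list has the same number of blocks as the original, but some blocks may appear several times and others not at all, which is why the total length in bases is only approximately preserved when blocks differ in size.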

sampled_n

Number of samples per population

sampled_n[i] is the number of haploids in the i-th population of Sfs.sampled_pops

Returns 1-d array of integers


sampled_pops

Names of sampled populations

Returns 1-d array of strings


subset_populations(populations, non_ascertained_pops=None)

Returns lower-dimensional SFS restricted to a subset of populations.

Parameters

• populations (list) – list of populations to subset to

• non_ascertained_pops (list) – list of populations to treat as non-ascertained (SNPs must be polymorphic with respect to the ascertained populations)

Returns Lower-dimensional SFS restricted to a subset of populations.

Return type Sfs

5.4 Statistics

class momi.SfsModelFitStats(demo_model, sampled_n_dict=None)

Class to compare expected vs. observed statistics of the SFS.

All methods return JackknifeGoodnessFitStat unless otherwise stated.

Currently, all goodness-of-fit statistics are based on the multinomial SFS (i.e., the SFS normalized to be a probability distribution summing to 1). Thus the mutation rate has no effect on these statistics.

See Patterson et al. 2012 (http://www.genetics.org/content/192/3/1065) for definitions of f2, f3, f4 (abba-baba), and D statistics.

Note this class does NOT get updated when the underlying demo_model changes; a new SfsModelFitStats needs to be created to reflect any changes in the demography.

Parameters

• demo_model (momi.DemographicModel) – Demography to compute expected statistics under.

• sampled_n_dict (dict) – The number of samples to use per population. SNPs with fewer than this number of samples are ignored. The default is to use the full sample size of the data, i.e. to remove all SNPs with any missing data. For datasets with large amounts of missing data, this could potentially lead to most or all SNPs being removed, so it is important to specify a smaller sample size in such cases.

abba_baba(A, B, C, D=None)

Same as f4()

all_pairs_ibs(fig=True)

Fit the IBS fraction for all pairs of populations, and optionally plot it.

Parameters fig (bool) – whether to plot it

Return type pandas.DataFrame

f2(A, B)

Computes f2 statistic (A-B)*(A-B)

Parameters

• A (str) – First population

• B (str) – Second population

f3(A, B, O)

Computes f3 statistic (O-A)*(O-B)

Parameters



• O (str) – Third population.
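The products (A-B)*(A-B) and (O-A)*(O-B) in the f2 and f3 definitions above are products of allele frequency differences, averaged over SNPs. The sketch below computes them from hypothetical population allele frequencies. It ignores finite-sample corrections and is not how momi derives the statistics from the SFS.

```python
def f2(freq_a, freq_b):
    """Mean of (a - b)^2 over SNPs."""
    return sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)) / len(freq_a)

def f3(freq_a, freq_b, freq_o):
    """Mean of (o - a) * (o - b) over SNPs."""
    pairs = zip(freq_a, freq_b, freq_o)
    return sum((o - a) * (o - b) for a, b, o in pairs) / len(freq_a)

# Hypothetical derived allele frequencies at three SNPs.
A = [0.1, 0.5, 0.9]
B = [0.3, 0.5, 0.7]
O = [0.2, 0.6, 0.8]

f2_AB = f2(A, B)      # (0.04 + 0 + 0.04) / 3
f3_ABO = f3(A, B, O)  # (-0.01 + 0.01 - 0.01) / 3
```

Note that f3(A, A, O) reduces to f2(A, O), which is a quick consistency check on the two definitions.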

f4(A, B, C, D=None)

Returns the ABBA-BABA (f4) statistic for testing admixture.

Parameters



• C (str) – Third population


• D (str) – Fourth population. If None, use ancestral allele.

f_st(A, B)

Returns (pi_between - pi_within) / pi_between, where pi_between, pi_within represent the average number of pairwise diffs between 2 individuals sampled from different or the same population, respectively.
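The F_ST formula above is simple arithmetic once the two diversities are known. A minimal sketch, with hypothetical diversity values:

```python
def f_st(pi_between, pi_within):
    """(pi_between - pi_within) / pi_between, as defined above."""
    return (pi_between - pi_within) / pi_between

# Hypothetical average pairwise differences per base.
pi_between = 0.0012  # two individuals from different populations
pi_within = 0.0009   # two individuals from the same population

value = f_st(pi_between, pi_within)  # approximately 0.25
```

When the two diversities are equal the statistic is 0, meaning the populations are no more different from each other than individuals within a population are.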

log_abba_baba(A, B, C, D=None)

Returns log(BABA/ABBA) = log(BABA)-log(ABBA)

pattersons_d(A, B, C, D=None)

Returns Patterson’s D, defined as (BABA-ABBA)/(BABA+ABBA).

Parameters



• C (str) – Third population

• D (str) – Fourth population. If None, use ancestral allele.
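Both Patterson's D and the log ratio returned by log_abba_baba() are functions of the same two quantities, the ABBA and BABA site-pattern frequencies. The sketch below evaluates the two formulas on hypothetical pattern counts; the function names mirror the methods above but are standalone helpers, not momi calls.

```python
import math

def pattersons_d(abba, baba):
    """(BABA - ABBA) / (BABA + ABBA)."""
    return (baba - abba) / (baba + abba)

def log_abba_baba(abba, baba):
    """log(BABA / ABBA), computed as a difference of logs."""
    return math.log(baba) - math.log(abba)

# Hypothetical site-pattern counts.
abba, baba = 120, 80

d = pattersons_d(abba, baba)         # -40 / 200
log_ratio = log_abba_baba(abba, baba)
```

The two statistics always share a sign, and both are 0 when the ABBA and BABA patterns are equally frequent.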

tensor_prod(derived_weights_dict)

Compute rank-1 tensor products of the SFS, which can be used to express a wide range of SFS-based statistics.

More specifically, this computes the sum

$$\sum_{i,j,\ldots} \mathrm{SFS}_{i,j,\ldots}\, w^{(1)}_i\, w^{(2)}_j \cdots$$

where $w^{(1)}_i$ is the weight corresponding to SFS entries with i derived alleles in population 1, etc. Note the SFS is normalized to sum to 1 here (it is a probability).

Parameters derived_weights_dict (dict) – Maps leaf populations to vectors (numpy.ndarray). If a population has n samples then the corresponding vector w should have length n+1, with w[i] being the weight for SFS entries with i copies of the derived allele in the population.

Return type JackknifeGoodnessFitStat
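The weighted sum computed by tensor_prod() can be sketched in a few lines of plain Python. Here the SFS is a hypothetical dict keyed by derived allele counts per population, and the standalone function below only illustrates the formula; it is not momi's implementation and returns a bare float rather than a JackknifeGoodnessFitStat.

```python
def tensor_prod(sfs, pops, derived_weights):
    """Sum over SFS entries of (normalized entry) * product of per-population weights."""
    total = sum(sfs.values())
    result = 0.0
    for derived_counts, count in sfs.items():
        term = count / total  # normalize the SFS to sum to 1
        for pop, i in zip(pops, derived_counts):
            term *= derived_weights[pop][i]  # weight for i derived copies in pop
        result += term
    return result

pops = ["popA", "popB"]  # hypothetical populations, 2 samples each
sfs = {(1, 0): 3, (0, 1): 2, (1, 1): 1, (2, 1): 2, (1, 2): 2}

# Weight vectors of length n + 1; w[i] = i / n gives the derived allele frequency.
weights = {"popA": [0.0, 0.5, 1.0], "popB": [0.0, 0.5, 1.0]}

value = tensor_prod(sfs, pops, weights)  # 0.025 + 0.1 + 0.1
```

With these weights the result is the average product of the two sample derived allele frequencies; other choices of weight vectors yield other SFS-based statistics.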

class momi.JackknifeGoodnessFitStat(expected, observed, jackknifed_array)

Object returned by methods of SfsModelFitStats.

Basic arithmetic operations are supported, allowing complex statistics to be built up out of simpler ones.

The raw expected, observed, and jackknifed_array values can be accessed as attributes of this class.

Parameters

• expected (float) – the expected value of the statistic

• observed (float) – the observed value of the statistic

• jackknifed_array (numpy.ndarray) – array of the jackknifed values of the statistic.

sd

Standard deviation of the statistic, estimated via jackknife

z_score

Z-score of the statistic, defined as (observed-expected)/sd
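A minimal sketch of how sd and z_score relate to the jackknifed values, assuming the standard delete-one jackknife estimator with equally weighted blocks. The exact estimator momi uses is not stated here, so the scaling below is an assumption; the input values are hypothetical.

```python
import math

def jackknife_sd(jackknifed):
    """Delete-one jackknife standard deviation: sqrt((n-1)/n * sum((x_i - mean)^2))."""
    n = len(jackknifed)
    mean = sum(jackknifed) / n
    return math.sqrt((n - 1) / n * sum((x - mean) ** 2 for x in jackknifed))

def z_score(expected, observed, jackknifed):
    """(observed - expected) / sd, as defined above."""
    return (observed - expected) / jackknife_sd(jackknifed)

# Hypothetical values of a statistic recomputed with each block left out in turn.
jackknifed = [0.10, 0.12, 0.11, 0.09, 0.13]

sd = jackknife_sd(jackknifed)
z = z_score(expected=0.08, observed=0.11, jackknifed=jackknifed)
```

A large absolute Z-score suggests the model's expected value of the statistic is inconsistent with the observed data.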

