in copyright - non-commercial use permitted rights ...5197/eth... · phylogenetic analysis of hiv...

44
Research Collection Master Thesis Phylogenetic analysis of HIV samples from a single host Author(s): Vyas, Rounak Publication Date: 2011 Permanent Link: https://doi.org/10.3929/ethz-a-006936751 Rights / License: In Copyright - Non-Commercial Use Permitted This page was generated automatically upon download from the ETH Zurich Research Collection . For more information please consult the Terms of use . ETH Library

Upload: others

Post on 17-Jun-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Research Collection

Master Thesis

Phylogenetic analysis of HIV samples from a single host

Author(s): Vyas, Rounak

Publication Date: 2011

Permanent Link: https://doi.org/10.3929/ethz-a-006936751

Rights / License: In Copyright - Non-Commercial Use Permitted

This page was generated automatically upon download from the ETH Zurich Research Collection. For moreinformation please consult the Terms of use.

ETH Library

Page 2: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Phylogenetic Analysis of HIVSamples from a Single Host

Master Thesis

Rounak Vyas

November 20, 2011

Advisors: Prof. Niko Beerenwinkel, Dr. Osvaldo Zagordi

Computational Biology Group, ETH Zurich

Page 3: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Contents

Contents i

1 Introduction 11.1 AIDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 HIV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Longitudinal Studies . . . . . . . . . . . . . . . . . . . . . . . . 41.4 HIV Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.5 Recent Studies of HIV-1 . . . . . . . . . . . . . . . . . . . . . . 6

2 Materials and Methods 92.1 Patient History . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . 102.3 Entropy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 112.4 Recombination Analysis . . . . . . . . . . . . . . . . . . . . . . 122.5 Molecular Clock Estimation . . . . . . . . . . . . . . . . . . . . 132.6 Poisson Fitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162.7 Sliding MinPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.8 Rate of synonymous and nonsynonymous substitutions . . . 18

3 Results 203.1 Entropy Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 203.2 Molecular Clock Rate Estimation and Phylogenetic Tree Con-

struction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.3 Founder Virus Analysis . . . . . . . . . . . . . . . . . . . . . . 273.4 Demographic Reconstruction . . . . . . . . . . . . . . . . . . . 323.5 Sliding MinPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Bibliography 38

i

Page 4: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Chapter 1

Introduction

The earliest well documented incident of AIDS dates back to the 1980’s.Since then, more than 25 million people have died from AIDS [4]. TheWorld Health Organization has now declared AIDS a pandemic and sincereefforts are underway around the world to identify an effective cure for thiscondition.

1.1 AIDS

As the name suggests, Acquired Immunodeficiency Syndrome (AIDS) is acondition wherein the patient’s immune system becomes severely compro-mised, enabling opportunistic infections such as pneumonia, tuberculosis,herpes, and others. These infections ultimately result in the death of thepatient. Clinically, AIDS is described as an advanced stage in an infectioncaused by Human Immunodeficiency Virus (HIV) wherein the CD4+ cellcount drops below the critical level. CD4+ cells are a special class of whiteblood cells which play a major role in recognizing foreign antigens (like bac-teria and viruses) within the body [12]. In their absence, the immune systemis not able to recognize and clear these foreign agents leading to sustainedinfections.

HIV infection can only be contracted from another infected individual throughexchange of body fluids like blood, genital fluids and breast milk. Chanceexchange of these takes place when the individuals share needles of an in-jection or engage in unprotected sexual intercourse. The infection cannot beacquired by ingestion of the virus.

An individual may remain HIV positive for several years before becomingan AIDS patient. After the onset of AIDS, the patient’s life span shortens to 8to 12 months [19]. Current therapies are able to significantly increase the lifespan of the infected individuals by delaying the onset of AIDS. Most of these

1

Page 5: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.2. HIV

therapies act by interfering with one of the crucial steps in replication, entryor release of the virus from the infected cell. However, due to the unusuallyhigh rate of mutation in HIV, it is able to quickly develop resistance againstthese therapies and flourish again. Hence there is presently no cure for AIDS[29].

AIDS is not a disease that targets a specific organ, but a condition charac-terized by progressive immune failure leading to infections in several or-gans. To develop an effective therapy, it is imperative to gain insight onthe processes through which the virus evolves to establish a nonperishablepopulation within the host. Of particular interest are the evolutionary pat-tern the virus undergoes while subjected to selective pressures of the hostimmune system, and the development of viral drug resistance. A detailedunderstanding of these processes may offer insight into the development ofeffective treatment strategies.

1.2 HIV

The Human Immunodeficiency Virus belongs to the family of retroviruses[3], RNA viruses that use reverse transcriptase to encode their genetic ma-terial into DNA only within a host cell. It is known to cause AcquiredImmunodeficiency Syndrome [33]. There are two prominent types knownas HIV-1 and HIV-2, differing in their virulence, infectivity, and prevalence[15]. These have originated from the Simian Immunodeficiency Virus sub-types cpz and smm and infect chimpanzees and old world monkeys respec-tively. In this report we focus on HIV-1 as this is the virus type with whichboth patients were infected.

Figure 1.1: Diagram of Human Immunodeficiency Virus [1]

The virus has of two identical copies of the complete genome encoded ontwo positive single RNA strands and consists of nine genes encoding 19 vi-

2

Page 6: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.2. HIV

ral proteins. The viral core is composed of the Capsid Protein (CA, p24),Matrix protein (MA, p17) and P6. Following reverse transcription of the vi-ral genome by reverse transcriptase subsequent to host infection, the newlyproduced DNA is incorporated into the host genome by viral integrase [25].The pre-proteins encoded by the viral genome are converted to fully func-tioning HIV proteins by protease. Following host infection RNAse H breaksdown the retroviral genome.

The HIV genome codes for a series of proteins serving structural and reg-ulatory functions. Structural proteins include gp120 which lies outside thevirus particle and gp41 just inside the membrane, with gp41 serving as amembrane anchor for gp120. Tat (transactivator) is a regulatory gene thataccelerates the production of viral progenies and is known to be a crucialprotein for HIV replication. Rev stimulates the production of HIV proteinsbut suppresses the expression of other regulatory genes of HIV. Nef (neg-ative replication factor gene) encodes for proteins that are exposed to thecytoplasm of the host cell and are necessary for viral spread and diseaseprogression by down-regulating the CD4 count. Vif encodes the Viral Infec-tivity Factor found inside the virus and is responsible the rapid spread ofthe virus. Vpr (Viral protein R) accelerates the production of HIV proteinsand interferes with the host cell cycle thus inhibiting the cell division. Vpu(Viral protein U) helps in assembling new virus particles, budding out fromthe host cell, and accelerates the degradation of CD4 proteins.

Figure 1.2: HIV-1 genome HXB2 strain [17]

HIV Life Cycle

HIV enters lymphocytes by binding to the chemokine and CD4 receptorspresent on the cell surface [10, 39]. This binding is facilitated by viral gp160protein (gp120 and gp41 proteins) [10, 39]. Following binding, the viralenvelope fuses with the cell membrane and releases the HIV capsid into thecell. Besides CD4+ cells, HIV can also infect macrophages and dendriticcells [10, 39].

Once the viral capsid enters the cell, the viral RNA is reverse transcribedto a cDNA molecule [25]. This process is facilitated by the reverse transcrip-tase enzyme which is extremely error prone and also lacks the proof reading

3

Page 7: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.3. Longitudinal Studies

capacity leading to a misincorporation rate of 10−4 to 10−5 per base or ap-proximately one mis-incorporation per genome per replication cycle [37].The cDNA and its complement form a double stranded viral DNA whichis then transported into the cell nucleus where it is subsequently integratedinto the host genome with the help of integrase viral enzyme [25].

Once incorporated, viral DNA requires cellular transcription factors to en-code for viral proteins. During viral replication, the pro-viral DNA is tran-scribed into mRNA which is spliced and then transported to the cytoplasmwhere it is translated into viral proteins (mainly Tat and Rev Proteins). Revprotein accumulates in the nucleus and inhibits mRNA splicing and the un-spliced full length mRNA leave the nucleus to enter the cytoplasm [32]. Thefull length mRNA is actually the viral genome which binds to gag proteinand is packaged into new viral packets. After processing by the host en-doplasmic reticulum gp160 is transported to the plasma membrane wheregp41 anchors gp120 to the membrane of the infected host cell. The viralcapsid then assembles and buds out of the cell to infect other cells [25].

The high rate of mutation in HIV is due to the high error rate of the reversetranscriptase enzyme while transcribing the viral RNA genome into a DNAsequence that can be incorporated in the host cell genome for coding viralproteins [5]. Along with a high misincorporation rate of approximately onebase per replication, reverse transcriptase also lacks proof-reading activityrendering it unable to check and rectify copy mistakes, often resulting inseveral slightly different copies of HIV within a single patient. Distinct viralpopulations are referred to as ”quasispecies”, and each genetically distinctindividual is referred as a haplotype.

1.3 Longitudinal Studies

Due to the prohibitively expensive nature of prospective HIV screening, HIVstudies are generally only performed on high risk population groups, suchas within a prison. Combined with additional ethical considerations, theresult is that most studies enroll patients that are already symptomatic. Inalready symptomatic populations the viral load is already established andthus is of little insight into the dynamics of the pre-seroconversion phase, i.e.before any viral antibody production has taken place.

To develop insight into the temporal evolution of the virus within a host,longitudinal studies of patient populations are of critical importance. Longi-tudinal studies combine data collected at multiple examinations at intervalsbetween minutes and years to afford a more comprehensive insight into theviral dynamics than is possible through examinations at a single time pointalone.

4

Page 8: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.3. Longitudinal Studies

Figure 1.3: HIV Life Cycle [6]

While longitudinal studies are clearly desirable, they also present technicalchallenges such as censoring of events due to the relocation, death or dis-enrollment of cohort members. Additionally, changing patient habits and alifestyle choice can complicate analysis. Lastly, longitudinal studies are nec-essarily more involving and therefore more expensive than single time-pointstudies.

5

Page 9: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.4. HIV Sequencing

1.4 HIV Sequencing

Traditional Sanger-based sequencing method only read the consensus ge-nomic sequence of heterogeneous viral populations [35], obfuscating thegenomic variability present in the population which is of potential impor-tance for identifying gradually-fixating resistance mutations. Next Genera-tion Sequencing (NGS) technologies represent an improvement over Sangersequencing and facilitate the sequencing of distinct haplotypes within a sam-ple [30]. However, NGS reads are error-prone and require sophisticatedprocessing techniques to create error-free haplotype reconstruction and fre-quency estimation. Currently, haplotypes with frequencies as low as 0.05%can be estimated with 99% confidence [24].

1.5 Recent Studies of HIV-1

In 1999 R. Shankarappa et al. conducted a landmark study investigating theevolution of HIV within an infected individual prior to the onset of AIDS[22]. They studied the evolution of C2-V5, a high mutation region of theHIV-1 env gene, in nine patients over six to twelve years. They estimatedthe viral diversity within and between time points, identified mutations con-ferring the viral strain any fitness advantage, and characterized the existenceof three distinct phases: the early phase with linear increase in diversity anddivergence from the founder virus strain, an intermediate phase with linearincrease in divergence but stabilization or decline in diversity, and a latephase with stabilization in the divergence and continued decrease in diver-sity.

More recently, Poon et al, used longitudinal deep sequencing data with coa-lescent analysis to estimate the date of HIV infection [26]. Time of infectionwas estimated using the time to most recent common ancestor (TMRCA) ofa time-calibrated phylogenetic tree relating sequences from all time points.This is justified by the argument that most HIV infections are established bya single viral strain due to bottlenecks during transmission [38, 13]. 19 HIVpositive individuals were followed and 7 genomic regions were analyzed.The authors compared the estimated time since infection from experimen-tal methods to TMRCA estimates obtained with the BEAST software library.They observed a stronger correlation between clinical and computational es-timates for TMRCA in highly variable regions of HIV genome (such as env)relative to that in conserved regions such as pol. The reduced correlation inthe conserved regions is thought to be due to a possible overestimation oftime scales due to the increased sensitivity of the coalescent based methodstowards the sampled genetic variation. Consequently, sequences with highdivergence were found to be ideal for calibrating the evolutionary clock. Inthe case of a multiple founder virus infection, this method was found to

6

Page 10: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.5. Recent Studies of HIV-1

overestimate the infection time.

In the same month another interesting study was published by a differentauthor, Suzanne English, et al. [23]. This discussed the construction of thetransmission history of HIV-1 infected individuals using Phylogenetic meth-ods. It showed that the diversity is fairly limited in the early phase of theinfection and is even compatible with the transmission of a single viral vari-ant. It also provided evidence to support the idea that a single donor canin principle transmit two distinct variants to two different individuals in asmall time span of few hours. The transmission history was constructedusing the Bayesian and Maximum likelihood approach. Env, gag and pol re-gions were used for this analysis. The inter host genetic diversity predictionsproportionately varied depending on the extent of conservation observed inthe region used for its prediction. Highest diversity was predicted usingenv gene (least conserved) followed by pol gene and then gag gene. Trans-mission history was constructed using the inter-host variation observed inthese three regions. BEAST software was used for carrying out this analysis.

A temporal study on HIV-1 was undertaken by G. Achaz et al and pub-lished in 2004 [11]. This study was conducted using gag-pol sequence datacollected over time from two chronically infected individuals to estimate thepopulation structure and the neutral mutation rate of this region per site pergeneration. Neutral coalescent models were used for the analysis. For thegenealogy construction, coalescent approach identical to the one proposedby Felsenstein 1999 was used. 19 time points collected over a period of 4years were used for the analysis. This compensated for the low mutationrate in the sequences.

A longitudinal study to understand the viral evolution in early Acute Hep-atitis C Virus infection was carried out by Bull RA, Luciani F, McElroy K,Gaudieri S, Pham ST, et al. published in 2011 [9]. We discuss this paperin greater detail due to the parallelism with our study. This study aimedat identifying genetic variants as low as 0.1% frequency and subsequentlyquantify them over the course of infection. They also identified two sequen-tial bottlenecks that occurred early in infection. BEAST software was used toestimate the changes in the effective population size of the virus populationover time. It was also used to construct ancestor descendant relationshipswith the viral samples from different time points. The rate of evolution forthe virus was also estimated during this analysis. In depth nonsynonymousand synonymous substitution analysis was carried out to identify any vis-ible pattern of change. Entropy changes were measured across the wholegenome and across patients which indicated non-uniform evolution of HCVacross the genome and over time. Single founder virus hypothesis weretested for infection using the freely available tool called as Poisson Fitter onthe HIV database.

7

Page 11: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

1.5. Recent Studies of HIV-1

Several other time series data analysis on HIV positive patients have beencarried out for identifying/understanding the order in which resistance mu-tations are accumulated when the patient is placed under a drug therapy.However, we do not discuss these since our patients did not show any drugresistance even after the therapy was discontinued.

8

Page 12: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Chapter 2

Materials and Methods

2.1 Patient History

HIV samples were collected from two patients enrolled at the departmentof infectious diseases at Universitatspital Zurich.The protease coding regionof HIV was deep-sequenced and analyzed. This region was chosen for thestudy since both the patients were treated with a protease inhibitor drug.

Patient I.D.123

Figure 2.1: Viral load in patient 123.

This patient was a part of the Primary HIV Infection study which empha-sizes on beginning the treatment in the early phase of infection and thendiscontinuing it. It is based on the assumption that the patient is likely tocontrol the virus when the treatment is started early. However, most patients

9

Page 13: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.2. Data Pre-processing

Table 2.1: Sample Collection Time Points, Patient Id:123

Sr.No. Sample Name Sample Collection Date1 PR1 12.12.20032 PR 2 15.12.20053 PR 28 30.05.20064 PR 3 14.08.2007

suffer from a viral rebound, like patient 123. As can be seen in figure 2.1,four samples over a period of 3.74 years were sequenced from the patientafter being tested as HIV positive. These samples have been marked in red.The regions marked as ART in figure 2.1 show the periods of treatment withLopinavir, an anti-retroviral drug. The exact dates of sample collection canbe seen in table 2.1

Patient I.D.181

Figure 2.2: Viral load in patient 181

The patient remained untreated until almost an year after being tested HIVpositive. During this phase, the viral load in blood plasma was regularlymonitored. Samples from three time points marked with red in figure 2.2were deep-sequenced and analyzed. The exact dates of sample collectionhave been mentioned in table 2.2.

2.2 Data Pre-processing

Haplotype reconstruction and error correction was performed using ShoRAH[24]. The output file contained haplotype sequences in FASTA format. Theheader of each haplotype contained two numbers. First number showed our

10

Page 14: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.3. Entropy Analysis

Table 2.2: Sample Collection Time Points, Patient Id:181

Sr.No. Sample Name Sample Collection Date1 PR4 28.09.20052 PR 5 15.03.20063 PR 6 08.09.2006

confidence in the haplotype sequence on a scale of 0 to 1. The other num-ber could be used to calculate the frequency of the haplotype in the samplepopulation. It showed the number of times the sequences constituting thehaplotype were sequenced. It is known as the average read number of ahaplotype.

These files often contained over a hundred sequences with only a few hav-ing a high read count and confidence. For a meaningful analysis, these fileswere filtered to select sequences with a confidence of over 0.9. This reducedthe number of haplotypes to one quarter or less. The threshold was chosento optimize the number of sequences for analysis. Too few sequences wouldnot contain enough information for the analysis and too many would cer-tainly add noise to the result. This cutoff returned a reasonable number ofhaplotypes.

Since a functional protease is fundamental for HIV, any gaps present in thehaplotype sequences were assumed to be sequencing errors. The haplotypesfrom a single run were used to first construct a consensus sequence usingthe Biopython EMBOSS tool known as “Cons”. Any gaps present in the con-sensus sequence were filled using HXB2 protease reference sequence. Theconsensus sequence was then used to fill the gaps present in the haplotypesequences. The reading frame of every haplotype was also ensured to startat the first nucleotide position. A python script was written for performingall the above tasks.

2.3 Entropy Analysis

HIV constantly accumulates mutations to cope with the selective forces be-ing exerted by the immune system and drug treatments. The nucleotidesites that accumulate these mutations are mainly responsible for renderingthe virus resistant to different therapies. In order to improve our currentmethods, it would be fruitful to identify these sites and also have an insighton how these sites maintain diversity in the viral population. For this pur-pose we calculate entropy for every time point dataset and try to identifyany visible spatial or temporal patterns.

11

Page 15: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.4. Recombination Analysis

Entropy of a position in a sequence depicts the uncertainty associated withthe nucleotide present at the site. High entropy indicates that the site canhave variable nucleotides.

Let X be a discrete random variable ( bases while considering nucleotides,amino acids while considering proteins), taking a finite number of possiblevalues x1, x2, . . . , xn with probabilities p1, p2, . . . , pn such that pi ≥ 0, i =1, 2 . . . , n ∑n

i=1 pi = 1. The entropy is then given by

Hn(p1, p2, ..., pn) = −n

∑i=1

pilogb pi

Here b is the base of the logarithm. A simple python script was writtento calculate and plot the entropy at every position in the alignment, weused natural logarithm for our calculations. When deep-sequencing datawas submitted as an input, the script could take into account the averagenumber of reads while calculating the entropy at every position.

2.4 Recombination Analysis

Recombination plays a crucial part in the evolution of retroviruses and ismore prevalent in conserved regions [7]. Since we use a fairly conservedHIV region for our analysis, we performed a recombination detection study.If an alignment contains recombinant sequences then relationships betweendifferent segments of the alignment cannot be described using a single phy-logenetic tree. To unfold the true evolutionary relationships, it is imperativeto identify the recombination break points and partition the alignment intothe number of observed recombinant sets and then depict the evolutionaryrelationships in each of these partitions using a separate phylogenetic tree.If recombination events are not taken into account during a phylogeneticanalysis, then the results are most likely to be meaningless.

We used Recombination Identification program [36] which has been developedto specifically detect recombinants in HIV-1 nucleotide sequences. It acceptsa set of nucleotide sequences from a single viral genomic region collectedfrom a single patient as an input. The program requires a background se-quence which is essentially the consensus sequence of the genomic regionthat is to be analyzed. This can be selected from the available list in theprogram; alternatively the user is free to submit a consensus sequence withthe nucleotide data. In the latter case, the consensus should be aligned tothe rest of the sequences.

This program detects recombinants by sliding a window of pre-specifiedlength along the alignment and calculating the hamming distance of the

12

Page 16: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.5. Molecular Clock Estimation

query sequence from all other sequences. The best match within every win-dow is qualified and the confidence in each match is calculated using a z-test.If two neighboring windows on the same sequence have best matches withdifferent sequences then it is considered as a recombinant. The programimplicitly assumes each site to be evolving independently but according tothe same process. It also approximates the binomial distribution of the ham-ming distances by a normal distribution.

2.5 Molecular Clock Estimation

This section closely follows “The Evolutionary Analysis of Measurably Evolving Populations using Se-

rially Sampled gene sequences” by Allen Rodrigo, et al [21] and “Estimating Divergence times” by J.L

Throne, H.Kishino [28]

Our interest in estimating the rate of evolution comes from our desire to con-struct rooted, time scaled phylogenetic trees using serially sampled data. Aphylogenetic tree using only contemporary sequences can be constructed us-ing standard approaches like maximum parsimony and N-J method whichassume that all the input sequences belong to a single time point and aretherefore equally distant from the root of the tree [34]. This is not thecase with serially sampled sequences and care must be taken to scale thebranches according to their time of sampling. Rate of mutation is requiredfor this scaling of branches. Unlike the standard tree construction techniqueswhere the branch lengths are calculated using a composite parameter µt,where µ is the substitution rate and t is the sampling time, with serially sam-pled data these two parameters can be decoupled into time and substitutionrate and the tree branches can be expressed in units of either.

The rate of molecular evolution is an outcome of a complex interplay be-tween the biological systems and their surroundings. Since these systemsand their surroundings change over time, it is inherent that their evolution-ary rates would also fluctuate. These fluctuations in rates over differentperiods are best described as the relaxed molecular clock. In the case ofHIV, the rate of evolution is influenced by the rate of mutation per repli-cation cycle, the generation length as well as the probability of fixation ofthe mutation in the viral population. All these factors depend intricatelyon the biology as well as the population size of HIV. When the populationsize fluctuates, so does the fixation probability of a mutation resulting inthe change of selection pressure on the virus. It can also be argued thatdepending on the period of observation, the evolutionary rates can be as-sumed to be approximately constant implying a strict molecular clock. Suchan assumption facilitates evolutionary studies but one should always keepin mind the scenarios when the weakness of this assumption out competesits convenience.

13

Page 17: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.5. Molecular Clock Estimation

The molecular clock model selection and rate estimation was performedusing a Java based tool, BEAST [27]. It was the natural choice for perform-ing the analysis since it implements substitution models, insertion deletionmodels, demographic models for performing a series of coherent analysis.It can also explicitly model the rate of molecular evolution on every branchof the phylogenetic tree. This rate can be constrained to be constant overall branches or can be allowed to freely vary along different lineages. Thismolecular clock model can be readily combined with other models that al-low the rate of substitution to vary along the alignment while sharing somecommon parameters such as the rate of transition or transversion. Since sev-eral models can be combined, many unnecessary simplifying assumptionscan be avoided.

BEAST provides Bayesian framework for testing hypothesis on biologicaldata. Its three main genera of analysis are “constructing rooted and timemeasured phylogenies”, “estimating population change over time using coalescentbased models” and “demo-geographic sequence analysis”. We constructed timecalibrated phylogenies after estimating the clock rate and population evolu-tion plots, hence we will be discussing these two methods in detail. Demo-geographic analysis uses the location of sample collection and includes thisinformation while drawing statistical inferences.

BEAST is one of the few available platforms which can deal with timestamped data and make use of relaxed or strict molecular clock modelsto construct rooted trees and calibrate internal node ages in absolute timescales. It makes use of the Metropolis-Hastings Markov Chain Monte Carloalgorithm to provide sample based estimates of the posterior distributionsof the evolutionary parameters given a set of sequence data. It facilitatesanalysis of multi-locus data since the data can be appropriately partitionedand the evolutionary parameters can be linked/unlinked between partitions.This feature can be extremely helpful when dealing with viral sequenceswith genes e.g. Pol and Env which have different rates of mutations. Insuch a situation, the demographic model parameters can be shared betweenpartitions assuming exponential or logistic growth while the substitutionmodel parameters can be unlinked across different partitions.

Bayesian Skyline is a non-parametric demographic model available in BEASTto estimate the past population dynamics using time calibrated sequencedata. Past populations are reconstructed by estimating the demographicmodel parameters using the Bayesian methods [14]. It can estimate theevolutionary rate, substitution model parameters, phylogeny and ancestralpopulation dynamics within a single run. It then plots the past populationevolution over time. The plot begins from the estimated root age of thephylogenetic tree.

14

Page 18: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.5. Molecular Clock Estimation

Model Summary

The model first estimates a phylogenetic tree to explain the relationshipbetween n contemporaneous sequences. This is the genealogy, denoted byg. The coalescent events are then assumed to occur only on internal nodesof the tree, i.e. there can be maximum of n-1 coalescent events occurringon the tree. The population can remain unchanged after the occurrence ofa coalescent event. The indicator function Ic(i) is used to denote whetherthe ith event is a coalescent. The times at which the coalescent events occurare denoted using a vector u = (u1, u2, . . . , un−1). The period where thepopulation size remains unchanged is called as an interval and the vectorused to denote the number of coalescent events in each interval is A =(a1, a2, . . . , am). Here m is the total number of such intervals with 1 < m <n − 1. The time at which each grouped interval ends is denoted by w =(w1, . . . , wm) and the vector of effective populations sizes is denoted usingΘ = (θ1, θ2, . . . , θm).

The vectors denoting the effective population size together with the geneal-ogy g and the vector of number of coalescent events in each interval A consti-tute to the demographic and coalescent time parameters. fG(g|θ, A) denotesthe probability of the genealogy and it can be easily calculated. BEAST usesa fixed number of coalescent events m since the resulting posterior demo-graphic function is consistent for a large range of its values. The vector ofeffective population size are sampled using a MCMC algorithm. Each newpopulation size is sampled from a exponential distribution with a meanequal to the previous population size. This formulation represents our be-lief that the population size is autocorrelated through time.

The posterior distribution sampled is the product of the likelihood of piece-wise demographic model and the priors.

fhet(Θ, A, Ω, g, µ|D) =1Z

Pr(D|µ, g) fG(g|Θ, A) fΘ(Θ)X fA(A) fΩ(Ω) fµ(µ)

where, fθ(θ1) is the scale invariant prior for the first effective population sizeand the rest are drawn from an exponential distribution which is centeredaround the size of previous population. Ω contains the parameters of thesubstitution model and µ is the mutation rate that scales the genealogies(phylogenetic tree) from units of mutations per site to units of time.

The sampled posterior distribution is the product of the likelihood of piece-wise demographic model and the priors. If the sampled substitution modelparameters and mutation rates are ignored, then we get a list of states as-sociated with a genealogy and demographic parameters. Then the demo-graphic history can be constructed as a piecewise function of time for eachof the states. The marginal posterior distribution of the population size is

15

Page 19: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.6. Poisson Fitter

calculated for each time point till the time to most recent common ancestoralong with the 95% confidence interval that accounts for phylogenetic anddemographic uncertainty. The population estimates are usually smooth dueto the averaging effect of the sampling procedure in use.

2.6 Poisson Fitter

Freely available on http://www.hiv.lanl.gov

Studies in [38, 13] showed that HIV undergoes genetic bottlenecks whenthe mode of transmission is sexual (horizontal transfer) or mother to child(vertical transfer). This primarily results in new infections being initiated byhomogeneous viral strains. Once the infection is established by a single viralstrain, it is expected to grow exponentially until the host immune systeminitiates a response. This is a case of neutral evolution where the mutationcounts are expected to follow a poisson process. Once the host immunesystem triggers a response against the infection or when the patient is placedunder therapy, the virus population does not grow exponentially and theaccumulated mutations are no longer random and the Poisson distributioncannot be used for describing the pairwise Hamming Distance frequencydistribution.

Poisson Fitter [16] analyzes a set of HIV sequences assumed to be collectedclose to the time of infection to estimate whether the infection was initiatedby a single of a multiple founder viruses. It is based on maximum likelihoodapproach which first tests the hypothesis of a single founder virus straininitiating the infection and if this condition is met then the time of infectionis estimated with 95% confidence interval, provided the sample has beendrawn before the virus population was subject to any selection pressure.

Poisson Fitter can read deep-sequencing datasets and so was the naturalchoice for performing this analysis. Another reason for selecting this toolwas that it is specially designed for working with HIV and Hepatitis C virusdatasets and makes use of their default substitution rate. It has been usedin some other longitudinal studies that have been discussed in the literaturereview section.

This tool compares the sample genetic diversity with the diversity expectedunder the neutral growth model, i.e infection by a single viral strain accu-mulating random mutations, by performing statistical tests on the HammingDistance and fitting a Poisson distribution to the same using the maximumlikelihood method. It tests whether the phylogenetic tree for the sequencesshows a star topology. After minimizing the log likelihood function, weobtain the following value of the Poisson distribution shape parameter

λ =∑n

i=0 iYi

∑ni=0 Yi

= E(Y)

16

Page 20: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.7. Sliding MinPD

where Y = (Y0, . . . , Yn) are the number of pairs of sequences that have ahamming distance equal to the subscript n. The model assumes a generationtime of 2 days and a mutation rate of 2.16x10−5 per site per replication witha basic reproductive ratio R0 = 6 based on the findings from [38, 13, 18].When the sequence data shows a star phylogeny, one finds that E(Yi) = Yi.Once this condition is satisfied, the age of the root of the tree co-insides withthe time of HIV transmission.

When the goodness of fit is low, it might indicate that the sample was col-lected after the initiation of the selection pressures or the infection was ini-tiated by multiple founder viruses. Deep sequencing data can also be an-alyzed using Poisson Fitter and the plots are then on a log scale since thenumber of identical sequences are much more than the ones that differ andthis information gets masked on a linear scale.

2.7 Sliding MinPD

This section describes the methods in [8]

The traditional phylogenetic approaches treat all the sequence data as con-temporaneous data and deal with serially sampled data by merely scalingthe tips of the leaves. These methods are also not able to account for re-combination events. Furthermore, when the data is collected from quicklyevolving viruses which exhibit complex substitution patterns, phylogenetictrees are not able to depict all the information. In such a situation, an evolu-tionary network can be used to depict the ancestor descendant relationshipsand recombination events.

SlidingMinPD constructs an evolutionary network using serially sampleddata and detects recombination events using a sliding window approach. Itis based on the minimum pairwise distance approach combined with thesliding window method and recombination detection techniques.

The algorithm consists of three phases. In the first phase, every sequencethat does not belong to the first time point is deemed as the query sequenceand its pairwise distance is calculated against every other sequence fromthe previous time point. In the second phase, the breakpoints in the recom-binant sequences and their donor sequences are identified using the slid-ing window approach where the best match is identified for every windowalong the alignment. In the final step, potential ancestors from previoustime points are identified. For the non-recombinant sequences these are theones which had the shortest calculated distance in first step.

The results of this program were found to be extremely sensitive to thespecified window length. Hence the analysis was carried out for only asingle patient.

17

Page 21: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.8. Rate of synonymous and nonsynonymous substitutions

2.8 Rate of synonymous and nonsynonymous substitu-tions

This section closely follows [31], the chapter ”Neutral and adaptive protein evolution” by Ziheng Yangin [40] and the Hypothesis Testing for Phylogenies manual [20]

The rate of nonsynonymous and synonymous substitution provides an in-sight on the type of selection pressure acting on the viral population. Whenthe ratio of the rates of nonsynonymous and synonymous substitutions isgreater than one for a genomic region, then that region is said to be underpositive selection, e.g. when a HIV patient is placed under a drug ther-apy, the virus shows concerted substitutions towards acquiring a particularresidue which eventually fixates in the population making the virus drugresistant. This type of evolution is known as positive directional selection.Another kind of positive selection is to maintain the amino acid diversity atcertain sites which are potential targets of the host immune system. This iscommonly known as diversifying positive selection.

When the genomic region accumulates synonymous and nonsynonymoussubstitutions at the same rate, then it is said to be under neutral evolution.

In negative selection the rate of nonsynonymous substitutions is much lowerthan that of synonymous substitutions causing selective removal of allelesthat are deleterious. It is also commonly known as purifying selection.

A substitution behaves synonymous or nonsynonymous depending on thecodon in which it occurs and on the position within the codon. For example,GGX → GGY is always a synonymous substitution whereas CAX → CAYis synonymous if X → Y is a transition and nonsynonymous otherwise.Hence while dealing with coding sequences, it is always meaningful to usecodons as the units for selection analysis.

We used Mega software for calculating the rate of synonymous and non-synonymous mutations which is based on the method described by M. Neiand T. Gojobori in 1986. Here we describe the method in detail. First thenumber of synonymous and nonsynonymous sites for each codon presentin the sequence is computed.

Let S be the number of synonymous sites for each codon S= ∑3i=1 fi, where

fi is the fraction of synonymous changes at the ith position in a codon. Thenthe number of non-synonymous sites S for each codon can be calculated asN= 3–S

This can be understood by a simple example. In the case of TTA whichcodes for leucine

f1 = 13 (T → C), f2 = 0, f3 = 1

3 (A→ G) and so S = 23 , N = 7

3

18

Page 22: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

2.8. Rate of synonymous and nonsynonymous substitutions

The total number of synonymous and nonsynonymous sites in a sequenceof r codons is therefore given by S = ∑r

i=1 fi and N = (3r− S).

The number of nonsynonymous and synonymous nucleotide differences be-tween a pair of sequences is calculated by comparing the sequences codonby codon and counting the number of synonymous and nonsynonymousnucleotide differences for each pair of compared codons. This can be easilydone when the codons are differing at only a single position. When theydiffer at two nucleotide positions then there are two possible ways throughwhich this difference could have occurred. Both the paths are consideredthen with equal probability and the number of synonymous and nonsynony-mous substitutions are counted and Sd andNd are updated. For example: IfTTT codon is compared against GTA, then the two pathways are

1. TTT(Phe)→GTT(Val)→GTA(Val), one synonymous and one nonsyn-onymous substitution

2. TTT(Phe)→TTA(Leu)→GTA(Val), two nonsynonymous substitution

The value of Sd becomes 0.5 and Nd becomes 1.5 respectively. Similarly,when there are three nucleotide differences then six possible pathways be-tween the codons with three mutational steps within each pathway are con-sidered.

The proportion of synonymous and nonsynonymous differences are thencalculated using the equations ps = Sd

S and pn = NdN where S and N are

the average number of synonymous and nonsynonymous sites for the twocompared sequences. Further the per site substitutions are calculated usingthe Jukes and Cantor (1969) formula [31]:

d = −34

ln(1− 43

p)

Where p is ps and pn for synonymous and nonsynonymous substitutionsrespectively. This method gives approximate estimates and the formula isnot applicable to two and threefold degenerate sites. The program usedby us for this analysis makes use of this method for calculating the rate ofsynonymous and nonsynonymous changes.

19

Page 23: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Chapter 3

Results

3.1 Entropy Analysis

Patient I.D. 123

The entropy was calculated and plotted for every position in the alignment,for all the four datasets, as shown in figure 3.1. . . and 3.4.

A closer look at these shows some constant peaks. The sequence regionaround these sites was explored and summarized in table 3.1. The basenumber shows the nucleotide position whose neighboring sequence is beingviewed. The following four columns show the entropy of the base at differ-ent time points. The last column shows the neighboring sequence of thebase. Constant high entropies were found in the homo-polymeric regions.These were most likely sequencing errors since the 454 sequencing technique(used in our analysis) is known to suffer from high base mis-incorporationrate in the homopolymeric regions. These regions of the alignment weremanually curated to remove the anomalies. In general, sequences from the

Figure 3.1: Patient123: Entropy plot for samples collected in 2003

20

Page 24: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.1. Entropy Analysis

Figure 3.2: Patient123: Entropy plot for samples collected in 2005

Figure 3.3: Patient123: Entropy plot for samples collected in 2006

Figure 3.4: Patient123: Entropy plot for samples collected in 2007

21

Page 25: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.2. Molecular Clock Rate Estimation and Phylogenetic Tree Construction

Table 3.1: Patient 123: neighboring sequences of high entropy sites

Base No. Time1 Time2 Time3 Time4 Sequence preceeding the site48 0.377 0.305 0.305 0.562 GGGGGG (43-48)49 0.377 0.305 0.474 0.693 GGGGGGC (43-49)294 0.377 0.655 0.586 0.562 TTTAAATTTT (285-294)

first time point showed the highest entropy measure. These values graduallydecreased over time.

Patient I.D. 181

High entropy was measured in the several regions including few homopoly-meric regions. Regions suspected with sequencing errors have been listed inthe table 3.2. These region were manually corrected to remove the sequenc-ing errors. There was no spatial or temporal trend observed for change inentropy.

Table 3.2: Patient 123: neighboring sequences of high entropy sites

Base No. Time1 Time2 Time3 Sequence preceding the site129 0.271 0.224 0.113 AAACCAAAAA (124-133)132 0.271 0.224 0.113 AAACCAAAAA (124-133)294 0.271 0.606 0.549 TTTAAATTTT (285-293)

3.2 Molecular Clock Rate Estimation and PhylogeneticTree Construction

Sequences from all the data points were used to simultaneously estimate thesubstitution rate and for constructing the phylogenetic tree depicting ances-tral descendant relationship between sequences from all time points. BEASTsoftware was used for this analysis. After performing a number of test runsto understand the effect of every parameter on the run, the following settingwas found to provide the optimal results in terms of high log-likelihoodvalue of the estimated parameters, fast convergence of the MCMC chainand low standard deviation in the distribution of parameters. The phyloge-netic tree construction runs for both the patients were performed with thesettings specified in table 3.3.

The priors specified for the BEAST run have been summarized in table 3.4.The operator setting used to explore the sample space for the parametershas been summarized in table 3.5.

22

Page 26: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.2. Molecular Clock Rate Estimation and Phylogenetic Tree Construction

Table 3.3: Patient 123 and Patient 181: Analysis Settings. Results shown infig: 3.7, 3.9

Site Models Parameter1. Substitution Model HKY2. Base Frequencies Estimated3. Site Heterogeneity Model Gamma+Invariant Sites4. Partition in codon position None

Clock Model1. Model Strict Clock2 .Estimate Rate Yes

Demographic model1. Tree Prior Constant size2. Starting Tree Randomly Generated

Table 3.4: Patient 123 and Patient 181: Priors for the BEAST run

Parameter Prior Bound DescriptionKappa Lognormal[1,1.25] [0,inf] HKY transition transversion

parameterFrequencies uniform[0,1] [0,1] base frequenciesalpha uniform[0,10] [0,1000] gammma base frequenciespInv uniform[0,1] [0,1] proportion of invariant sites

parameterclock.rate uniform[5.4E-5,1] [0,inf] substitution raterootHeight Using Tree Prior [3.761,inf] root height of the treeconst.popsize 1/x [0,inf] coalescent population size pa-

rameter

Let us now briefly discuss how the current choice of parameters was formu-lated. A series of runs were set up with parameter rich substitution modelslike general time reversible model. These initial runs took long to converge.As a result, simpler substitution model was selected for the analysis likeHKY model. This made a significant difference in the decreasing the conver-gence time of the chain. All initial test runs were made with the uncorrelatedlognormal clock which draws the rate of each branch from an underlyinglognormal distribution. The standard deviation (i.e. ucld.stdev parameter)estimate of the clock rate was close to zero for most of these runs. A valueclose to zero for this parameter indicated clock-like behavior of the dataset[2]. Thereafter a set of runs were made with different coalescent models likeexpansion growth model, exponential growth model and constant growthmodel. The Bayes Factor was used for selecting the best fitting model. For

23

Page 27: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.2. Molecular Clock Rate Estimation and Phylogenetic Tree Construction

Table 3.5: Choice of operator values for the rate estimation and phylogenetictree construction, Patient 123 and Patient 181

Operates on Type Tuning Weight DescriptionKappa scale 0.75 0.1 HKY transition-transversion

parameter of partitionFrequencies deltaExchange 0.01 0.1 frequenciesAlpha scale 0.75 0.1 gamma shape parameter of

partitionClock.rate scale 0.75 3.0 substitution rate of partitionTree subtreeSlide 0.0035 15.0 Performs the subtree-slide re-

arrangement of the treeTree wideExchange n/a 3.0 Performs global rearrange-

ments of the treeTree wilsonBalding n/a 3.0 Performs the Wilson-Balding

rearrangement of the treeTree narrowExchange n/a 15.0 Performs local rearrange-

ments of the treerootHeight scale 0.75 3.0 root height of the tree of par-

titionInternal nodeheights

uniform n/a 30.0 Draws new internal nodeheights uniformly

popSize scale 0.75 3.0 coalescent population size pa-rameter of partition

growthRate randomWalk 1.0 3.0 exponential.growthRateSubstitution rateand heights

upDown 0.75 3.0 Scales substitution rates in-versely to node heights of thetree

both the patients, the constant population model could not be rejected. Ta-ble 3.6 shows the statistics for this parameter.

Table 3.7 summarizes the statistics for patient 181. The substitution rate inHIV protease coding region for this patient was found to be slightly higher.The shape of the distribution was similar to the one on patient 123 figure 3.7.

The phylogenetic tree topologies and parameters were also sampled overthe MCMC chain. Tree and parameter values were logged once every 10,000steps and the chain was run till the effective sample size, i.e the number ofindependent draws from the posterior distribution became greater than orequal to 250 [2]. In our case it was well over 1000.

Figure 3.7 shows the ancestor descendant relationship between sequencescollected at different times from patient 123. Most of the sequences from

24

Page 28: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.2. Molecular Clock Rate Estimation and Phylogenetic Tree Construction

Table 3.6: Patient 123 Clock Rate Statistics

mean 1.5085E-3stderr of mean 1.922E-5median 1.4137E-3geometric mean 1.3607E-395% HPD lower 3.3823E-495% HPD upper 2.8556E-3auto-correlation time (ACT) 16204.9553effective sample size (ESS) 1261.3426

Figure 3.5: patient 123: Substitution Rate distribution. 95% confidence in-terval has been marked in blue. The red region shows the rate samplingoutside the interval

Fre

quen

cy

clock.rate

-1E-3 0 1E-3 2E-3 3E-3 4E-3 5E-30

10

20

30

40

50

60

70

80

Figure 3.6: Patient 181: Substitution Rate distribution. The 95% confidenceinterval has been marked in blue. The red region shows the rate samplingoutside the interval

Fre

quen

cy

clock.rate

-2.5E-3 0 2.5E-3 5E-3 7.5E-3 1E-2 1.25E-2 1.5E-2 1.75E-2 2E-20

25

50

75

100

125

150

175

25

Page 29: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.2. Molecular Clock Rate Estimation and Phylogenetic Tree Construction

Table 3.7: Patient 181 Clock Rate Statistics

mean 4.9113E-3stderr of mean 6.0229E-5median 4.4662E-3geometric mean 4.2646E-395% HPD lower 8.9876E-495% HPD upper 1.0203E-2auto-correlation time (ACT) 16264.458effective sample size (ESS) 1847.5869

the first time point share nodes with the sequences from the second and thethird time point. However, the sequences from the final time point (i.e. theones collected in 2007) appear like a segregated clade from the rest of thetree showing only a faint relationship with low frequency haplotypes fromthe first time point. In a more rigorous analysis, it was found that haplotypenumber 25 (not shown in this tree as the frequency of the haplotype was0.02%) from the first sampling point was identical to a low frequency hap-lotype number 5 from the final time point. None of other sequences fromall the other time points were found 100% identical to any of the haplotypesfrom the final time point.

There were a series of identical haplotypes found over time. Starting fromone side of the tree, we see that haplotype 6 from 2003 branches out tohaplotype 2 of 2005 which further branches to haplotype 3 from 2006. Thesethree sequences were found to be identical over time and their frequenciesconstantly dwindled between 0.8% to 1% after which the haplotype was notvisible. There were 9 such haplotypes from the first time point that appearedagain in the later stages. These have been shown in the figure 3.8. We seein this figure that the frequency of the haplotypes decreases from the firstto the second time point but it again increases in the next time point. Thesequences from the final time point show little similarity with the ones fromprevious time points.

A pairwise sequence alignment was performed between the majority haplo-type with a frequency of 65% from the the third time point and the hap-lotype from the final time point with a frequency of 86% . These twosequences showed 97% similarity. Another tree incorporating haplotypeswith frequency as low as 0.2% was constructed to trace any relationship ofthe sequences from the final time point with low frequency variants fromthe second and the third time point. However there were no variants from2005 and 2006 sharing a node with the sequences sampled in 2007.

From a total of eight haplotypes with frequency greater than 0.5% in the firsttime point, four were found to be present in the second time point. Note that

26

Page 30: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.3. Founder Virus Analysis

the second sample was collected after almost an year of the anti-retroviraltherapy. This leads us to conclude that the viral haplotypes were latentduring the therapy period but were quick to rebound when the therapy wasstopped. So, even though an year passed in terms of absolute time scalenothing much happened in terms of evolutionary time scale.

For patient 181, the haplotypes with frequency greater than 1% progres-sively increased over time. While there were only 3 haplotypes from thefirst time point with frequency greater than 1%, the number changed to 23for the final time point. All the sequences from the first time point could betraced over the three time points. The haplotype with the highest frequencyfrom the first time point remained to be the one with highest frequency overthe next two time points as well but the frequency decreased over time from87% to 38%. This can be clearly seen in figure 3.10.

3.3 Founder Virus Analysis

The analysis was carried out to detect if the infection was initiated by a singlefounder virus haplotype. This knowledge is often necessary in tracing thesource of infection. It is also useful for explaining the pattern of geneticdiversity observed in the patient. If the phylogenetic tree constructed usingsequences from only the first time point shows a star like phylogeny thenthe infection is likely to have been initiated by a single virus. When we seedistinct clades in the tree, the infection can be assumed to be initiated bymultiple founder viruses. Another reason for the phylogenetic tree to notshow a star like topology can be that the sequences are from a time wherethe immune selection already started shaping the viral evolution. Care mustbe taken to use samples that have been collected from a time very close tothe time of infection.

Figure 3.11 and 3.12 show the phylogenetic trees constructed using thesamples from the first time points for patient 123 and 181 respectively. Thesetrees clearly do not show a star topology that would indicate a singe foundervirus infection. Since the exact dates of infection are unknown, it might bethe case that the samples used for the analysis were from viruses that wereevolving under selective pressure exerted by the immune system. The sam-ple from patient 181 that was used for the analysis was collected after a fewmonths of infection (can be seen by looking at the patient infection time lineshown in figure 2.2), this result might be misleading. We see a single viruswith frequency equal to 86% in the first time point. It is likely that the infec-tion was started by a single founder virus but due to the selective immunepressure, the founder virus showed a divergent evolutionary pattern. Theseresults were validated using the Poisson Fitter tool which also rejected thesingle founder virus infection hypothesis.

27

Page 31: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.3. Founder Virus Analysis

Figure 3.7: Patient id 123: Phylogenetic Tree showing relationship betweensequences with frequency greater than 0.5%.Blue marks sequences from2003, Green marks sequences from 2005, While red and orange show se-quences from 2006 and 2007 respectively

28

Page 32: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.3. Founder Virus Analysis

Figure 3.8: Patient id 123: Haplotypes traced over time with their measuredfrequencies

29

Page 33: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.3. Founder Virus Analysis

Figure 3.9: Patient id 181: Phylogenetic Tree showing relationship betweensequences with frequency greater than 0.5%.Blue marks sequences sampledon 28.09.2005, Green marks sequences from 15.03.2006, While orange showssequences sampled on 08.09.2006

30

Page 34: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.3. Founder Virus Analysis

Figure 3.10: Patient id 181: Haplotypes traced over time with their measuredfrequencies

31

Page 35: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.4. Demographic Reconstruction

Figure 3.11: Patient id 123: Phylogenetic tree with sequences from 2003 withfrequency greater than 0.5%

Figure 3.12: Patient id 181: Phylogenetic tree with sequences from 2005 withfrequency greater than 0.5%

3.4 Demographic Reconstruction

The effective population size of the virus was plotted over the period ofinfection for patient 123 figure 3.13 and patient 181 figure 3.14. The timepoints where the sequences were collected have been marked in the plot.For patient 123 we see a slight increase in the viral population over the timecourse of infection. The results of this analysis were robust to the use ofdifferent model settings. This indicated that the sequence data containedenough information to not be sensitive to slight model mis-specification.However, since the model could not be informed about the anti-retroviraltherapy period, we do not know if the results would be sensitive to this

32

Page 36: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.4. Demographic Reconstruction

Figure 3.13: Patient id 123: Demographic construction of the viral popula-tion with frequency greater than 0.5%

information. As we previously concluded from figure 3.8 that the viruswas under a latency period during the therapy, one would assume that theresults of the demographic should not be affected by the therapy phase.

For patient 181, we see no change in the effective population size of thevirus over the first year of infection. The times of sample collection havebeen marked in the figure 3.14.

Figure 3.14: Patient id 181: Demographic construction of the viral popula-tion with frequency greater than 0.5%

As a proof of principle of the coalescent model used in our study, anotheranalysis was performed on a sequence dataset collected over a period of 3years from an influenza epidemic. This Demographic plot was in agreementwith the variation observed during the epidemics. This data contained se-quences with length 1700 base pairs and there were samples collected overevery few months giving a high coverage over a period of three years. The

33

Page 37: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.5. Sliding MinPD

results are not shown since these are freely available on the BEAST website.

3.5 Sliding MinPD

The evolutionary network was constructed for the patient 123 using the Slid-ing MinPD software. Sequences from 2005, 2006 and 2007 were used in theanalysis. The samples from the first time point were ignored since the major-ity haplotypes were same between the first and the second time point. A setof runs were set up with differing window sizes and sliding window sizes.The results of the runs were found to be sensitive to slight changes in thewindow length. Due to lack of confidence in the results of this analysis, thiswas not carried out for patient 181. The results of the run have been shownin figure 3.15 and 3.16.The parameters for the runs except the window andthe sliding window size has been mentioned in table 3.8.

Table 3.8: Parameters used for constructing the evolutionary network usingSliding MinPD

Active Recombination Detection YesRecombination Detection Option Bootscan RIPCrossover option ManyPCC threshold p0.4bootstrap recomb. tiebreaker option Yesbootscan seed E-3bootscan threshold TN93substitution model 1847.5869gamma shape - rate heterogeneity alpha 0.5show bootstrap values Nomarkers for clustering Yesclustering distance threshold T0.001clustering option by bases but post

In figure 3.15 and 3.16 the sequences in a single column belong to thesame time point. The ancestor descendant relationships are denoted usingblack dotted lines. The sequences marked in red are the recombinants andthe red line when traced back in time shows the two donor haplotypes.It can be clearly seen that when the window size is changed the number ofrecombinants change. Hence this analysis was not carried out for the secondpatient.

34

Page 38: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.5. Sliding MinPD

Figure 3.15: Patient ID 123: Evolutionary Network constructed using win-dow size 50 and the sliding window size of 15

35

Page 39: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.5. Sliding MinPD

Figure 3.16: Patient ID 123: Evolutionary Network constructed using win-dow size 50 and the sliding window size of 30

36

Page 40: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

3.6. Conclusions

3.6 Conclusions

A series of downstream sequence analysis were carried out on the timestamped data collected from two patients. We formulated an ordered se-quence in which these standardized tests could be applied to any set of deep-sequenced, single gene datasets. These tests returned information regardingthe evolutionary pattern being followed by the virus and provided estimatesfor clinically important parameters like the time of HIV transmission andthe population structure of the virus over the infection period. While therewere no special tools used to detect the effect of anti-retroviral therapy, en-tropy measure and the change in haplotype frequency over time were usedto identify any of its visible effects. Since we used coalescent based modelsfor estimating these parameters, the estimates were unique for every patient.As this method is sensitive to the amount of genetic variation present in thedata set, the estimated parameters had large confidence intervals. The overestimation of the time of HIV transmission to the host could be attributedto the phylogenetic tree structure that did not support the single foundervirus hypothesis. This implied that the age of the root of the tree showedthe time to the common ancestor of the multiple viruses that infected thepatient rather than the time of HIV transmission to the host.

The relatively conserved protease coding sequences were successful in con-structing robust phylogenetic trees showing the relationship between thehaplotypes sequenced from samples collected at different time points. Thepre and post anti-retroviral therapy samples from the patient 123 showedhigh similarity in terms of haplotype sequences and their frequencies whichled us to conclude that the virus went under hibernation during the therapybut was quick to rebound once it was discontinued. There was no evidenceof recombination found in the sample sequences. The viral population plotswere fairly smooth indicating that the effective population size remained al-most constant over the course of infection. Even though the number of hap-lotypes with frequency greater than 1% increased over time for patient 181indicating increased intra-population diversity, this was not followed witha increase in the effective population size. A closer look at the sequenceidentity revealed that these haplotypes were on an average 99% identical.The conserved regions of the virus are able to construct stable phylogenies,but these regions should be used in combination with other variable regiondataset to provide population estimates with smaller confidence intervals.Another solution would be to increase the number of sampled time pointsfor improving the coalescent based estimates.

37

Page 41: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Bibliography

[1] http://www.niaid.nih.gov/factsheets/howhiv.htm.

[2] A rough guide to beast1.4.

[3] 61.0.6. lentivirus. International Committee on Taxonomy of Viruses, 2002.

[4] Overview of the global aids epidemic, 2006.

[5] D Baltimore. Viral rna-dependent dna polymerase: Rna-dependentdna polymerase in virions of rna tumour viruses. Nature,226(5252):1209–1211, 2000.

[6] Daniel Beyer. multiplication of hiv. 1997.

[7] M T et al. Bretscher. Recombination in hiv and the evolution of drugresistance:for the better or for worse? BioEssays, 26:180–188.

[8] P. Buendia and G. Narasimhan. Sliding minpd: Building evolutinarynetworks of serial samples via an automated recombination detectionapproach. Bioinformatics, 2007.

[9] McElroy K Gaudieri S Pham ST et al. Bull RA, Luciani F. Sequential bot-tlenecks drive viral evolution in early acute hepatitis c virus infection.PLoS Pathog, 7(9), 2011.

[10] Kim Pl Chan D. Hiv entry and its inhibition. Cell, 93(5):681–684, 2003.

[11] Achaz G. et al. Mol Biol Evol., 21(10), 2004.

[12] Cunningham et al. Manipulation of dendritic cell function by viruses.Current opinion in microbiology, 13(4):524–529, 2010.

38

Page 42: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Bibliography

[13] Cynthia A. Derdeyn et al. Envelope-constrained neutralization-sensitive hiv-1 after heterosexual transmission. Science, pages 1134–1137, 2004.

[14] Drummond A. J. et al. Bayesian coalescent inference of past populationdynamics from molecular sequences. Molecular Biology and Evolution,7(214), 2007.

[15] Gilbert P B et al. Comparison of hiv-1 and hiv-2 infectivity from aprospective cohort study in senegal. Statistics in Medicine, 22(4), 2003.

[16] Giorgi EE et al. Estimating time since infection in early homogeneoushiv-1 samples using a poisson model. BMC Bioinformatics, 25(11), 2010.

[17] Korber et al. Numbering positions in hiv relative to hxb2cgl. HumanRetroviruses and AIDS, 1998.

[18] Lee et al. Modelling sequence evolution in acute hiv-1 infetion. J TheorBiol., 2009.

[19] Morgan D et al. Hiv-1 infection in rural africa: is there a difference inmedian time to aids and survival compared with that in industrializedcountries? AIDS, 16(4):597–632, 2002.

[20] Pond S.L. et al. Estimating selection pressures on alignments of codingsequences, 2007.

[21] Rodrigo A. et al. Mathematics of Evolution and Phylogenetics. Oxford Uni-versity Press, 2007. The Evolutionary Analysis of Measurably EvolvingPopulations using Serially Sampled gene sequences.

[22] Shankarappa et al. Consistent viral evolutionary changes associatdewith the progression of human immunodeficiency virus type 1 infec-tion. Journal of Virology, 73(12):10489–10502, 1999.

[23] Suzanne English et al. Phylogenetic analysis consistent with a clini-cal history of sexual transmission of hiv-1 from a single donor revealstransmission of highly distinct variants. Retrovirology, 8(54), 2011.

[24] Zagordi et al. Shorah: Estimating the genetic diversity of a mixed sam-ple from next-generation sequencing data. BMC Bioinformatics, 12(119),2011.

[25] Zheng Y H et al. Newly identified host factors modulate hiv replication.Immunol Lett, pages 225–234, 2005.

39

Page 43: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Bibliography

[26] Poon Art F.Y. Dates of hiv infection can be estimated for seropreva-lent patients by coalescent analysis of serial next-generation sequencingdata. AIDS, 25(16):2019–2026, 2011.

[27] Drummond A J and Rambaut A. Beast:bayesian evolutionary analysisby sampling trees. BMC Evolutionary Biology, 7(214), 2007.

[28] H.Kishino J.L Throne. Statistical methods in molecular evolution. Springer,2005. Estimation of Divergence Times from Molecular Sequence Data.

[29] DePasquale M. P. Kartsonis N. Hanna G. J. Wong J. Finzi D. RosenbergE. Gunthard H.F. Sutton L. Savara A. Petropoulos C. J. Hellmann N.Walker B. D. Richman D. D. Siliciano R. Martinez-Picado, J. and R. T.D’Aquila. Antiretroviral resistance during successful therapy of humanimmunodeficiency virus type 1 infection. Proc. Natl. Acad. Sci. U. S. A,97(20):10948–10953, 2000.

[30] Hall N. Advanced sequencing technologies and their wider impact inmicrobiology. J of Exp. Biol., 210(Pt 9):1518–1525, 2007.

[31] Gojobori T Nei M. Simple methods for estimating the numbers of syn-onymous and nonsynonymous nucleotide substitutions. Mol Biol Evol.,3(5), 1986.

[32] V. W. Pollard and M. H. Malim. The hiv-1 rev protein. Annu. Rev.Microbiol., 52:491–532, 1998.

[33] Weiss RA. How does hiv cause aids? Science, 260(4):1273–1279, 1993.

[34] N Saitou and M Nei. The neighbor joining method:a nw method forreceonstructing phylogenetic trees. Mol Bio Evol, 4(4), 1987.

[35] Coulson AR Sanger F. A rapid method for determining sequences indna by primed synthesis with dna polymerase. J of Mol Biol, 94(3):441–448, 1975.

[36] A. Siepel and B. Korber. Scanning the database for recombinant hiv-1genomes. Los Alamos National Laboratory, pages 35–60.

[37] Dan Stowell. The molecules of hiv. 2006.

[38] SM. et al. Wollinsky. Selective transmission of human immunodefi-ciency virus type-1 variants from mothers to infants. Science, 255:1134–1137, 1992.

[39] Sodroski J Wyatt R. The hiv-1 envelope glycoproteins: fusogens, anti-gens, and immunogens. Science, 280(5371):1884–8, 1998.

40

Page 44: In Copyright - Non-Commercial Use Permitted Rights ...5197/eth... · Phylogenetic Analysis of HIV Samples from a Single Host Master Thesis Rounak Vyas November 20, 2011 Advisors:

Bibliography

[40] Ziheng Yang. Computational Molecular Evolution. Oxford, 2006.

41