The Diversity of HIV-1

A thesis submitted to The University of Manchester for the Degree of

PhD

in the Faculty of Life Sciences

2008

John Patrick Archer

List of Contents

List of Figures

List of Tables

List of Abbreviations

Abstract

Declaration

Acknowledgements

Chapter 1: Biological Aspects of HIV Evolution

What is HIV

Phylogeny of HIV-1

The Origins of HIV

Recombination

Dealing with Global HIV-1 Diversity

Co-receptor Usage

Remaining Chapters

References

Chapter 2: Sequence Data and Phylogenetic Trees

Molecular Phylogeny

Global Alignments

Local Alignments

Measuring Genetic Change

Phylogenetic Trees

Neighbour Joining

Maximum Likelihood Trees

References

Chapter 3: CTree - comparison of clusters between phylogenetic trees made easy

Abstract

Introduction

Novel Features
    (i) Heuristically Defining Clusters
    (ii) Manually Defining Clusters
    (iii) SDR And SDV
    (iv) Working with Random Trees
    (iv) Finding the Center of the Tree (COT)

The Heuristic clustering Algorithm
    (i) The Explore Phase
    (ii) The Selection Phase

Standard Features

Typical Usage

Acknowledgements

References

Chapter 4: Understanding the Diversification of HIV-1 groups M and O

Abstract

Introduction

Methods
    Tree Metrics
    Datasets
    Phylogenetic Analysis
    Random Trees

Results

Discussion

Acknowledgements

References

Chapter 5: Prediction of HIV-1 Recombination Breakpoint Location

Abstract

Introduction

Results
    Model Parameters
    In Vitro Breakpoint Predictions
    Global (in vivo) Breakpoints

Discussion

Methods
    In Vitro Recombinant Breakpoints
    In Vitro Breakpoint Distributions
    Predicted Recombinant Breakpoints
    Random Breakpoint Distributions
    Breakpoint Distributions incorporating Mismatch Influence
    Model Breakpoint distributions
    Global CRFs

References

Chapter 6: A Strategy for Identifying Significant HIV Sequence Diversity Using Structural Constraints

Abstract

Introduction

Results
    Predicting viral evolution
    Using predictions to inform vaccine design

Discussion

Methods
    The model
    Data sets
    The coverage algorithm

References

Chapter 7: Detection of Low Frequency CXCR4-Using HIV-1 with Ultra-deep Pyrosequencing

Abstract

Introduction

Methods
    Datasets
    Day 1 and Day 11 datasets
    Database entropy and amino acid usage

Results

Discussion
    Future Developments

References

Chapter 8: Final Discussion

Conclusion

References

Appendix I: HIV-1 group M Gag and Pol Trees

Appendix II: HIV-1 group O Gag and Pol Trees

Appendix III: Calculating the Reduction in Breakpoint Occurrence

Appendix IV: Amino Acid Occurrence within the p17

Appendix V: TP and FP rates for the Reduction Model

(Word Count: 48,258)

List of Figures

Figure 1.1 Impact of HIV-1

Figure 1.2 Biology of HIV-1

Figure 1.3 Phylogenetic Topology of HIV-1 Group M

Figure 1.4 Epicentre of the HIV-1 Group M Pandemic

Figure 1.5 Circulating Recombinant Forms

Figure 1.6 HIV-1 Group O

Figure 1.7 Cross Species Transmission

Figure 1.8 Recombinant origin of SIVcpz

Figure 1.9 Non Random Recombination

Figure 1.10 Four Phylogenetic shapes

Figure 1.11 How Pyrosequencing Works

Figure 2.1 Scoring Matrix for Aligning Two Sequences

Figure 2.2 Recursion and Trace-back

Figure 2.3 Example of a Phylogenetic Tree

Figure 2.4 Bootstrap Analysis

Figure 3.1 Interface for CTree

Figure 3.2 Clustering Sequences on a Tree

Figure 4.1 Subtype Diversity Ratio and Subtype Diversity Variance

Figure 4.2 Phylogenetic History of HIV-1 Group M and Group O

Figure 4.3 Subtype Diversity Ratio Distributions

Figure 4.4 The Center of the Group M Pandemic

Figure 4.5 Model of HIV-1 Group M Subtype Emergence

Figure 5.1 Prediction of Recombinant Breakpoints

Figure 5.2 Effects of Sequence Identity on Breakpoint Location

Figure 5.3 Predicted Breakpoint Distributions for gp120

Figure 5.4 Predicted Breakpoint Distributions for the Entire Genome

Figure 6.1 Predicting HIV Evolution at Individual Sites

Figure 6.2 True Positives, False Positives and True Negatives

Figure 6.3 Amino Acid Frequencies within P17

Figure 6.4 Distribution of Nine-mers in Relation to Coverage

Figure 6.5 Generation of Optimised Sequence Constructs

Figure 6.6 Sequence Coverage

Figure 7.1 Sequence logos of the CCR5 and CXCR4-using viruses

Figure 7.2 Cell Entry Inhibition

Figure 7.3 Generation of a Flowgram

Figure 7.4 Protocol for Handling Pyrosequenced Data

Figure 7.5 Segment Length, Alignment Score and Frequency

Figure 7.6 Nucleotide Coverage Across gp160

Figure 7.7 Shannon Entropy Across gp160

Figure 7.8 Phylogenetic Tree of Day 1 and Day 11 V3 Segment Data

Figure 7.9 Updated Protocol

Figure 8.1 Novel Approach to Generating Optimized Constructs

List of Tables

Table 7.1 Rates of Insertion and Deletion

Table 7.2 HIV-1 Phenotypes Counts

Table 7.3 Alternate Phenotype Counts

List of Abbreviations

BLOSUM Blocks Amino Acid Substitution Matrices

COT Center Of Tree

CRF Circulating Recombinant Form

CTL Cytotoxic T Lymphocyte

dn Non Synonymous

DRC Democratic Republic Of Congo

ds Synonymous

FN False Negative

FP False Positive

HAART Highly Active Antiretroviral Therapy

HIV Human Immunodeficiency Virus

HPS Homopolymeric Stretch

MHC Major Histocompatibility Complex

NSI Non-Syncytium Inducing

PAL Phylogenetic Analysis Library

PAM Percent Accepted Mutation

PSSM Position-Specific Scoring Matrix

RT Reverse Transcriptase

SDR Subtype Diversity Ratio

SDV Subtype Diversity Variance

SI Syncytium Inducing

SIV Simian Immunodeficiency Virus

TN True Negative

TP True Positive

URF Unique Recombinant Form

Abstract

The phylogeny of the Human Immunodeficiency Virus type 1 (HIV-1) is

characterized by extensive diversity. However, when it comes to the

persistence of the virus, not all of this diversity is relevant. In this thesis I

show that the significance of diversity, both in relation to epidemiology and

vaccine design, varies within and between the HIV-1 groups. The

substructure present within group O and at the center of the group M

pandemic was observed to be similar to that found on random tree

topologies, whereas the substructure present within globally sampled group

M strains was significantly different. Next an algorithm for generating

artificial sequence constructs from this diversity, while maximizing the

inclusion of “meaningful” potential epitopes, was developed and analyzed.

The utilization of “meaningful” diversity along with the characterization of the

significance of diversity will have profound effects on the future of vaccine

development and control strategies. Recombination contributes significantly

to the diversity present within viruses classified as HIV-1. As with the

extensive diversity caused by point mutations and indels, I show that not all

recombinant breakpoints are important in relation to the persistence of the

virus. Many recombination events are a result of high sequence identity

between RNA templates, which encourages template switching during

reverse transcription. When documenting the locations of recombinant

breakpoints, their importance in relation to viral persistence should be taken

into account. Within the final research chapter of this thesis I developed a

novel protocol for detecting low frequency variants within viral populations

using pyrosequenced data. To summarize, the aim of this thesis was to

characterize the significance of HIV-1 diversity, as well as to develop novel

approaches to make this diversity both more understandable and tractable in

terms of intervention strategies.

Declaration

No part of this thesis has been submitted in support of an application for any

degree or qualification of The University of Manchester or any other

University or Institute of learning.

Submission in Alternative Format

This thesis has been submitted in alternative format with permission from the

Faculty of Life Sciences Graduate Office. A summary of the chapters can be

found at the end of chapter 1.

Copyright Statement

The author of this thesis (including any appendices and/or schedules to this

thesis) owns any copyright in it (the “Copyright”) and s/he has given The

University of Manchester the right to use such Copyright for any

administrative, promotional, educational and/or teaching purposes.

Copies of this thesis, either in full or in extracts, may be made only in

accordance with the regulations of the John Rylands University Library of

Manchester. Details of these regulations may be obtained from the

Librarian. This page must form part of any such copies made.

The ownership of any patents, designs, trade marks and any and all other

intellectual property rights except for the Copyright (the “Intellectual Property

Rights”) and any reproductions of copyright works, for example graphs and

tables (“Reproductions”), which may be described in this thesis, may not be

owned by the author and may be owned by third parties. Such Intellectual

Property Rights and Reproductions cannot and must not be made available

for use without the prior written permission of the owner(s) of the relevant

Intellectual Property Rights and/or Reproductions.

Further information on the conditions under which disclosure, publication and

exploitation of this thesis, the Copyright and any Intellectual Property Rights

and/or Reproductions described in it may take place is available from the

Head of School of (insert name of school) (or the Vice-President) and the

Dean of the Faculty of Life Sciences, for Faculty of Life Sciences’

candidates.

Acknowledgements

This thesis is dedicated to my supervisor David L. Robertson who provided

continuous ideas, advice, insight, discussion and support in a patient and

devoted manner.

I would like to thank the following people for academic support and

discussion: Kathryn Else, Simon Lovell, Marilyn Lewis, John Pinney, Vicki

Kelly, Nick Gresham, Adam Huffmann, Simon Williams, Etienne Simon-

Loriere, James Eales, Jonathan Dickerson, Jamie MacPherson, Jun Fan,

Julie Huxley–Jones, Raquel Linheiro and all staff and students of the

bioinformatics corridor.

I would also like to thank Sara-Jane, Joel, Molly, Sorcha and my parents,

Chapter 1: Biological Aspects of HIV Evolution

What is HIV?

The human immunodeficiency virus (HIV) is the causative agent of Acquired

Immune Deficiency Syndrome (AIDS). There are two phylogenetically

distinct types of HIV referred to as HIV-1 and HIV-2. A characteristic of HIV

infection is the relatively long infection period within the host – which in the

absence of treatment is followed by eventual progression to AIDS. The long

infection period, in conjunction with population demographics, has led to

HIV-1's unique epidemiology [1], which affects over 33.2 million individuals [2]

worldwide. In 2007 over 2.5 million people were newly infected and

another 2.1 million died of the disease, so the virus continues to cause much

hardship and misery decades after its entry into the human population

during the first half of the twentieth century [3]. Over two thirds of infections

occur in sub-Saharan Africa (Fig. 1.1), where access to treatment is often

non-existent. Without access to long-term drug treatment the mortality rate

is near 100%.

The virus itself belongs to the group of retroviruses known as lentiviruses [4].

These are spherical in shape, roughly 80 – 100 nm in diameter and possess

two usually identical copies of a single stranded RNA genome approximately

10 kb in length [5]. Following reverse transcription the genome must

integrate into the host cell's DNA in order to replicate (Fig. 1.2, panel A). The

life cycle within an individual host is characterized by exceptionally high

replication [6, 7], mutation [8] and recombination [9, 10] rates which,

combined with the positive selection promoted by the host's immune

response [7, 11, 12], result in the huge amount of diversity observed at both

the intra-host [13] and inter-host [14] levels. This enormous amount

of diversity is responsible for the long-term persistent infection of individual

hosts as well as for the current lack of a preventative or therapeutic vaccine

[15].

As with all retroviruses the genome of HIV (Fig. 1.2, panel B) possesses

three major coding regions: gag – whose products are the internal viral

proteins such as those needed for the matrix (p17), the capsid (p24) and

nucleocapsid (p7) structures, pol – whose products are the enzymes

responsible for reverse transcription (reverse transcriptase) and integration

(integrase) and env – whose products consist of two types of envelope

protein, gp120 and gp41 [5]. The gp120 protein mediates virus attachment and entry

into host cells via interaction with host cell receptors, while gp41 mediates

membrane fusion. Other smaller accessory genes are: vif – which

suppresses host factors (APOBEC3G and APOBEC3F) that inhibit infection,

vpr – which enhances post-entry infectivity, vpu – which is only found in

HIV-1, vpx – which is only found in HIV-2, tat – which is involved in the activation

of viral transcription, nef – which down-regulates CD4/MHC expression and rev

– which induces nuclear export of viral RNA [5]. 5’ and 3’ LTRs mark the

ends of the genome.

Figure 1.1 Impact of HIV-1

Distribution of the impact of HIV on the global population. Figures taken from [2].

Figure 1.2 Biology of HIV-1

(A) The life-cycle of HIV within a host cell is represented by the black arrows. The red lines

represent viral RNA while the blue lines represent the reverse transcribed viral DNA. On

initial contact with a host cell membrane gp120 spikes on the surface of the virus must bind

to specific host receptors and co-receptors. The primary host cell receptor that gp120 binds

to is CD4. T-lymphocytes and cells of monocyte macrophage lineage both express the CD4

receptor. Secondary co-receptors, such as CXCR4 and CCR5, will be discussed in detail

under the section Co-receptor Usage and Cellular Tropism. After membrane fusion viral

RNA is reverse transcribed into DNA by viral reverse transcriptase – during this process

copy-choice recombination may occur. Viral DNA can then enter the host cell nucleus

where it is incorporated into the host's DNA by virally encoded integrase. Transcription of viral

RNA then takes place. Some of the RNA transcribed will be complete genomic RNA that

will be used within new virions while the remainder will be used in the translation of viral proteins

required to package the genomic RNA. After packaging of genomic viral RNA the new virus

particle leaves the host cell. (B) Genome structure of HIV-1 (http://www.hiv.lanl.gov/). The

gene highlighted red is the HIV-1 specific vpu gene.

Phylogeny of HIV-1

Strains classified as HIV-1 fall into three distinct groups. These are labelled

M (Major), O (Outlier) and N (New or non M/non O) [16]. Group M is almost

entirely responsible for the global pandemic. The diversity present within the

group is extensive when compared to other rapidly evolving viral genomes

such as influenza [17]. When represented on a phylogenetic tree strains

within group M form well defined clusters. Nine of these clusters are

currently termed subtypes. These are labeled A to D, F to H, J and K [16]

and are consistent in their phylogenetic topology in relation to each other

regardless of the section of their genome being compared [14]. These

subtypes, supported by a low Subtype Diversity Ratio (SDR) [18] (Fig. 1.3,

panel B) as well as high bootstrap values [14], are roughly equidistant from

each other when represented on a phylogenetic tree (Fig. 1.3, panel A).

They also have characteristically long evolutionary branches leading

into them. As a result of these combined characteristics HIV-1 group M’s

phylogeny is often described as being “double starburst”-like in nature.
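
To make the link between a low SDR and well-defined clusters concrete, the following is a minimal sketch. It assumes the SDR is the mean pairwise distance between tips of the same subtype divided by the mean pairwise distance between tips of different subtypes; the formal definition is the one given in [18]. The function name, the input format and the toy distances are hypothetical and are not taken from CTree or any other tool.

```python
from itertools import combinations
from statistics import mean


def subtype_diversity_ratio(labels, dist):
    """Mean within-subtype distance divided by mean between-subtype distance.

    labels: dict mapping tip name -> subtype label (hypothetical input format)
    dist:   function (tip_a, tip_b) -> pairwise distance between the two tips
    Raises statistics.StatisticsError if either class of pairs is empty.
    """
    within, between = [], []
    for a, b in combinations(sorted(labels), 2):
        target = within if labels[a] == labels[b] else between
        target.append(dist(a, b))
    return mean(within) / mean(between)


# Toy data (invented): two tight clusters that lie far apart.
toy_labels = {"a1": "A", "a2": "A", "c1": "C", "c2": "C"}


def toy_distance(a, b):
    # Same subtype: short distance; different subtypes: long distance.
    return 0.1 if toy_labels[a] == toy_labels[b] else 0.4


sdr = subtype_diversity_ratio(toy_labels, toy_distance)
print(f"SDR = {sdr:.2f}")
```

Under this assumed definition, tight and well-separated clusters give a ratio well below 1, while tips labelled without regard to their position on the tree give a ratio close to 1.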

Other significant clusters are formed by Circulating Recombinant Forms

(CRFs). These have arisen as a result of recombination events between

divergent HIV strains within individual hosts (http://www.hiv.lanl.gov). To

qualify as a CRF three epidemiologically unlinked viruses with the same

mosaic genome and consistent phylogenetic clustering must be

characterized [16]. The nine subtypes are often referred to as being pure,

recombinant-free lineages. However, this is misleading as recombination is

an important part of HIV’s life cycle [9, 10, 19]. The subtypes have emerged

through a unique epidemiological history [14] and their classification has

been largely based on global sampling bias. Recently, Abecasis et al.

observed that subtype G is actually a recombinant lineage involving

subtypes A and J as well as the designated recombinant lineage CRF02_AG

[20]. Thus, despite the rigid classification system, the phylogeny of HIV

should be viewed as being highly dynamic in nature.

Figure 1.3 Phylogenetic Topology of HIV-1 Group M

(A) Neighbour joining phylogenetic tree reconstructed using global group M envelope

gp160 sequences sampled from the LANL HIV Sequence Database. The bold lettering

corresponds to the subtype designations from the Database. Asterisks (*) mark bootstrap values

greater than 90%. (B) The relationship between the SDR and the quality of clustering on a

phylogenetic tree as defined in [18].

As for any genome, different regions of the HIV genome have different rates

of evolution. Between the nine current subtypes, env amino acid sequences

differ by approximately 25 to 35% [21], while gag coding

sequences differ by around 14% [22]. It has been suggested that,

given the estimated mutation rate of 1% per year within the env gene, any

two HIV group M subtypes have been diverging for at least 30 years [22]. Although mutations

generate novel strains, which can potentially become the roots of new

lineages, their fixation into longer-term HIV-1 subtypes depends on more

complicated epidemiological factors. Grenfell et al. suggested that the long

infection period of HIV-1 is responsible for the slowly changing inter-host

phylogenetic topography [1]. Such a long infection period allows for

individual risk groups to be replenished over time by new potential hosts.

Thus the inter-host phylogeny of HIV reflects the behavioural, demographic

and epidemiological history of transmission rather than immune selection

alone, which is more responsible for the rapidly changing intra-host viral

phylogeny. This is because natural selection of individual strains becomes

largely irrelevant during transmission of the virus between individuals within such risk

groups: the spread of the virus depends mainly on the

behavioural aspects of individuals, the size of the risk population and time, and not

on the fitness of the individual viral strain. In the absence of natural selection,

genetic drift has a large influence on the slow evolution of the current

lineages.
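
As a rough check on the 30-year figure quoted earlier in this paragraph for env, the arithmetic can be sketched as below. This is an illustration only: it assumes a constant rate and no correction for multiple changes at the same site, and the function name and the `lineages` parameter are hypothetical. The text does not state whether the 1% per year applies to the pair of subtypes as a whole or to each diverging lineage separately, so both readings are exposed.

```python
def years_of_divergence(distance, rate_per_year, lineages=1):
    """Time needed to accumulate `distance` at a constant rate.

    distance:      pairwise divergence as a fraction (0.30 means 30%)
    rate_per_year: fraction of sites changing per year
    lineages:      1 if the rate describes the pair as a whole,
                   2 if it applies to each diverging lineage separately
    """
    if rate_per_year <= 0 or lineages not in (1, 2):
        raise ValueError("rate must be positive and lineages must be 1 or 2")
    return distance / (rate_per_year * lineages)


# env divergence range between subtypes quoted in the text: 25 to 35%.
for d in (0.25, 0.30, 0.35):
    years = years_of_divergence(d, 0.01)
    print(f"{d:.0%} divergence at 1% per year: about {years:.0f} years")
```

With `lineages=1` a divergence of about 30% corresponds to roughly 30 years, which is in line with the figure quoted above; with `lineages=2` the same inputs give half that time.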

The Democratic Republic of Congo (DRC) has been proposed to be the

epicentre of the HIV-1 group M pandemic [23]. Uniquely to the region, with

the exception of subtype B [24], it was possible to identify strains from each

of the current global subtypes. A high degree of intra- and inter-subtype

diversity was also observed along with many unidentified strains [18, 23, 25].

Using the SDR, Rambaut et al. showed that group M strains from

within the DRC region had very little organized substructure when compared

to their global counterparts [18] (Fig. 1.4, panel A). In a tree consisting

solely of strains isolated from the DRC region the double starburst shape

that is characteristic of the global group M phylogeny is no longer present

(Fig. 1.4, panel B). On the global tree (Fig. 1.3, panel A) the well-defined

clusters are a result of chance exportations of DRC strains, with a

resulting bottleneck effect, into new susceptible global risk groups followed

by subsequent diversification [18]. These founder effects are what gave rise

to the long evolutionary branches that extend into each of the subtypes [26].

As a result the global subtype based classification system does not reflect

the extent of true diversity present within the epicentre of the pandemic.

This will be the subject of chapter 4. In other rapidly evolving viruses

such as influenza [17], a change of less than 2% in the amino acid sequence

can cause failure in the cross-reactivity of the polyclonal response to the

influenza vaccine. Despite this, HIV-1 strains within the DRC are classified

into subtypes according to their proximity on a phylogenetic tree to their

global counterparts. Vaccine design strategies are often based on these

subtypes [27, 28]. This would have far-reaching and potentially disastrous

consequences within the region, as such vaccines could not protect the

population against the diversity of strains present.

Figure 1.4 Epicentre of the HIV-1 Group M Pandemic

(A) Maximum-likelihood phylogenies were estimated using HIV-1 sequences obtained from

within the Democratic Republic of Congo and HIV-1 global isolates. Given a phylogeny with

tips labelled according to subtype, the SDR was calculated (red arrows). A null distribution

of SDR values (blue) was obtained by simulating random phylogenies under a model of

exponential growth. (Figure and legend modified from [18]). (B) Neighbour joining

phylogenetic tree re-constructed using sequences sampled solely from within the DRC.

Sequences were obtained from [23, 25, 29]. The bold lettering corresponds to the subtype

designations from the Los Alamos HIV-1 sequence database.

Thirty four CRFs are currently known to exist within group M, with CRF01_AE

(South East Asia), CRF02_AG (West and central Africa), CRF06_cpx

(Eastern Europe) and CRF11_cpx (South America) being the

most prevalent, accounting for 8.4% of the represented strains

(http://www.hiv.lanl.gov). This number is steadily increasing with newly

emerging CRFs frequently being discovered. In contrast to the nine

subtypes, with the exception of subtype G [20], their phylogenetic

relationship with each other is dependent on the region of the genome being

examined [19] as strains falling into these clusters are comprised of mosaic

genomes that originated from different ‘parent’ clusters (Fig. 1.5, panels A

and B).
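
The region dependence of a mosaic genome's phylogenetic relationships can be illustrated with a simple sliding-window comparison against putative parental sequences. The following is a minimal sketch only: it assumes pre-aligned, ungapped sequences of equal length, and the function names, toy sequences and window size are hypothetical. It is not the method used in [19].

```python
def window_identity(a, b):
    """Fraction of identical positions between two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closest_parent_by_window(query, parents, window, step):
    """For each window, report the parent with the highest identity to the query.

    parents is a dict mapping a label to an aligned sequence.
    Returns a list of (window_start, best_parent_label, identity).
    """
    profile = []
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        scores = {
            label: window_identity(query[start:end], seq[start:end])
            for label, seq in parents.items()
        }
        best = max(scores, key=scores.get)
        profile.append((start, best, scores[best]))
    return profile


# Toy data (hypothetical): a mosaic built from the first half of parent A
# and the second half of parent B.
parent_a = "ACGTACGTACGTACGTACGT"
parent_b = "TTGCATGCCAGTTAGCCTAA"
mosaic = parent_a[:10] + parent_b[10:]

profile = closest_parent_by_window(
    mosaic, {"A": parent_a, "B": parent_b}, window=10, step=10
)
for start, label, identity in profile:
    print(start, label, identity)
```

In this toy case the closest parent changes from A to B at the breakpoint, which is the pattern that makes the placement of a CRF on a tree depend on the genomic region examined.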

Li et al., observed the first plausible case of recombination when an isolate

from the DRC designated MAL was found to have phylogenetic similarities

with other strains that could only be explained by recombination [30]. The

first detected case of recombination occurring between the subtypes B and F

was documented by Sabino et al., [31]. Robertson et al., produced evidence

that frequent recombination in fact had taken place between each of the

subtypes that were known at that time [19]. More recently it has been

suggested that in the early days of the pandemic at least 37% of the viruses

circulating in Central Africa were recombinant forms [25]. Osmanov et al.,

showed that in the year 2000 18% of HIV-1 infections globally were caused

by CRFs [32]. Further evidence that CRFs contribute a major part of the

diversity of the world's HIV-1 pandemic can be found in [33-37].

Many unique recombinant forms (URFs) also exist (http://www.hiv.lanl.gov).

These arise in a similar way to the CRFs when two or more strains undergo

recombination within an individual host to form a mosaic genome. However

in contrast to CRFs they have not spread beyond their initial host [38].

Recent work has shown that globally URFs exist at very high frequencies

[39]; for example, out of nine samples taken from young adults in Southwest

Tanzania, five were URFs between subtypes A and C [40].

It has also been estimated that up to 15% of viral genomes within an

individual who has not been dually infected have been altered by

recombination events that occurred after the initial exposure to the virus [13].

From the number of recombinant forms contributing to the global pandemic

at both the intra- and inter-host level it is clear that recombination contributes

a great deal to the diversity present within the phylogeny of HIV-1 group M

and thus is a major factor that must be considered when

developing any future vaccine.

Figure 1.5 Circulating Recombinant Forms

(A) The genome organisation of CRF03_AB, which was isolated in Kaliningrad and is circulating in

Russian and Ukrainian cities. It is thought that the CRF03_AB epidemic in Kaliningrad

started from a single source arising from a recombination event between the

subtype A strain prevalent among injecting drug users (IDU) in some southern Commonwealth of

Independent States countries and a subtype B strain of unknown origin [41]. (B) The

genome organisation of CRF06_cpx which has been found to be circulating in Senegal,

Mali, Burkina, Ivory Coast, Nigeria, France and Australia [42]. It illustrates how CRFs can

also be formed by recombination events involving other CRFs.

Group O was first classified as a new HIV-1 group by Charneau et al., after

the discovery that a strain isolated from a French woman, designated

HIV-1VAU, was highly divergent from other strains constituting group M [43].

This divergent strain was found to be more closely related to two other

recently discovered divergent strains, HIV-1ant70 [44] and HIV-1MVP5180

[45]. The group has a 30 - 50% sequence divergence from group M

depending on the genes being compared [46]. For unknown reasons the

group has largely remained endemic to Cameroon, where it is responsible for

between 1% of HIV-1 infections in the far north and 6.3% in the capital [47]. Yamaguchi

et al., showed that in the Northwest of the country the prevalence of the group

may be even lower, as it was observed to be responsible for only 0.4% of

HIV-1 infections [48]. Despite the low prevalence, cases have been reported

in Gabon, Equatorial Guinea, Nigeria, Benin, Ivory Coast, Togo, Senegal,

Niger, Chad, Kenya and Zambia, as well as isolated cases in France,

Germany, Belgium, Spain, Norway and the United States, where it is thought

that immigration is responsible for its introduction [49, 50]. There is no

evidence that group O has spread beyond single individuals in these

latter countries.

When group O strains are represented on a phylogenetic tree (Fig. 1.6,

Panel A) they lack the long arm branches reaching into distinctive clusters as

is characteristic of the group M phylogeny. Since the group has been

around for a similar length of time to that of the much more prevalent group

M [50-52] could the group’s phylogeny reflect the extent and pattern of

divergence found within the latter as initial studies would suggest [53, 54]?

After a phylogenetic study involving forty nine group O strains, Roques et al.,

proposed that there were no distinct clades that were equivalent to the

global group M subtypes [50]. Three broad phylogenetic clusters were

observed based on an analysis of gag and env sequences but, unlike the

subtypes of group M, the apparent clustering appeared to be very weak. It

was suggested that the clustering represented the local transmission

network of infected individuals rather than distinctive subtypes.

Yamaguchi et al., however, using a simplified tree representing the gp160

region (Fig. 1.6, Panel B), suggested that there was definite evidence to

support the identification of five possible clades (A, B, C, D and E) and a

further three possible sub clades within clade A (A1, A2 and A3) [55]. They

proposed that group O did have equivalent clades to the group M subtypes.

Further support for four of the five proposed clades (clades A, B, C and E)

was provided in a study analysing twenty three full length HIV-1 group O

isolates [46]. Clade A was seen to have the highest prevalence with 60% of

group O isolates to date falling within it. It was also shown that

13% of the full length genomes examined had evidence of

recombination. This high proportion of recombinant genomes is unexpected

given the low prevalence of group O. For a recombinant form to arise a

dual infection must occur [56], and with the very low numbers of group O

infections [57] this should be an infrequent event – with

the exception of strains involved in local transmission networks. Each of the

recombinant forms found had segments that were from clade A. The

previous clade D was found to be made up of the recombinant forms from

clade A and unclassified group O strains. Formal classification was not

possible as additional full length genomes are required to complete the clade

definition – although according to the authors this is just a matter of time.

To date HIV-1 group N is the rarest form of the virus and has only been seen

in Cameroon [58]. With the addition of strain 02CM-DJO0131 from [59]

there are presently only three near full length sequences available for this

group.

Figure 1.6 HIV-1 Group O

(A) Neighbour joining tree of all the HIV-1 group O gp160 sequences found within the Los

Alamos HIV database. The bold letters correspond to the cluster designations from [55]. *

represents bootstrap values greater than 90%. (B) “Simplified” phylogenetic trees of group O

and group M env gp160 as described in [55]. Bootstrap values are shown for only the major

branches. The inner circle (diameter, 0.045 distance unit) represents the origin from which

the clusters radiate, whereas the outer circle represents the extent of genetic divergence

(diameter, 0.21 distance unit). The circles were constructed for the group M tree and then

superimposed on the group O tree. For group O roman numerals I to V correspond to

clades A to E. (Panel B taken from [55]).

The Origins of HIV

There are at least 36 distinct lentiviruses that infect African primates [60].

With only one exception, isolated from captive rhesus macaques [61], all

have been found within African apes and monkeys [62]. The exception is

thought to have acquired SIV in captivity by cross-species transmission from

a SIV-infected African primate [61]. Five equidistant phylogenetic lineages

based on phylogenetic analysis of full-length pol protein sequences are

described in Hahn et al., [63]. These lentiviruses are referred to as simian

immunodeficiency virus (SIV) and in their natural primate hosts appear to

cause no adverse effects [63]. Viruses from individual primate species have

been observed to be more closely related to each other than to viruses from

a different species. For example, the SIVs from the four sub species of African green

monkeys (Chlorocebus aethiops, C. pygerythrus, C. sabaeus, and C.

tantalus) form monophyletic clusters that each contain strains that are more

closely related to each other than to the SIVs from the other clusters [63-65].

This along with the lack of virulence in the primate host could convincingly

suggest host dependent evolution [63].

However genetic diversity studies indicate that the African Green Monkey

(AGM) clade is millions of years old [66] while the most recent common

ancestor of SIVagm is far younger [67]. A resemblance between host and

pathogen phylogenies could have arisen as a result of preferential host

switching followed by subsequent diversification [68]. In fact Wertheim and

Worobey observed that AGM mitochondrial DNA and SIVagm sequences did

not share a phylogenetic topology that would coincide with long term

co-evolution [69]. Similar patterns of clustering can be observed in the two

subspecies of chimpanzee that harbour these lentiviruses. Viruses isolated from

the chimpanzee Pan troglodytes troglodytes cluster together in the presence

of SIVs isolated from other primate species, whereas the SIV isolated from

Pan troglodytes schweinfurthii falls outside of the P. t. troglodytes cluster [63,

70, 71]. Once again this would suggest that this pattern has arisen due to

preferential host switching [68] and not as a result of long term co-evolution.

The first full length SIV observed to have the same genetic structure as

HIV-1 was presented in [72]. This strain was referred to as SIVcpz-gab. The

organisation of the genome was found to be

5’ gag-pol-vif-vpr-tat-rev-vpu-env-nef 3’. Specifically, the HIV-1 specific vpu gene was present

and the vpx gene, common to HIV-2 and most other SIVs, was absent. In

[73] a second complete genome of a chimpanzee lentivirus, referred to as

SIVcpz-ant, was obtained and found to have a similar genetic organization to

SIVcpz-gab. From protein sequence comparisons SIVcpz-ant was found to

be a closer relation to SIVcpz-gab and to HIV-1 isolates than to members of

the other four major phylogenetic lineages of these primate lentiviruses - but

as an out-group. Sequence identity with SIVcpz-gab was fairly low:

72% (pol), 48% (env) and 25% (vpu). Phylogenetic analysis

revealed the possibility that groups O and M emerged as a result of separate

cross species transmission events [71]. The ancestral host primate species

that gave rise to HIV-1 was still unclear however as there appeared to be

very few chimpanzees naturally infected with SIVcpz [74] and thus there was

the possibility that both chimpanzees and humans could have acquired the

virus from a third reservoir species – for example Gorilla gorilla gorilla [75].

After isolating a third full length SIV strain, SIVcpzUS, and determining, by

mitochondrial DNA analysis the subspecies identity of all known infected

chimpanzees, Gao et al., observed that one subspecies of chimpanzee, P.t.

troglodytes, was the most probable natural ancestral host and reservoir for

HIV-1 [71]. More recent studies [76-78] have confirmed that P. t. troglodytes is

the most probable ancestral host for HIV-1 and that groups M, O and N have

resulted from three separate cross species transmission events (Fig. 1.7).

Keele et al., reported, using newly developed sampling techniques, that up

to 35% of individuals in some communities of wild living P. t. troglodytes

were infected by the virus [77] – a number that is higher than was previously

thought.

In relation to the rare group N, the origins were more obscure as the gag,

pol and the 5’ end of the vif gene were similar to HIV-1 group M while the env,

nef and 3’ end of the vif gene were closely related to SIVcpzUS (Fig. 1.7). It has

been suggested by Gao et al., that the predecessor to this group, first

identified from a strain called YBF30, was created by a recombination event

between two divergent SIVcpz strains within the primate host before its

jump into the human population [71]. Strong evidence in support of this has

subsequently been provided [70, 79-81]. Many SIV strains are in fact

recombinant strains from earlier ancestors [82-84].

Figure 1.7 Cross Species Transmission

Maximum likelihood tree displaying the relationship between the three groups of HIV-1 and

their SIV relatives across two different regions of the genome. Sequences were obtained

from the Los Alamos HIV sequence database. The * represent separate cross-species

transmission events.

The predecessor to group M (SIVcpzptt) appears to have also been a

recombinant strain between two monkey SIVs – those of red-capped mangabeys

(SIVrcm) and greater spot-nosed monkeys (SIVgsn) [60, 85]. Each of these

SIVs displays similarities to SIVcpzptt but at different locations across the

genome (Fig. 1.8, Panels A and B). SIVgsn was the first monkey virus found

to possess the vpu gene and its 3’ half of the genome was observed to be

closely related to SIVcpzptt [86] while the 5’ end of the SIVcpzptt genome

was found to be closely related to SIVrcm [70, 87]. SIVs from both

red-capped mangabeys and greater spot-nosed monkeys could have ended up

within a single chimpanzee host as it is known that chimpanzees hunt

smaller monkeys for food in overlapping geographical locations within

central Africa [88, 89]. Leitner et al., suggested that, after the divergence of P. t. schweinfurthii and

P. t. troglodytes, there is evidence for

subsequent SIV superinfection followed by further recombination events

within both lineages – resulting in the potential for further differences in their

evolutionary histories than was previously thought [90]. Group O strains also

show evidence of ancient recombination events between SIV strains within

their ancestral primate hosts [91].

When a virus enters a new host species successfully for the first time there

can be a wide range of different outcomes ranging from incidental infection

to epidemic spread [63]. Despite group M’s common ancestor existing within

the human population at a similar time to the common ancestor of group O

[3, 50, 52], the prevalence [47, 48] and geographical location [49] of the

latter remains highly limited. Group O’s prevalence has actually been

gradually decreasing [57]. Reasons for group M’s success in relation to

group O are poorly understood but initial studies suggest that group O may

have a reduced replicative and transmission fitness [92]. After a cross

species transmission the viral population must adapt to a new genetic and

immunologic environment [93]. The mechanism of adaptation is in the form

of alterations in its amino acid sequence which result in: (i) altering the

efficiency of cell entry, (ii) blocking interactions with detrimental host proteins

and (iii) promoting escape from the immune system [94].

Figure 1.8 Recombinant origin of SIVcpz

(A) Maximum likelihood phylogenies of primate lentiviruses pol and env sequences. The

close genetic relationship of SIVcpz to SIVrcm from red-capped mangabeys in pol, and to

SIVgsn, SIVmus, and SIVmon from greater spot-nosed, mustached, and mona monkeys in

env, are highlighted in green and magenta, respectively. (B) Schematic diagram of the

genomic organization of SIVcpz. Genomic regions are colored according to their genetic

relationship to SIVrcm (green) or the SIVgsn / SIVmus / SIVmon lineage (magenta). Grey

areas in SIVcpz are of unknown origin. The vpu and vpx genes are highlighted. (Diagram and

legend taken from [60]).

Recent evidence of SIVs adapting to new hosts is presented in [94] where

selection within a rhesus macaque host was found to be acting strongly on

the V2 loop after the primate was inoculated with an SIV strain found within

sooty mangabeys. In HIV-1 a host specific adaptation has been observed to

occur at a site within the p17 region [95]. In SIVcpzptt a methionine (Met) is

present at the site while in each HIV-1 group an arginine (Arg) is present. On

passage of the human virus through chimpanzees the Arg reverted to Met.

SIV and HIV sequences have substantial differences in their major proteins

but the relevance of these differences to the biology of the viruses is poorly

understood [95]. As well as host specific adaptation other reasons why

group M is far more prevalent than group O may include behavioural,

demographic and epidemiological history of transmission [1].

Hahn et al., proposed that the primate ancestral host of HIV-2 was the sooty

mangabey [63]. The different lineages of HIV-2, termed subtypes A–F,

are due to multiple cross-species transfer events from sooty mangabeys

to humans [96]. Geographic evidence also supports the theory that HIV-2

originated from the sooty mangabey as the virus is only common in West

Africa – which is where the habitat for these monkeys is found [96]. The

mode of transmission from sooty mangabeys to humans was thought

to have been hunters becoming infected by the monkeys that they hunt for food

[97].

Recombination

As has already been discussed, recombination occurs frequently within and

between the different HIV-1 groups, subtypes and recombinant forms.

Recombination provides potential benefits that include: the spread of

beneficial mutations throughout the population, increasing the variability

present within the population and aiding in the exploration of fitness or

“adaptive” peaks within regions of unexplored sequence space that would be

otherwise unreachable by a step-by-step series of mutations [98, 99].

Two categories of recombination based on their epidemiological outcome

occur in relation to the HIV genome. These are single infection

recombination and dual infection recombination.

Single infection recombination occurs when recombination takes place

between two viral genomes within a host that has not been super-infected.

For a rapidly evolving genome this type of recombination could play a role in

disease progression within the host [100] as well as in evading the host's

immune response via the generation of new variant strains. Taylor and

Korber estimated that up to 15% of viral genomes within an individual who

has not been superinfected could be a result of intra-host recombination

events [13]. Dual infection recombination occurs when two divergent viral

genomes undergo recombination within a single host cell. Dual infection can

be a result of co-infection or of superinfection. Co-infection is where a

second virus enters the host at about the same time as the primary infection.

Superinfection on the other hand is where re-infection of a host occurs at

some distant point after the primary infection.

The number of CRFs and URFs that are currently in circulation, and indeed

the existence of HIV-1 groups M, N and O, could be viewed as

substantial evidence that dual infection does occur within natural

populations. In addition, there are numerous publications specifically

documenting the occurrence of superinfection involving viral strains from

different subtypes within individuals [101-105] – and even within different

species of lentivirus such as FIVs [106]. The frequency at which

superinfection occurs globally has not been resolved and most probably

depends on a number of factors including the frequency of prevailing strains

within a region, the evolutionary distance between strains, geography and

the demographics of the host population.

Takehisa et al., observed in Cameroon that 3.01% of individuals were

superinfected [107]. One individual was found to be triply infected with HIV-

1 subtypes A and D as well as a group O strain. In Niger, of the 0.87% of

the population found to be HIV positive, 1.5% were infected by more than

one strain of the virus [101]. Hu et al., observed that in Bangkok,

superinfection with CRF01_AE followed 3.9% of subtype B primary

infections while superinfection with subtype B followed 1.5% of CRF01_AE

primary infections. It has been suggested that superinfection between two

closely related viruses is less likely than that of viruses that are more

distantly related [108] – despite the fact that an individual host is probably

more likely to come into contact with a closely related viral strain due to the

epidemiological patterns of the disease. If this is the case then we could

expect to see groupings of closely related viral strains on a phylogenetic tree

along with recombinant strains falling between those groups as the role that

recombination plays in shuffling viral genomes would become more

important with increasing evolutionary distance.

Within an individual host cell the mechanisms of retroviral recombination

have been well studied. Copy-choice recombination is the most supported

model of how recombinant breakpoints arise [56, 109, 110]. In this model

(Fig. 1.9, Panel A), if reverse transcriptase becomes stalled on the donor

RNA (red) during reverse transcription, nascent DNA can hybridize to the

acceptor RNA (blue). The reverse transcriptase complex can then

dissociate from the donor RNA and rebind to the acceptor RNA. Reverse

transcription will then continue using the new template. Some plausible

alternatives to this model are presented in [111] but all involve strand

switching following a stalling of the reverse transcription complex.
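
The outcome of the copy-choice model can be sketched computationally as a template switch at given positions. The example below is a deliberately simplified illustration and not a biochemical simulation: it ignores strand complementarity and the direction of synthesis, and simply records which template was copied at each aligned position. The function name, the toy templates and the switch position are hypothetical.

```python
def copy_choice(donor, acceptor, switch_positions):
    """Build a mosaic product by copying from the donor template and
    switching to the other template at each position in switch_positions.

    donor and acceptor are aligned templates of equal length.
    """
    if len(donor) != len(acceptor):
        raise ValueError("templates must be aligned to the same length")
    templates = (donor, acceptor)
    current = 0  # copying starts on the donor template
    switches = set(switch_positions)
    product = []
    for position in range(len(donor)):
        if position in switches:
            current = 1 - current  # stalled complex moves to the other template
        product.append(templates[current][position])
    return "".join(product)


# Toy templates (hypothetical) with a single switch at position 4.
donor = "AAAAAAAAAA"
acceptor = "GGGGGGGGGG"
recombinant = copy_choice(donor, acceptor, switch_positions=[4])
print(recombinant)  # AAAAGGGGGG
```

A single switch produces one breakpoint; supplying several positions alternates between the two templates and yields a product with several breakpoints.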

It has been observed that breakpoint positions are distributed across the

gp120 region non-randomly, in the absence of natural selection (Fig. 1.9,

Panel B) [112]. This non random distribution of breakpoints has been

observed in vivo across the entire HIV-1 genome [113, 114]. Positioning of

breakpoints is believed to be influenced by a number of mechanistic factors

including high sequence identity [112, 115], secondary RNA structure [109,

116] and the location of runs of identical nucleotides referred to as

homopolymeric stretches (HPSs) [112, 117]. The latter two increase the

probability of a breakpoint occurring by stalling the reverse transcriptase

complex during DNA synthesis [118-120], which in turn promotes the

induction of strand switching within regions of high sequence identity.
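
Runs of identical nucleotides of the kind described above can be located with a short regular expression. This is a minimal sketch: the minimum run length of four is an arbitrary assumption made for illustration and is not a threshold taken from [112, 117], and the function name and example sequence are hypothetical.

```python
import re


def homopolymeric_stretches(sequence, min_length=4):
    """Return (start, end, nucleotide) for every run of one repeated
    character that is at least min_length long. end is exclusive."""
    if min_length < 2:
        raise ValueError("min_length must be at least 2")
    # (.) captures one character; \1{n,} requires n or more further copies.
    pattern = re.compile(r"(.)\1{%d,}" % (min_length - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(sequence)]


# Toy sequence (hypothetical) containing a run of five T and a run of four A.
runs = homopolymeric_stretches("ACGTTTTTGAAAACCG")
print(runs)  # [(3, 8, 'T'), (9, 13, 'A')]
```

Positions reported in this way could then be compared with observed breakpoint locations, although any such comparison would also need to account for sequence identity and RNA secondary structure.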

Figure 1.9 Non Random Recombination

(A) Model of copy-choice recombination. The dotted red line represents the growing DNA

strand. The full red line represents the donor RNA. Blue represents the acceptor RNA

sequence. The green dot is some mechanistic feature, such as a homopolymeric run, on

the donor strand that can potentially stall the reverse transcriptase complex (yellow). (B)

Distribution of inter-subtype and (C) intra-subtype breakpoints along gp120. The

constant regions are shaded dark grey. The different parental pairs used are shown on the

left. Black numbers give the number of breakpoints identified within each region, and grey

triangles their approximate position. The total number of recombinants analysed for each

pair is given on the right, together with the P-values (in grey) for Chi-square tests for a

random distribution. (Panels B and C taken from [112]).

Unsurprisingly, not all recombinant strains generated within the host cell are

viable. Preliminary findings [121] suggest that in an in vitro multiple cycle

system over 75% of HIV-1 recombinants generated in a dual infection do not

have the ability to replicate. Additionally, if copy choice in vivo does result in

a recombinant genome with the ability to replicate, that mosaic genome will

have to survive selective pressure from the host’s immune response.

Differentiating breakpoints that are under the influence of this

selective pressure from those that are generated based on the

characteristics of the parental sequences is vital to the understanding of

HIV’s evolution in relation to drug resistance [122, 123], generation of

escape mutants [124, 125], disease progression [126] and diversity [19, 25,

32].

From an evolutionary perspective breakpoints that are generated solely

based on the characteristics of the parental sequences and that survive

simply because they are in regions with very weak selective pressure are

fairly unimportant in relation to the virus’s survival within the global

pandemic. Comparing the locations of breakpoints present within viable

recombinant sequences derived from established strains contributing to the

global pandemic [113] to a mechanistic probability of the distribution of

breakpoints in the absence of natural selection would provide an insight into

the selection pressures placed on the HIV genome. This would inevitably

help to identify regions of the genome that have a significant role in the

spread of the virus. This will be the subject of chapter 5.
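
A comparison of observed breakpoint counts with an expected distribution can be sketched as a chi-square goodness-of-fit statistic. The example below is illustrative only: the region lengths and breakpoint counts are invented, and the null model assumed here (breakpoints falling uniformly per site, so that the expectation is proportional to region length) is a stand-in for whatever mechanistic probability is actually used. Only the test statistic and degrees of freedom are computed, since a P-value would require a chi-square distribution function from a statistics library.

```python
def expected_probabilities(region_lengths):
    """Probability of a breakpoint falling in each region if breakpoints
    were distributed uniformly per site (an assumed null model)."""
    total = sum(region_lengths)
    return [length / total for length in region_lengths]


def chi_square_statistic(observed, probabilities):
    """Pearson chi-square statistic and degrees of freedom for observed
    counts against expected probabilities."""
    if len(observed) != len(probabilities):
        raise ValueError("observed and probabilities must have the same length")
    n = sum(observed)
    statistic = sum(
        (count - n * p) ** 2 / (n * p) for count, p in zip(observed, probabilities)
    )
    return statistic, len(observed) - 1


# Hypothetical data: three genomic regions and the breakpoints seen in each.
region_lengths = [100, 100, 200]
observed_breakpoints = [30, 5, 5]

probabilities = expected_probabilities(region_lengths)
statistic, degrees_of_freedom = chi_square_statistic(
    observed_breakpoints, probabilities
)
print(statistic, degrees_of_freedom)  # 53.75 2
```

With two degrees of freedom the 5% critical value is approximately 5.99, so counts as skewed as these toy values would be inconsistent with the assumed null distribution.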

Dealing with Global HIV-1 Diversity

The rapid ability of HIV-1 to generate extensive diversity leads to persistent

infection within the host despite the actions of the immune response [127-

129]. The amount of diversity complicates the choice of candidates for a

potential vaccine, whether it targets an individual subtype or is cross reactive

between multiple subtypes [130]. In other rapidly evolving RNA virus

genomes, such as influenza [17], a less than 2% change in the amino acid

sequence can cause failure in the cross-reactivity of the polyclonal response

to a vaccine.

Within an infected individual, both HIV neutralizing antibodies and cytotoxic

T cells are generated. Thus there are currently two main approaches to

producing a vaccine: those based around the B-cell response and those that

aim to stimulate the T-cell response. Traditionally, vaccines based around

the B-cell response have been more successful although an effective HIV-1

vaccine has not been produced [131]. More recently there is much

discussion of the possibility of a T-cell based vaccine. It has been observed

that in humans the cytotoxic T lymphocyte (CTL) response plays an

important role in controlling viremia during the course of an HIV-1 infection

[132-135] while in animals the CTL response has an important role in

controlling viremia in relation to SIV infection [27]. It has also been observed

that there is a strong correlation between T cell responses and CD4+ T cell

count [136].

Each T cell has many copies of an identical epitope specific T cell receptor

that recognizes fragments of peptides (epitopes) that are complexed with a

major histocompatibility complex (MHC) protein on the target cell membrane.

Epitopes are formed at random by intra cellular proteolytic processing by

proteasomes. In a non infected cell these will be self-peptides while in an

infected cell a proportion will be viral peptides. T lymphocytes themselves

fall into two categories: those expressing the CD8 receptor (Tc Cells) and

those expressing the CD4 receptor (Th Cells). CD8+ Tc cells recognize

epitope bound class I MHC proteins while CD4+ Th cells recognize epitope

bound class II MHC proteins. Class I MHC proteins bind with peptides of

between 8 and 10 amino acids in length that are derived from intra cellular

peptides. Class II MHC proteins bind with peptides of between 17 and 22

amino acids in length that are derived from peptides that have been imported

into the host cell. Binding will result in the CTL’s clonal expansion and

differentiation to become a functional T-effector cell. All cells of the body

have MHC I proteins and are thus under constant surveillance by CD8+ T

cells whose main function after activation is to destroy host cells that are

presenting viral peptides. The major function of activated CD4+ T cells is to

regulate the activation of all T and B lymphocytes. The immune response

depends on this help which is why their count can be used to track disease

progression.

The aim of producing a vaccine based on the T cell response is to produce a

polyvalent CD8+ T-cell based vaccine that is broadly immunogenic in

relation to the diversity present. Epitopes used by CD8+ T-cells are on

average nine residues in length and so will be hereafter referred to as

“nine-mers”. A list of known epitopes, many of which have been characterized in [137], is maintained at the Los Alamos HIV

sequence database. The

list however is far from complete and basing vaccine design strategies solely

around it could be largely ineffective - although incorporation of these

epitopes into existing strategies could improve results.

Traditionally the main approaches to improving immunogenic

coverage have focused on the use of consensus sequences, ancestral

sequences, center of the tree (COT) sequences and geographically

clustered sequences [28]. A consensus sequence (Fig. 1.10, yellow) is a

sequence that at each site contains the character that was most common

amongst the aligned sequences that it was generated to represent. This

approach for generating an artificial sequence implicitly minimizes the

genetic distance between itself and circulating strains of the virus. The COT

(Fig. 1.10, circle) sequence represents the point on a phylogenetic tree

where the average evolutionary distance to each tip is minimized [138].

COT thus provides coverage that is at least similar to the consensus or

ancestral sequences and it has the added benefit of being implicitly more

similar than the ancestral sequence to viral lineages that have evolved more

rapidly and so may be more useful in relation to asymmetric trees. Use of

such sequences in relation to vaccine design has generated some promising

results. For example, recently in [27] a group M env consensus strain (CON6),

based on sequences available from the Los Alamos HIV database in 1999,

was created and found to induce a greater number of T-cell epitope

responses than any single wild-type strain (from subtypes A, B and C).
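
The definition of a consensus sequence given above translates directly into a column-wise majority vote over an alignment. The following is a minimal sketch, assuming an ungapped alignment of equal-length sequences. The alphabetical tie-break is an arbitrary assumption added to make the result deterministic, and the toy alignment is hypothetical; this is not the procedure used to construct CON6 in [27].

```python
from collections import Counter


def consensus(alignment):
    """Return the sequence holding the most common character at each site.

    Ties are broken alphabetically (an arbitrary choice for determinism).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        highest = max(counts.values())
        result.append(min(ch for ch, n in counts.items() if n == highest))
    return "".join(result)


# Toy protein alignment (hypothetical), one substitution per variant sequence.
toy_alignment = [
    "MGARASVLS",
    "MGARASILS",
    "MGSRASVLS",
    "MGARTSVLS",
]
print(consensus(toy_alignment))  # MGARASVLS
```

Because each site is decided independently, the resulting sequence need not correspond to any strain that has actually existed, which is consistent with the consensus not necessarily falling on a branch of the phylogeny.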

A more recent approach involves the use of algorithms to improve nine-mer

coverage across an alignment in the hope that some of these high-coverage

nine-mers will be viable epitopes that can be targeted by the host's immune

response [139, 140]. Generally, coverage describes the number of times that

an individual nine-mer exists within a sequence alignment independent of its

location. The local coverage score provided by an individual nine-mer within

a sequence is the coverage that is provided across all of the sequences

within an alignment at the site where the nine-mer occurs. The mean local

coverage provided by the sequence is the average of all the local coverage

scores provided by each nine-mer across the sequence.
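
The three quantities defined above can be sketched as follows. The sketch assumes an ungapped alignment of equal-length strings and reports local coverage as a fraction of sequences rather than a raw count; the function names and the toy data are hypothetical and do not reproduce the algorithms of [139, 140].

```python
from collections import Counter

K = 9  # nine-mer length used in the text


def kmers(seq, k=K):
    """All overlapping k-mers of `seq` paired with their 0-based start site."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def general_coverage(alignment, k=K):
    """Count every k-mer across the alignment, ignoring where it occurs."""
    counts = Counter()
    for seq in alignment:
        counts.update(kmer for _, kmer in kmers(seq, k))
    return counts


def local_coverage(alignment, kmer, site):
    """Fraction of sequences that carry `kmer` at exactly `site`."""
    k = len(kmer)
    hits = sum(1 for seq in alignment if seq[site:site + k] == kmer)
    return hits / len(alignment)


def mean_local_coverage(alignment, seq, k=K):
    """Average of the local coverage scores of every k-mer in `seq`."""
    scores = [local_coverage(alignment, kmer, i) for i, kmer in kmers(seq, k)]
    return sum(scores) / len(scores)


# Hypothetical toy alignment; k=3 keeps the example readable.
toy_alignment = ["AAAC", "AAAC", "AAAG"]
print(general_coverage(toy_alignment, k=3))
print(local_coverage(toy_alignment, "AAC", site=1))
print(mean_local_coverage(toy_alignment, "AAAC", k=3))
```

Passing `k` explicitly makes it easy to check the behaviour on short toy strings before moving to nine-mers.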

The general definition of coverage, although currently in use [139, 140],

could potentially favor the use of nine-mers with poor local coverage scores

over those with high local coverage scores. For example, across a given

alignment there could exist a unique nine-mer permutation that is only

present at a single site. This nine-mer could provide a very high local

coverage score. Across other sites, numerous instances of another unique

nine-mer permutation may occur. However, these may all provide very poor

local coverage scores. When combined, the coverage score of the latter

could outweigh the coverage score provided by the instance of the single

high scoring nine-mer. Optimizing the local coverage scores, instead of

selecting nine-mers based on overall coverage score, avoids this situation as

the favorable nine-mer would always be chosen over multiple occurrences of

an unfavorable one. This is the aim of chapter 6 (Fig. 6.5).
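
The situation described above can be made concrete with hypothetical numbers, chosen purely for illustration and not taken from any dataset. Suppose an alignment of 10 sequences in which nine-mer a occurs at one site in 9 sequences, while nine-mer b occurs at 5 different sites, each time in only 2 sequences. Writing C for general coverage and ℓ for local coverage expressed as a fraction of sequences (both symbols are introduced here only for this example):

```latex
\begin{align*}
C(a) &= 1 \times 9 = 9,  & \ell(a) &= \tfrac{9}{10} = 0.9, \\
C(b) &= 5 \times 2 = 10, & \ell(b) &= \tfrac{2}{10} = 0.2 \quad \text{at each site}.
\end{align*}
```

Here C(b) > C(a), so ranking by general coverage would prefer b, even though a is shared by far more sequences at the site where it occurs.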

Figure 1.10 Four Phylogenetic shapes

Four possible phylogenetic shapes and the resulting reconstructed sequences. The

Ancestor (Anc) and the Center of the Tree (COT) can only fall on an evolutionary path (i.e.,

on a branch of the phylogeny), whereas the consensus (Con) may not. (Figure and legend

taken from [138]).

Co-receptor Usage

HIV-1 viruses can be characterized into two phenotypes referred to as

syncytium inducing (SI) and non-syncytium inducing (NSI) [141]. These

phenotypes have different cellular tropisms and appear during different

stages of infection [142]. The macrophage tropic NSI phenotype is

predominant during the early stages of infection [143] while the T cell tropic

SI phenotype is often observed later, around the time of

progression to AIDS [144]. The gp120 gene can be divided into alternating

constant and variable regions referred to as C1 – C5 and V1 – V5 [145].

The variable regions encode exposed surface loops [146]. Amino acid

sequence variations giving rise to a more positive charge within gp120 at

sites 306 and 320 (sites 11 and 25 of the V3 loop) have been strongly

associated with the more virulent SI phenotype [147-149].

With the discovery of the involvement of primary co-receptors, CXCR4 and

CCR5, in cellular tropism [150, 151] and the mapping of their usage to the SI

and NSI phenotypes respectively [152, 153], the SI and NSI phenotypes are

now often referred to as X4 (CXCR4 tropic), R5 (CCR5 tropic) and R5X4

(viruses able to utilize both receptors) [151, 154]. The early dominance of

the CCR5-using phenotype may be due to a number of factors including

selection at the point of transmission [155-157], a higher cytopathicity of SI

variants resulting in host cells with a shorter life span [158] as well as

differing fitness levels of the two variants at different stages of disease

progression [159]. During the progression of infection within the host the

mechanisms that drive the change from the CCR5-using phenotype to the

CXCR4-using phenotype are unclear, but a number of theories including the

transmission-mutation hypothesis, target-cell-based hypothesis and the

immune system-based hypothesis have been proposed [160].

These co-receptors provide a potentially useful and novel target for

antiretroviral therapy [161] and could help to overcome some of the

problems associated with highly active antiretroviral therapy (HAART), such

as patient adherence to treatment [162] and the emergence of

drug-resistant variants [163, 164]. A number of small positively charged

compounds, such as T22 and AMD3100, have been observed to block

CXCR4-tropic HIV-1 cell entry [165-167]. Inhibiting CXCR4 usage in this

way could theoretically prove useful during the later stages of infection to

slow down or potentially stall disease progression. However, CXCR4 is

critical for haematopoiesis, cardiac function and cerebellar development

[168] and so using it as a target could prove difficult due to severe side

effects [169].

Targeting the CCR5 co-receptor is more promising, as a natural

polymorphism (CCR5Δ32), with no apparent deleterious effects, exists within

humans, resulting in either reduced levels or complete absence of the co-receptor

depending on whether an individual is heterozygous or homozygous for the

mutation [170]. Individuals that are heterozygous for this mutation were

found to have a slower disease progression to AIDS, while individuals that

are homozygous for the mutation show strong resistance to HIV-1 [170-174].

Maraviroc is a potent, orally bioavailable small molecule developed by Pfizer

that binds to the CCR5 receptor, making it unavailable for HIV-1 cell entry

[175, 176] and thus simulating the CCR5Δ32 phenotype. However, in trials the

presence of the CXCR4-using phenotype before treatment with Maraviroc is

predictive of failure, as this phenotype will emerge once treatment begins

[177]. Determining the viral phenotypes present within an individual is thus

of major importance in determining the probability of success.

Three common genomic approaches to phenotype testing are based on

sequence variation within the V3 loop. The first involves looking for the

presence of positively charged amino acid residues, histidine (H), lysine (K)

and arginine (R), at sites 11 or 24 and 25. These sites come into contact with

each other on the surface of the tertiary protein structure to form an exposed

charged patch [178]. When no positive residues are present, it is thought

that the electrostatic interaction with the CCR5 co-receptor is stabilized [178,

179]. This test has traditionally been referred to as the charge rule and it

demonstrates a 94% accuracy in determining viral phenotype on sequences

of known phenotype [178].
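
A minimal sketch of the charge rule as described above is given below. It assumes the input is an ungapped V3 amino acid sequence whose positions correspond to the standard V3 numbering, and it reads "sites 11 or 24 and 25" as meaning that a positive residue at any one of the three sites indicates the CXCR4-using phenotype. That reading, the function name `charge_rule` and the two example strings are assumptions of this sketch rather than details taken from [178].

```python
POSITIVE = set("HKR")  # histidine, lysine and arginine, as listed in the text


def charge_rule(v3, sites=(11, 24, 25)):
    """Predict co-receptor usage from a V3 loop amino acid sequence.

    `sites` holds 1-based V3 positions. Returns "CXCR4" if any of them
    carries a positively charged residue, otherwise "CCR5".
    """
    v3 = v3.upper()
    if len(v3) < max(sites):
        raise ValueError("V3 sequence too short for the requested sites")
    if any(v3[pos - 1] in POSITIVE for pos in sites):
        return "CXCR4"
    return "CCR5"


# Illustrative 35-residue V3-like strings (hypothetical inputs).
no_positive_residue = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"   # S, G, E at 11, 24, 25
arginine_at_site_11 = "CTRPNNNTRKRIHIGPGRAFYTTGEIIGDIRQAHC"   # R at site 11

print(charge_rule(no_positive_residue))   # prints "CCR5"
print(charge_rule(arginine_at_site_11))   # prints "CXCR4"
```

Keeping the sites as a parameter allows the same function to express the narrower 11/25 variant of the rule, for example `charge_rule(seq, sites=(11, 25))`.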

A second method of determining phenotype involves looking at the net

charge across the entire V3 region using position-specific scoring matrices

(PSSM) [180]. Two reference groups of aligned sequences, each made up

of V3 subtype B sequences with a specific phenotype, are compared to an

input V3 sequence. A matrix of log likelihood ratio scores for each site of the

input sequence is generated which reflects the difference in abundance of a

particular amino acid within the CXCR4-using dataset to that within the

CCR5 dataset. For the input sequence, the log likelihood ratio for the

particular amino acid at each site is then read from the matrix. The PSSM

score for the input sequence is the sum of the scores at each site. This

method of predicting phenotype provided a sensitivity of 84% and a

specificity of 96% in detecting the CXCR4-using phenotype when performed

on V3 sequences with known phenotype [180]. A web interface for the

method is available at:

http://ubik.microbiol.washington.edu/computing/pssm/. Recently, reference

datasets for subtype C have been added to allow for subtype C phenotype

predictions [181]. For the latter a specificity of 94% and a sensitivity of 75%

were achieved on sequences of known phenotype.
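
The scoring scheme described above can be sketched in a few lines. This is an illustrative reconstruction from the description only, not the implementation behind [180]: the pseudocount used to avoid division by zero, the use of natural logarithms, the function names and the two-residue toy reference groups are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(group, pseudocount=1.0):
    """Per-site amino acid frequencies for a list of aligned sequences."""
    freqs = []
    for column in zip(*group):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (counts[aa] + pseudocount) / total
                      for aa in AMINO_ACIDS})
    return freqs


def build_pssm(x4_group, r5_group, pseudocount=1.0):
    """Matrix of log likelihood ratios, log(P(aa | X4) / P(aa | R5)), per site."""
    x4 = site_frequencies(x4_group, pseudocount)
    r5 = site_frequencies(r5_group, pseudocount)
    return [{aa: math.log(fx[aa] / fr[aa]) for aa in AMINO_ACIDS}
            for fx, fr in zip(x4, r5)]


def pssm_score(pssm, query):
    """Sum the log likelihood ratios read from the matrix for `query`."""
    if len(query) != len(pssm):
        raise ValueError("query must match the alignment length")
    return sum(site[aa] for site, aa in zip(pssm, query))


# Hypothetical two-site reference groups, far smaller than real datasets.
x4_reference = ["KR", "KR", "RR"]
r5_reference = ["SG", "SG", "SE"]
pssm = build_pssm(x4_reference, r5_reference)

print(pssm_score(pssm, "KR"))  # positive: resembles the CXCR4-using group
print(pssm_score(pssm, "SG"))  # negative: resembles the CCR5-using group
```

In this sketch a higher score indicates greater similarity to the CXCR4-using reference group; the score cutoff used to call a phenotype is not specified here and would have to be chosen from sequences of known phenotype.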

A third common method of genotypically determining phenotype is referred

to as geno2pheno [182]. Here CXCR4 and CCR5 training datasets are

again used in a position comparison, but factors such as the occurrence of

specific amino acids and glycosylation sites are also taken into

account in addition to charge. The data are then given to a Support Vector Machine

in order to train the prediction model. A query V3 sequence can then be

aligned and a prediction given using the learned model. An accuracy of

71.8% is achieved when predicting the CXCR4-using phenotype with this

method [183].
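
The kinds of sequence features mentioned above (charge and glycosylation sites) can be derived from a V3 sequence as sketched below. This is not the feature set of geno2pheno [182]; it is a hypothetical illustration in which positive charge follows the residues listed earlier in this section (H, K and R), negative charge is taken to be D and E, and potential N-linked glycosylation sites are found with the standard N-X-S/T sequon (X being any residue except proline). Training the Support Vector Machine itself is outside the scope of the sketch.

```python
import re

POSITIVE = set("HKR")  # as listed for the charge rule in this section
NEGATIVE = set("DE")   # aspartate and glutamate (assumption of this sketch)

# N-linked glycosylation sequon: N, any residue except P, then S or T.
# The lookahead lets overlapping candidates be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def net_charge(seq):
    """Number of positive residues minus number of negative residues."""
    return sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)


def glycosylation_sites(seq):
    """0-based start positions of potential N-linked glycosylation sequons."""
    return [match.start() for match in SEQUON.finditer(seq)]


def v3_features(seq):
    """Collect simple per-sequence features into a dictionary."""
    seq = seq.upper()
    return {
        "net_charge": net_charge(seq),
        "n_glycosylation_sites": len(glycosylation_sites(seq)),
    }


# Illustrative 35-residue V3-like string (hypothetical input).
example = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
print(v3_features(example))
```

A feature dictionary of this kind would be converted to a numeric vector, one per training sequence, before being passed to a classifier.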

Although the charge rule works well, the accuracy and nature of geno2pheno

and the PSSM test suggest that there may be other sites within the V3

region that contribute to co-receptor tropism. A proposed advantage of the

latter two methods over the charge rule is that they may provide an opportunity for early

detection of the CXCR4-using phenotype, as more subtle alterations within

the V3 sequence can be detected. However, a disadvantage is that full-length

V3 sequences are required and, in light of new sequencing

technologies, such full-length V3 sequences are not always provided [184].

Pyrosequencing is an alternative technique, in relation to the more traditional

Sanger sequencing [185], for sequencing DNA [186]. It is a non-gel-based

sequencing technology that is based on the detection of nucleotide

incorporation (Fig. 1.11). Nucleotides are added via primer-directed

polymerase extension. Initially the DNA fragment of interest is incubated

with four enzymes: DNA polymerase, ATP sulfurylase, firefly luciferase, and

a nucleotide-degrading enzyme. The four nucleotide bases are then added

and removed in a cyclic order. If a nucleotide is complementary to the

next base in the template and a base pair is created by DNA polymerase,

pyrophosphate (PPi) is released. This is then converted to ATP by ATP

sulfurylase, which is in turn used by firefly luciferase to generate detectable

light. The nucleotide-degrading enzyme is used to rapidly clear nucleotides

that have not been incorporated into the growing sequence. A modification

of this technology [184] can be used to rapidly produce unprecedented

quantities (in the order of hundreds of thousands of reads) of accurate [187]

genomic sequence data (Chapter 7).
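
The cyclic dispensation described above can be mimicked with a small idealized simulation. It assumes error-free chemistry, a light signal exactly proportional to the number of bases incorporated per dispensation, and a template written in the order in which the polymerase reads it; the dispensation order `ACGT`, the function names and the toy template are assumptions of this sketch.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pyrogram(template, dispensation="ACGT", max_cycles=100):
    """Simulate idealized light signals for a single-stranded template.

    Returns a list of (nucleotide, signal) pairs, where signal is the number
    of bases incorporated for that dispensation (0 means no light).
    """
    pos = 0
    flows = []
    for _ in range(max_cycles):
        for nucleotide in dispensation:
            incorporated = 0
            # Extension continues while the dispensed nucleotide pairs with
            # the next template base, so homopolymers give a larger signal.
            while pos < len(template) and COMPLEMENT[template[pos]] == nucleotide:
                incorporated += 1
                pos += 1
            flows.append((nucleotide, incorporated))
            if pos == len(template):
                return flows
    return flows


def read_sequence(flows):
    """Rebuild the synthesized strand from the flow signals."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


# Hypothetical five-base template.
flows = pyrogram("TTGCA")
print(flows)                 # prints [('A', 2), ('C', 1), ('G', 1), ('T', 1)]
print(read_sequence(flows))  # prints "AACGT"
```

The signal of 2 for the first dispensation shows how a homopolymer run in the template is read as a single, stronger flash rather than as two separate events.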

Figure 1.11 How Pyrosequencing Works

See text for details. Modified from [188].

Such “ultra-deep” sequencing in conjunction with genomic phenotype testing

permits the quantification of the range of sequence variants present within

an individual HIV-1 sample [189]. This can be used to track the emergence

of low frequency viral variants [188, 190]. Identification of low-frequency

variants is of particular importance in the context of co-receptor usage

(specifically CCR5 versus CXCR4-using phenotypes) where specific amino

acids are associated with receptor usage. This is particularly relevant in

relation to Pfizer’s new antiviral drug, Maraviroc [177]. However, currently

there are few standard automated protocols for handling and analyzing the

large numbers of short (100–200 bp) segments produced by this technology

[191, 192] and these depend on a large amount of computational power and

storage facilities. The development of such a protocol will be the subject of

chapter 7.
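
The quantification of variants within a sample can be sketched as a simple tally over reads. This is a hypothetical illustration and not the protocol developed in chapter 7: it assumes the reads have already been trimmed to the same region, and the frequency ceiling (`max_freq`) and minimum read count (`min_count`, a crude guard against isolated sequencing errors) are arbitrary values chosen for the example.

```python
from collections import Counter


def variant_frequencies(reads):
    """Map each distinct read to its relative frequency in the sample."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}


def low_frequency_variants(reads, max_freq=0.05, min_count=2):
    """Variants seen at least `min_count` times but at most `max_freq` of reads."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {variant: n / total
            for variant, n in counts.items()
            if n >= min_count and n / total <= max_freq}


# Hypothetical sample: 97 reads of one variant and 3 reads of another.
reads = ["SGE"] * 97 + ["RGE"] * 3
print(variant_frequencies(reads))
print(low_frequency_variants(reads))
```

Each minority variant reported in this way could then be passed to one of the genotypic phenotype tests described earlier in this section.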

Remaining Chapters

The remaining chapters of this thesis, with the exception of chapters 2 and 8,

will discuss in detail the topics introduced within this introduction chapter. An

overview of each of these chapters can be found within their accompanying

abstracts. Publication details are as follows:

Chapter 3: CTree - comparison of clusters between phylogenetic trees made easy

John Archer and David L. Robertson

Bioinformatics. (2007) 23(21):2952-3

JA wrote the program and prepared the manuscript. DLR

revised the manuscript and guided the program's development.

Both JA and DLR devised the project.


Chapter 4: Understanding the diversification of HIV-1 groups M and O

John Archer and David L. Robertson

AIDS. 2007 Aug 20;21(13):1693-700.

JA performed the analysis and wrote the initial draft of the

manuscript. DLR revised the manuscript and guided the

analysis. Both JA and DLR devised the project.

Chapter 5: Prediction of HIV-1 Recombination Breakpoint Location

PLOS Computational Biology (accepted pending minor

alterations to the text)

John Archer, John W. Pinney, Etienne Simon-Loriere, Jun

Fan, Eric J. Arts, Matteo Negroni and David L. Robertson

JA performed the analysis and prepared the initial manuscript

under the guidance of DLR. JP revised the manuscript. ESL,

JF, EJA and MN provided data for the analysis. Both JA and

DLR devised the project.

Chapter 6: A Strategy for Identifying Significant HIV Sequence Diversity Using Structural Constraints

PLOS Computational Biology (submitted)

John Archer, Simon G. Williams, Simon C. Lovell, David L.

Robertson

JA developed and implemented the algorithm for generating

sequence constructs from the input alignments. JA also

performed the coverage analysis for the paper. SW developed

the model for reducing sequenced diversity with the

alignments. SL and DLR guided the project and revised the

manuscript.

Chapter 7: Detection of Low Frequency CXCR4-Using HIV-1 Strains with Ultra-Deep Pyrosequencing

(In preparation)

John Archer, Marilyn Lewis, David L. Robertson

JA developed and implemented the software for analyzing the

data. JA wrote the manuscript. ML provided the data and

helpful discussion. DLR revised the manuscript and guided the

project. Both JA and DLR devised the protocol for detecting

low frequency variants within a viral population using the

pyrosequenced data.

Chapter 2 is an introduction to the bioinformatics procedures, such as

sequence alignments, models of sequence evolution and phylogenetic trees,

that will be used throughout the thesis. Chapter 8 is the concluding chapter,

where the research chapters are tied together and where some final

comments about the evolution of HIV-1 are presented.

References

1. Grenfell, B.T., et al., Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 2004. 303(5656): p. 327-32.

2. UNAIDS, AIDS Epidemic Update 2007. 2007.

3. Korber, B., et al., Timing the ancestor of the HIV-1 pandemic strains. Science, 2000. 288(5472): p. 1789-96.

4. Foley, B., An Overview of the molecular phylogeny of lentiviruses. 2000, HIV sequence compendium.

5. Coffin, J.M., S.H. Hughes, and H.E. Varmus, Retroviruses. 1997(1).

6. Wei, X., et al., Viral dynamics in human immunodeficiency virus type 1 infection. Nature, 1995. 373(6510): p. 117-22.

7. Wolinsky, S.M., et al., Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science, 1996. 272(5261): p. 537-42.

8. Malim, M.H. and M. Emerman, HIV-1 sequence variation: drift, shift, and attenuation. Cell, 2001. 104(4): p. 469-72.

9. Jetzt, A.E., et al., High rate of recombination throughout the human immunodeficiency virus type 1 genome. J Virol, 2000. 74(3): p. 1234-40.

10. Robertson, D.L., B.H. Hahn, and P.M. Sharp, Recombination in AIDS viruses. J Mol Evol, 1995. 40(3): p. 249-59.

11. Yang, W., J.P. Bielawski, and Z. Yang, Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. J Mol Evol, 2003. 57(2): p. 212-21.

12. Choisy, M., et al., Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol, 2004. 78(4): p. 1962-70.

13. Taylor, J.E. and B.T. Korber, HIV-1 intra-subtype superinfection rates: estimates using a structured coalescent with recombination. Infect Genet Evol, 2005. 5(1): p. 85-95.

14. Archer, J. and D.L. Robertson, Understanding the diversification of HIV-1 groups M and O. Aids, 2007. 21(13): p. 1693-700.

15. Williams, S.G., J. Archer, S.C. Lovell, and D.L. Robertson, A rational strategy for HIV vaccine design based on the prediction of sequence evolution. PLoS Comput Biol, 2008 (accepted pending alterations).

16. Robertson, D.L., et al., HIV-1 nomenclature proposal. Science, 2000. 288(5463): p. 55-6.

17. Korber, B., et al., Evolutionary and immunological implications of contemporary HIV-1 variation. Br Med Bull, 2001. 58: p. 19-42.

18. Rambaut, A., et al., Human immunodeficiency virus. Phylogeny and the origin of HIV-1. Nature, 2001. 410(6832): p. 1047-8.

19. Robertson, D.L., et al., Recombination in HIV-1. Nature, 1995. 374(6518): p. 124-6.

20. Abecasis, A.B., et al., Recombination confounds the early evolutionary history of human immunodeficiency virus type 1: subtype G is a circulating recombinant form. J Virol, 2007. 81(16): p. 8543-51.

21. Thomson, M.M., L. Perez-Alvarez, and R. Najera, Molecular epidemiology of HIV-1 genetic forms and its significance for vaccine development and therapy. Lancet Infect Dis, 2002. 2(8): p. 461-71.

22. Janssens, W., A. Buve, and J.N. Nkengasong, The puzzle of HIV-1 subtypes in Africa. Aids, 1997. 11(6): p. 705-12.

23. Vidal, N., et al., Unprecedented degree of human immunodeficiency virus type 1 (HIV-1) group M genetic diversity in the Democratic Republic of Congo suggests that the HIV-1 pandemic originated in Central Africa. J Virol, 2000. 74(22): p. 10498-507.

24. Gilbert, M.T., et al., The emergence of HIV/AIDS in the Americas and beyond. Proc Natl Acad Sci U S A, 2007. 104(47): p. 18566-70.

25. Kalish, M.L., et al., Recombinant viruses and early global HIV-1 epidemic. Emerg Infect Dis, 2004. 10(7): p. 1227-34.

26. Worobey, M., The Origins and Diversification of HIV. Global HIV/AIDS Medicine, 2007: p. 13–21.

27. Weaver, E.A., et al., Cross-subtype T-cell immune responses induced by a human immunodeficiency virus type 1 group M consensus env immunogen. J Virol, 2006. 80(14): p. 6745-56.

28. Gaschen, B., et al., Diversity considerations in HIV-1 vaccine selection. Science, 2002. 296(5577): p. 2354-60.

29. Vidal, N., et al., Distribution of HIV-1 variants in the Democratic Republic of Congo suggests increase of subtype C in Kinshasa between 1997 and 2002. J Acquir Immune Defic Syndr, 2005. 40(4): p. 456-62.

30. Li, W.H., M. Tanimura, and P.M. Sharp, Rates and dates of divergence between AIDS virus nucleotide sequences. Mol Biol Evol, 1988. 5(4): p. 313-30.

31. Sabino, E.C., et al., Identification of human immunodeficiency virus type 1 envelope genes recombinant between subtypes B and F in two epidemiologically linked individuals from Brazil. J Virol, 1994. 68(10): p. 6340-6.

32. Osmanov, S., et al., Estimated global distribution and regional spread of HIV-1 genetic subtypes in the year 2000. J Acquir Immune Defic Syndr, 2002. 29(2): p. 184-90.

33. Dowling, W.E., et al., Forty-one near full-length HIV-1 sequences from Kenya reveal an epidemic of subtype A and A-containing recombinants. Aids, 2002. 16(13): p. 1809-20.

34. McCutchan, F.E., et al., HIV type 1 circulating recombinant form CRF09_cpx from west Africa combines subtypes A, F, G, and may share ancestors with CRF02_AG and Z321. AIDS Res Hum Retroviruses, 2004. 20(8): p. 819-26.

35. Monno, L., et al., HIV-1 subtypes and circulating recombinant forms (CRFs) from HIV-infected patients residing in two regions of central and southern Italy. J Med Virol, 2005. 75(4): p. 483-90.

36. Ndembi, N., et al., Genetic diversity of HIV type 1 in rural eastern Cameroon. J Acquir Immune Defic Syndr, 2004. 37(5): p. 1641-50.

37. Tee, K.K., et al., Emergence of HIV-1 CRF01_AE/B unique recombinant forms in Kuala Lumpur, Malaysia. Aids, 2005. 19(2): p. 119-26.

38. Takebe, Y., S. Kusagawa, and K. Motomura, Molecular epidemiology of HIV: tracking AIDS pandemic. Pediatr Int, 2004. 46(2): p. 236-44.

39. Najera, R., et al., Genetic recombination and its role in the development of the HIV-1 pandemic. Aids, 2002. 16 Suppl 4: p. S3-16.

40. Hoelscher, M., et al., High proportion of unrelated HIV-1 intersubtype recombinants in the Mbeya region of southwest Tanzania. Aids, 2001. 15(12): p. 1461-70.

41. Liitsola, K., et al., HIV-1 genetic subtype A/B recombinant strain causing an explosive epidemic in injecting drug users in Kaliningrad. Aids, 1998. 12(14): p. 1907-19.

42. Montavon, C., et al., CRF06-cpx: a new circulating recombinant form of HIV-1 in West Africa involving subtypes A, G, K, and J. J Acquir Immune Defic Syndr, 2002. 29(5): p. 522-30.

43. Charneau, P., et al., Isolation and envelope sequence of a highly divergent HIV-1 isolate: definition of a new HIV-1 group. Virology, 1994. 205(1): p. 247-53.

44. Vanden Haesevelde, M., et al., Genomic cloning and complete sequence analysis of a highly divergent African human immunodeficiency virus isolate. J Virol, 1994. 68(3): p. 1586-96.

45. Gurtler, L.G., et al., A new subtype of human immunodeficiency virus type 1 (MVP-5180) from Cameroon. J Virol, 1994. 68(3): p. 1581-5.

46. Yamaguchi, J., et al., Near full-length genomes of 15 HIV type 1 group O isolates. AIDS Res Hum Retroviruses, 2003. 19(11): p. 979-88.

47. Mauclere, P., et al., Serological and virological characterization of HIV-1 group O infection in Cameroon. Aids, 1997. 11(4): p. 445-53.

48. Yamaguchi, J., et al., HIV infections in northwestern Cameroon: identification of HIV type 1 group O and dual HIV type 1 group M and group O infections. AIDS Res Hum Retroviruses, 2004. 20(9): p. 944-57.

49. Quinones-Mateu, M.E., S.C. Ball, and E.J. Arts, Role of Human Immunodeficiency Virus Type 1 Group O in the AIDS Pandemic. AIDS Rev., 2000. 2: p. 190 - 202.

50. Roques, P., et al., Phylogenetic analysis of 49 newly derived HIV-1 group O strains: high viral diversity but no group M-like subtype structure. Virology, 2002. 302(2): p. 259-73.

51. Jonassen, T.O., et al., Sequence analysis of HIV-1 group O from Norwegian patients infected in the 1960s. Virology, 1997. 231(1): p. 43-7.

52. Lemey, P., et al., The molecular population genetics of HIV-1 group O. Genetics, 2004. 167(3): p. 1059-68.

53. Janssens, W., et al., Interpatient genetic variability of HIV-1 group O. Aids, 1999. 13(1): p. 41-8.

54. Loussert-Ajaka, I., et al., Variability of human immunodeficiency virus type 1 group O strains isolated from Cameroonian patients living in France. J Virol, 1995. 69(9): p. 5640-9.

55. Yamaguchi, J., et al., Evaluation of HIV type 1 group O isolates: identification of five phylogenetic clusters. AIDS Res Hum Retroviruses, 2002. 18(4): p. 269-82.

56. Hu, W.S. and H.M. Temin, Retroviral recombination and reverse transcription. Science, 1990. 250(4985): p. 1227-33.

57. Ayouba, A., et al., HIV-1 group O infection in Cameroon, 1986 to 1998. Emerg Infect Dis, 2001. 7(3): p. 466-7.

58. Ayouba, A., et al., HIV-1 group N among HIV-1-seropositive individuals in Cameroon. Aids, 2000. 14(16): p. 2623-5.

59. Bodelle, P., et al., Identification and genomic sequence of an HIV type 1 group N isolate from Cameroon. AIDS Res Hum Retroviruses, 2004. 20(8): p. 902-8.

60. Sharp, P.M., G.M. Shaw, and B.H. Hahn, Simian immunodeficiency virus infection of chimpanzees. J Virol, 2005. 79(7): p. 3891-902.

61. Hirsch, V.M., et al., An African primate lentivirus (SIVsm) closely related to HIV-2. Nature, 1989. 339(6223): p. 389-92.

62. Holmes, E.C., On the origin and evolution of the human immunodeficiency virus (HIV). Biol Rev Camb Philos Soc, 2001. 76(2): p. 239-54.

63. Hahn, B.H., et al., AIDS as a zoonosis: scientific and public health implications. Science, 2000. 287(5453): p. 607-14.

64. Allan, J.S., et al., Species-specific diversity among simian immunodeficiency viruses from African green monkeys. J Virol, 1991. 65(6): p. 2816-28.

65. Allan, J.S., et al., Isolation and characterization of simian immunodeficiency viruses from two subspecies of African green monkeys. AIDS Res Hum Retroviruses, 1990. 6(3): p. 275-85.

66. Shimada, M.K., K. Terao, and T. Shotake, Mitochondrial sequence diversity within a subspecies of savanna monkeys (Cercopithecus aethiops) is similar to that between subspecies. J Hered, 2002. 93(1): p. 9-18.

67. Sharp, P.M., et al., Origins and evolution of AIDS viruses: estimating the time-scale. Biochem Soc Trans, 2000. 28(2): p. 275-82.

68. Charleston, M.A. and D.L. Robertson, Preferential host switching by primate lentiviruses can account for phylogenetic similarity with the primate phylogeny. Syst Biol, 2002. 51(3): p. 528-35.

69. Wertheim, J.O. and M. Worobey, A Challenge to the Ancient Origin of SIVagm Based on African Green Monkey Mitochondrial Genomes. PLoS Pathog, 2007. 3(7): p. e95.

70. Corbet, S., et al., env sequences of simian immunodeficiency viruses from chimpanzees in Cameroon are strongly related to those of human immunodeficiency virus group N from the same geographic area. J Virol, 2000. 74(1): p. 529-34.

71. Gao, F., et al., Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature, 1999. 397(6718): p. 436-41.

72. Huet, T., et al., Genetic organization of a chimpanzee lentivirus related to HIV-1. Nature, 1990. 345(6273): p. 356-9.

73. Vanden Haesevelde, M.M., et al., Sequence analysis of a highly divergent HIV-1-related lentivirus isolated from a wild captured chimpanzee. Virology, 1996. 221(2): p. 346-50.

74. Peeters, M., et al., Isolation and characterization of a new chimpanzee lentivirus (simian immunodeficiency virus isolate cpz-ant) from a wild-captured chimpanzee. Aids, 1992. 6(5): p. 447-51.

75. Van Heuverswyn, F., et al., Human immunodeficiency viruses: SIV infection in wild gorillas. Nature, 2006. 444(7116): p. 164.

76. Bibollet-Ruche, F., et al., Complete genome analysis of one of the earliest SIVcpzPtt strains from Gabon (SIVcpzGAB2). AIDS Res Hum Retroviruses, 2004. 20(12): p. 1377-81.

77. Keele, B.F., et al., Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science, 2006. 313(5786): p. 523-6.

78. Nerrienet, E., et al., Simian immunodeficiency virus infection in wild-caught chimpanzees from Cameroon. J Virol, 2005. 79(2): p. 1312-9.

79. Fonjungo, P.N., et al., Molecular screening for HIV-1 group N and simian immunodeficiency virus cpz-like virus infections in Cameroon. Aids, 2000. 14(6): p. 750-2.

80. Roques, P., et al., Phylogenetic characteristics of three new HIV-1 N strains and implications for the origin of group N. Aids, 2004. 18(10): p. 1371-81.

81. Simon, F., et al., Identification of a new human immunodeficiency virus type 1 distinct from group M and group O. Nat Med, 1998. 4(9): p. 1032-7.

82. Jin, M.J., et al., Infection of a yellow baboon with simian immunodeficiency virus from African green monkeys: evidence for cross-species transmission in the wild. J Virol, 1994. 68(12): p. 8454-60.

83. Sharp, P.M., D.L. Robertson, and B.H. Hahn, Cross-species transmission and recombination of 'AIDS' viruses. Philos Trans R Soc Lond B Biol Sci, 1995. 349(1327): p. 41-7.

84. Souquiere, S., et al., Wild Mandrillus sphinx are carriers of two types of lentivirus. J Virol, 2001. 75(15): p. 7086-96.

85. Bailes, E., et al., Hybrid origin of SIV in chimpanzees. Science, 2003. 300(5626): p. 1713.

86. Courgnaud, V., et al., Characterization of a novel simian immunodeficiency virus with a vpu gene from greater spot-nosed monkeys (Cercopithecus nictitans) provides new insights into simian/human immunodeficiency virus phylogeny. J Virol, 2002. 76(16): p. 8298-309.

87. Beer, B.E., et al., Characterization of novel simian immunodeficiency viruses from red-capped mangabeys from Nigeria (SIVrcmNG409 and -NG411). J Virol, 2001. 75(24): p. 12014-27.

88. Mitani, J.C. and D.P. Watts, Demographic influences on the hunting behavior of chimpanzees. Am J Phys Anthropol, 1999. 109(4): p. 439-54.

89. Uehara, S., Predation on mammals by the chimpanzee (Pan troglodytes). Primates, 1997. 38: p. 198-214.

90. Leitner, T., et al., Sequence Diversity among Chimpanzee Simian Immunodeficiency Viruses (SIVcpz) Suggests that SIVcpzPts Was Derived from SIVcpzPtt through Additional Recombination Events. AIDS Res Hum Retroviruses, 2007. 23(9): p. 1114-8.

91. Paraskevis, D., et al., Analysis of the evolutionary relationships of HIV-1 and SIVcpz sequences using bayesian inference: implications for the origin of HIV-1. Mol Biol Evol, 2003. 20(12): p. 1986-96.

92. Arien, K.K., et al., The replicative fitness of primary human immunodeficiency virus type 1 (HIV-1) group M, HIV-1 group O, and HIV-2 isolates. J Virol, 2005. 79(14): p. 8979-90.

93. Webby, R., E. Hoffmann, and R. Webster, Molecular constraints to interspecies transmission of viral pathogens. Nat Med, 2004. 10(12 Suppl): p. S77-81.

94. Vanderford, T.H., et al., Adaptation of a diverse simian immunodeficiency virus population to a new host is revealed through a systematic approach to identify amino acid sites under selection. Mol Biol Evol, 2007. 24(3): p. 660-9.

95. Wain, L.V., et al., Adaptation of HIV-1 to its human host. Mol Biol Evol, 2007. 24(8): p. 1853-60.

96. Sharp, P.M., et al., Origins and evolution of AIDS viruses. Biol Bull, 1999. 196(3): p. 338-42.

97. Wolfe, N.D., et al., Naturally acquired simian retrovirus infections in central African hunters. Lancet, 2004. 363(9413): p. 932-7.

98. Temin, H.M., Sex and recombination in retroviruses. Trends Genet, 1991. 7(3): p. 71-4.

99. Rambaut, A., et al., The causes and consequences of HIV evolution. Nat Rev Genet, 2004. 5(1): p. 52-61.

100. van Rij, R.P., et al., Evolution of R5 and X4 human immunodeficiency virus type 1 gag sequences in vivo: evidence for recombination. Virology, 2003. 314(1): p. 451-9.

101. Boisier, P., et al., Nationwide HIV prevalence survey in general population in Niger. Trop Med Int Health, 2004. 9(11): p. 1161-6.

102. Diaz, R.S., et al., Dual human immunodeficiency virus type 1 infection and recombination in a dually exposed transfusion recipient. The Transfusion Safety Study Group. J Virol, 1995. 69(6): p. 3273-81.

103. Fang, G., et al., Recombination following superinfection by HIV-1. Aids, 2004. 18(2): p. 153-9.

104. Salminen, M.O., et al., Evolution and probable transmission of intersubtype recombinant human immunodeficiency virus type 1 in a Zambian couple. J Virol, 1997. 71(4): p. 2647-55.

105. Yerly, S., et al., HIV-1 co/super-infection in intravenous drug users. Aids, 2004. 18(10): p. 1413-21.

106. Troyer, J.L., et al., Patterns of feline immunodeficiency virus multiple infection and genome divergence in a free-ranging population of African lions. J Virol, 2004. 78(7): p. 3777-91.

107. Takehisa, J., et al., Various types of HIV mixed infections in Cameroon. Virology, 1998. 245(1): p. 1-10.

108. Goulder, P.J. and B.D. Walker, HIV-1 superinfection--a word of caution. N Engl J Med, 2002. 347(10): p. 756-8.

109. Galetto, R., et al., Dissection of a circumscribed recombination hot spot in HIV-1 after a single infectious cycle. J Biol Chem, 2006. 281(5): p. 2711-20.

110. Lemey, P. and D. Posada, Book Chapter - Introduction to recombination detection. 2007.

111. Negroni, M. and H. Buc, Retroviral recombination: what drives the switch? Nat Rev Mol Cell Biol, 2001. 2(2): p. 151-5.

112. Baird, H.A., et al., Sequence determinants of breakpoint location during HIV-1 intersubtype recombination. Nucleic Acids Res, 2006. 34(18): p. 5203-16.

113. Fan, J., M. Negroni, and D.L. Robertson, The distribution of HIV-1 recombination breakpoints. Infect Genet Evol, 2007. 7(6): p. 717-23.

114. Magiorkinis, G., et al., In vivo characteristics of human immunodeficiency virus type 1 intersubtype recombination: determination of hot spots and correlation with sequence similarity. J Gen Virol, 2003. 84(Pt 10): p. 2715-22.

115. Zhang, J. and H.M. Temin, Retrovirus recombination depends on the length of sequence identity and is not error prone. J Virol, 1994. 68(4): p. 2409-14.

116. Moumen, A., et al., The HIV-1 repeated sequence R as a robust hot-spot for copy-choice recombination. Nucleic Acids Res, 2001. 29(18): p. 3814-21.

117. Klarmann, G.J., C.A. Schauber, and B.D. Preston, Template-directed pausing of DNA synthesis by HIV-1 reverse transcriptase during polymerization of HIV-1 sequences in vitro. J Biol Chem, 1993. 268(13): p. 9793-802.

118. Derebail, S.S. and J.J. DeStefano, Mechanistic analysis of pause site-dependent and -independent recombinogenic strand transfer from structurally diverse regions of the HIV genome. J Biol Chem, 2004. 279(46): p. 47446-54.

119. Lanciault, C. and J.J. Champoux, Pausing during reverse transcription increases the rate of retroviral recombination. J Virol, 2006. 80(5): p. 2483-94.

120. Roda, R.H., et al., Strand transfer occurs in retroviruses by a pause-initiated two-step mechanism. J Biol Chem, 2002. 277(49): p. 46900-11.

121. Baird, H.A., et al., Influence of sequence identity and unique breakpoints on the frequency of intersubtype HIV-1 recombination. Retrovirology, 2006. 3: p. 91.

122. Margot, N.A., J.M. Waters, and M.D. Miller, In vitro human immunodeficiency virus type 1 resistance selections with combinations of tenofovir and emtricitabine or abacavir and lamivudine. Antimicrob Agents Chemother, 2006. 50(12): p. 4087-95.

123. Perno, C.F., V. Svicher, and F. Ceccherini-Silberstein, Novel drug resistance mutations in HIV: recognition and clinical relevance. AIDS Rev, 2006. 8(4): p. 179-90.

124. Guillon, C., et al., Evidence for CTL-mediated selection of Tat and Rev mutants after the onset of the asymptomatic period during HIV type 1 infection. AIDS Res Hum Retroviruses, 2006. 22(12): p. 1283-92.

125. Pillay, T., et al., Unique acquisition of cytotoxic T-lymphocyte escape mutants in infant human immunodeficiency virus type 1 infection. J Virol, 2005. 79(18): p. 12100-5.

126. Iversen, A.K., et al., Conflicting selective forces affect T cell receptor contacts in an immunodominant human immunodeficiency virus epitope. Nat Immunol, 2006. 7(2): p. 179-89.

127. Ciurea, A., et al., CD4+ T-cell-epitope escape mutant virus selected in vivo. Nat Med, 2001. 7(7): p. 795-800.

128. Draenert, R., et al., Immune selection for altered antigen processing leads to cytotoxic T lymphocyte escape in chronic HIV-1 infection. J Exp Med, 2004. 199(7): p. 905-15.

129. Li, Y., et al., Broad HIV-1 neutralization mediated by CD4-binding site antibodies. Nat Med, 2007. 13(9): p. 1032-4.

130. Mthunzi, P. and D. Meyer, Limited cross-reactivity between different HIV-1 clades. J Clin Virol, 2004. 31 Suppl 1: p. S88-91.

131. McMichael, A.J., HIV vaccines. Annu Rev Immunol, 2006. 24: p. 227-55.

132. Borrow, P., et al., Virus-specific CD8+ cytotoxic T-lymphocyte activity associated with control of viremia in primary human immunodeficiency virus type 1 infection. J Virol, 1994. 68(9): p. 6103-10.

133. Koup, R.A., et al., Temporal association of cellular immune responses with the initial control of viremia in primary human immunodeficiency virus type 1 syndrome. J Virol, 1994. 68(7): p. 4650-5.

134. Musey, L., et al., Cytotoxic-T-cell responses, viral load, and disease progression in early human immunodeficiency virus type 1 infection. N Engl J Med, 1997. 337(18): p. 1267-74.

135. Rinaldo, C., et al., High levels of anti-human immunodeficiency virus type 1 (HIV-1) memory cytotoxic T-lymphocyte activity and low viral load are associated with lack of disease in HIV-1-infected long-term nonprogressors. J Virol, 1995. 69(9): p. 5838-42.

136. Wang, S., et al., Association between HIV Type 1-specific T cell responses and CD4+ T cell counts or CD4+:CD8+ T cell ratios in HIV Type 1 subtype B infection in China. AIDS Res Hum Retroviruses, 2006. 22(8): p. 780-7.

137. Yusim, K., et al., Clustering patterns of cytotoxic T-lymphocyte epitopes in human immunodeficiency virus type 1 (HIV-1) proteins reveal imprints of immune evasion on HIV-1 global variation. J Virol, 2002. 76(17): p. 8757-68.

138. Nickle, D.C., et al., Consensus and ancestral state HIV vaccines. Science, 2003. 299(5612): p. 1515-8; author reply 1515-8.

139. Fischer, W., et al., Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nat Med, 2007. 13(1): p. 100-6.

140. Nickle, D.C., et al., Coping with viral diversity in HIV vaccine design. PLoS Comput Biol, 2007. 3(4): p. e75.

141. Koot, M., et al., HIV-1 biological phenotype in long-term infected individuals evaluated with an MT-2 cocultivation assay. Aids, 1992. 6(1): p. 49-54.

142. Shankarappa, R., et al., Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol, 1999. 73(12): p. 10489-502.

143. Connor, R.I. and D.D. Ho, Human immunodeficiency virus type 1 variants with increased replicative capacity develop during the asymptomatic stage before disease progression. J Virol, 1994. 68(7): p. 4400-8.

144. Koot, M., et al., Conversion rate towards a syncytium-inducing (SI) phenotype during different stages of human immunodeficiency virus type 1 infection and prognostic value of SI phenotype for survival after AIDS diagnosis. J Infect Dis, 1999. 179(1): p. 254-8.

145. Starcich, B.R., et al., Identification and characterization of conserved and variable regions in the envelope gene of HTLV-III/LAV, the retrovirus of AIDS. Cell, 1986. 45(5): p. 637-48.

146. Leonard, C.K., et al., Assignment of intrachain disulfide bonds and characterization of potential glycosylation sites of the type 1 recombinant human immunodeficiency virus envelope glycoprotein (gp120) expressed in Chinese hamster ovary cells. J Biol Chem, 1990. 265(18): p. 10373-82.

147. De Jong, J.J., et al., Minimal requirements for the human immunodeficiency virus type 1 V3 domain to support the syncytium-inducing phenotype: analysis by single amino acid substitution. J Virol, 1992. 66(11): p. 6777-80.

148. Fouchier, R.A., et al., Phenotype-associated sequence variation in the third variable domain of the human immunodeficiency virus type 1 gp120 molecule. J Virol, 1992. 66(5): p. 3183-7.

149. Fouchier, R.A., et al., Simple determination of human immunodeficiency virus type 1 syncytium-inducing V3 genotype by PCR. J Clin Microbiol, 1995. 33(4): p. 906-11.

150. Dimitrov, D.S., et al., HIV coreceptors. J Membr Biol, 1998. 166(2): p. 75-90.

151. Moore, J.P., et al., The CCR5 and CXCR4 coreceptors--central to understanding the transmission and pathogenesis of human immunodeficiency virus type 1 infection. AIDS Res Hum Retroviruses, 2004. 20(1): p. 111-26.

152. Bjorndal, A., et al., Coreceptor usage of primary human immunodeficiency virus type 1 isolates varies according to biological phenotype. J Virol, 1997. 71(10): p. 7478-87.

153. Connor, R.I., et al., Change in coreceptor use correlates with disease progression in HIV-1--infected individuals. J Exp Med, 1997. 185(4): p. 621-8.

154. Berger, E.A., et al., A new classification for HIV-1. Nature, 1998. 391(6664): p. 240.

155. Agace, W.W., et al., Constitutive expression of stromal derived factor-1 by mucosal epithelia and its role in HIV transmission and propagation. Curr Biol, 2000. 10(6): p. 325-8.

156. Meng, G., et al., Primary intestinal epithelial cells selectively transfer R5 HIV-1 to CCR5+ cells. Nat Med, 2002. 8(2): p. 150-6.

157. Reece, J.C., et al., HIV-1 selection by epidermal dendritic cells during transmission across human skin. J Exp Med, 1998. 187(10): p. 1623-31.

158. Rodrigo, A.G., Dynamics of syncytium-inducing and non-syncytium-inducing type 1 human immunodeficiency viruses during primary infection. AIDS Res Hum Retroviruses, 1997. 13(17): p. 1447-51.

159. Arien, K.K., et al., Replicative fitness of CCR5-using and CXCR4-using human immunodeficiency virus type 1 biological clones. Virology, 2006. 347(1): p. 65-74.

160. Regoes, R.R. and S. Bonhoeffer, The HIV coreceptor switch: a population dynamical perspective. Trends Microbiol, 2005. 13(6): p. 269-77.

161. Pierson, T.C., R.W. Doms, and S. Pohlmann, Prospects of HIV-1 entry inhibitors as novel therapeutics. Rev Med Virol, 2004. 14(4): p. 255-70.

162. Ickovics, J.R. and C.S. Meade, Adherence to HAART among patients with HIV: breakthroughs and barriers. AIDS Care, 2002. 14(3): p. 309-18.

163. Kutilek, V.D., et al., Is resistance futile? Curr Drug Targets Infect Disord, 2003. 3(4): p. 295-309.

164. Little, S., et al., Antiretroviral-drug resistance among patients recently infected with HIV. N Engl J Med, 2002. 347(6): p. 385-94.

165. Doranz, B., et al., A small-molecule inhibitor directed against the chemokine receptor CXCR4 prevents its use as an HIV-1 coreceptor. J Exp Med, 1997. 186(8): p. 1395-400.

166. Murakami, T., et al., A small molecule CXCR4 inhibitor that blocks T cell line-tropic HIV-1 infection. J Exp Med, 1997. 186(8): p. 1389-93.

167. Schols, D., et al., Inhibition of T-tropic HIV strains by selective antagonization of the chemokine receptor CXCR4. J Exp Med, 1997. 186(8): p. 1383-8.

168. Zou, Y.R., et al., Function of the chemokine receptor CXCR4 in haematopoiesis and in cerebellar development. Nature, 1998. 393(6685): p. 595-9.

169. Scozzafava, A., A. Mastrolorenzo, and C. Supuran, Non-peptidic chemokine receptors antagonists as emerging anti-HIV agents. J Enzyme Inhib Med Chem, 2002. 17(2): p. 69-76.

170. O'Brien, T., et al., HIV-1 infection in a man homozygous for CCR5Δ32. The Lancet, 1997. 349(9060): p. 1219-1219.

171. Dean, M., et al., Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science, 1996. 273(5283): p. 1856-62.

172. Eugen-Olsen, J., et al., Heterozygosity for a deletion in the CKR-5 gene leads to prolonged AIDS-free survival and slower CD4 T-cell decline in a cohort of HIV-seropositive individuals. Aids, 1997. 11(3): p. 305-10.

173. Huang, Y., et al., The role of a mutant CCR5 allele in HIV-1 transmission and disease progression. Nat Med, 1996. 2(11): p. 1240-3.

174. Meyer, L., et al., Early protective effect of CCR-5 delta 32 heterozygosity on HIV-1 disease progression: relationship with viral load. The SEROCO Study Group. Aids, 1997. 11(11): p. F73-8.

175. Dorr, P., et al., Maraviroc (UK-427,857), a potent, orally bioavailable, and selective small-molecule inhibitor of chemokine receptor CCR5 with broad-spectrum anti-human immunodeficiency virus type 1 activity. Antimicrob Agents Chemother, 2005. 49(11): p. 4721-32.

176. Fatkenheuer, G., et al., Efficacy of short-term monotherapy with maraviroc, a new CCR5 antagonist, in patients infected with HIV-1. Nat Med, 2005. 11(11): p. 1170-2.

177. Westby, M., et al., Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol, 2006. 80(10): p. 4909-20.

178. Cardozo, T., et al., Structural basis for coreceptor selectivity by the HIV type 1 V3 loop. AIDS Res Hum Retroviruses, 2007. 23(3): p. 415-26.

179. Rosen, O., et al., Molecular switch for alternative conformations of the HIV-1 V3 region: implications for phenotype conversion. Proc Natl Acad Sci U S A, 2006. 103(38): p. 13950-5.

180. Jensen, M.A., et al., Improved coreceptor usage prediction and genotypic monitoring of R5-to-X4 transition by motif analysis of human immunodeficiency virus type 1 env V3 loop sequences. J Virol, 2003. 77(24): p. 13376-88.

181. Jensen, M.A., et al., A reliable phenotype predictor for human immunodeficiency virus type 1 subtype C based on envelope V3 sequences. J Virol, 2006. 80(10): p. 4698-704.

182. Sing, T., Beerenwinkel, N., Lengauer, T., Learning mixtures of localized rules by maximizing the area under the ROC curve. 2004.

183. Poveda, E., et al., Correlation between a phenotypic assay and three bioinformatic tools for determining HIV co-receptor use. Aids, 2007. 21(11): p. 1487-90.

184. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.

185. Sanger, F. and A.R. Coulson, A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol, 1975. 94(3): p. 441-8.

186. Ronaghi, M., M. Uhlen, and P. Nyren, A sequencing method based on real-time pyrophosphate. Science, 1998. 281(5375): p. 363, 365.

187. Huse, S.M., et al., Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol, 2007. 8(7): p. R143.

188. O'Meara, D., et al., Monitoring Resistance to Human Immunodeficiency Virus Type 1 Protease Inhibitors by Pyrosequencing. J. Clin. Microbiol., 2001. 39(2): p. 464-473.

189. Wang, C., et al., Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res, 2007. 17(8): p. 1195-201.

190. Hoffmann, C., et al., DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.

191. Trombetti, G.A., et al., Data handling strategies for high throughput pyrosequencers. BMC Bioinformatics, 2007. 8 Suppl 1: p. S22.

192. Lewis, M., et al., Evaluation of an Ultra-Deep Sequencing Method to Identify Minority Sequence Variants in the HIV-1 env Gene from Clinical Samples, in 14th Conference on Retroviruses and Opportunistic Infections. 2007: Los Angeles, USA.

Chapter 2: Sequence Data and Phylogenetic Trees

Molecular Phylogeny

Understanding evolutionary relationships between different organisms is a

fundamental aspect of modern day biology. Tree structures are generally

used to depict these relationships. In the days of Charles Darwin rough tree

sketches were based on fossil records, morphology and geographical

distribution [1]. This is no longer the case. With the advent of sequencing

technologies [2] and the realization that both DNA and amino acid

sequences could be used to accurately determine the relationship between

different organisms [3] a plethora of tree producing algorithms have emerged

[4] along with a branch of science referred to as molecular phylogeny.

Molecular phylogeny is the science of estimating evolutionary histories using

DNA and amino acid sequences.

The first step in producing an evolutionary history is the identification of

homologous sequences. These are sequences that share a common

ancestry [5]. There are different types of homology which include orthology

and paralogy. Orthologous sequences share similarities because they

diverged from a common ancestor through speciation. Paralogous sequences,

on the other hand, share similarities due to gene duplication events within an

individual species. To infer the evolutionary history between different organisms

orthologous sequences are required. These can be aligned after which trees

representing the evolutionary relationships between the sequences can be

inferred. To improve the accuracy of the evolutionary relationships within the

tree, models of sequence evolution are incorporated. Once a tree has been

created there are many programs available for viewing and analysing the

tree topology. In this chapter a few of the many aspects of

molecular phylogeny will be discussed.

Global Alignments

Orthologous HIV sequences can be obtained from the Los Alamos HIV

sequence database using the search interface provided at

http://www.hiv.lanl.gov. Before a tree can be inferred the sequences must

be aligned. The accuracy of the alignment generated will directly affect the

quality of any inference of phylogenetic history. In 1970 Needleman and

Wunsch published an algorithm for performing a global

pairwise alignment on two sequences [6]. The algorithm matches together

as many characters as possible between two input sequences regardless of

their lengths. It uses a process referred to as dynamic programming and is

guaranteed to find the alignment with the highest score. The score between

two sequences provides information about their evolutionary relationship to

each other. When more than two sequences are present the scores

between all combinations of sequence pairs form the starting point for

producing a multiple alignment. The most famous programs implementing

this algorithm are the Clustal series of programs [7-10] and the more recent

Muscle [11]. In these programs the Needleman and Wunsch algorithm is

used to align all pairs of sequences within the input dataset in order to obtain

their pairwise scores. These scores are then used in the construction of a

rough guide tree which is in turn used to create a multiple sequence

alignment.

Given two unaligned nucleotide sequences, e.g. seq1 = ATGCTT and seq2 =

ATCTA, the first step to applying the Needleman and Wunsch algorithm is to

define a scoring system in order to score any two matched or unmatched

nucleotides. For example: if the nucleotides are identical the column scores

+1, if the nucleotides are different the column scores -1 and if a gap is

present the column scores -2. Usually there is a distinction between the gap

opening penalty and the extension penalty as from an evolutionary

perspective it is more difficult to open the first gap in the sequence rather

than to extend an already existing gap. For nucleotides there is often a

distinction between transitions and transversions as transitions are more

frequent. Ideally at this point a model of nucleotide evolution would be

incorporated into the alignment process. However methods for incorporating

such models directly into the multiple alignment process are not available

and so models are incorporated during the tree inference process. For

amino acid alignments this scoring system is usually based on a substitution

matrix such as the BLOSUM (Blocks Substitution Matrix) [12]

or PAM (Percent Accepted Mutation) [13] series. These matrices generally

reflect the probabilities of one amino acid mutating to another based on large

datasets of pairwise alignments.

From left to right any partial alignment can now be scored by the sum of the

scores for each column so far. For example using the simple system

outlined above:

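\[
\mathrm{score}\begin{pmatrix}\texttt{ATG}\\ \texttt{ATC}\end{pmatrix} = 1
\qquad\text{and}\qquad
\mathrm{score}\begin{pmatrix}\texttt{ATG}\\ \texttt{AT-}\end{pmatrix} = 0
\]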

When a column is added to the right in order to score the alignment we only

need to look at the added column and use the knowledge of the alignment

score of everything that has gone before that column:

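\[
\mathrm{score}\begin{pmatrix}\texttt{ATGC}\\ \texttt{ATCT}\end{pmatrix}
= \mathrm{score}\begin{pmatrix}\texttt{ATG}\\ \texttt{ATC}\end{pmatrix}
+ \mathrm{score}\begin{pmatrix}\texttt{C}\\ \texttt{T}\end{pmatrix}
= 1 + (-1) = 0
\]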
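
The column-by-column scoring described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this thesis; the function names are hypothetical and the scores are those of the simple scoring system defined earlier (+1 match, -1 mismatch, -2 gap).

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # simple scoring system from the text


def column_score(c1: str, c2: str) -> int:
    """Score a single alignment column; '-' denotes a gap."""
    if c1 == "-" or c2 == "-":
        return GAP
    return MATCH if c1 == c2 else MISMATCH


def alignment_score(row1: str, row2: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(column_score(a, b) for a, b in zip(row1, row2))


print(alignment_score("ATG", "ATC"))    # 1
print(alignment_score("ATG", "AT-"))    # 0
# Adding the column C/T to ATG/ATC only requires the previous score:
print(alignment_score("ATG", "ATC") + column_score("C", "T"))  # 0
```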

Using this method of scoring we can define a scoring matrix (Fig. 2.1) in

which the best score at any given position, with the exception of the first row

and column, is the maximum of one of three values:

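\[
\mathrm{score}(\mathit{index1}, \mathit{index2}) = \max
\begin{cases}
p(\mathit{c1}, \mathit{c2}) + \mathrm{score}(\mathit{index1}-1,\ \mathit{index2}-1) \\
-2 + \mathrm{score}(\mathit{index1},\ \mathit{index2}-1) \\
-2 + \mathrm{score}(\mathit{index1}-1,\ \mathit{index2})
\end{cases}
\]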

Where p(c1, c2), based on the simple scoring system above, returns 1 if c1 =

c2 and -1 if c1 ≠ c2; c1 is the character at index1 on seq1 and c2 is the

character at index2 on seq2. The -2 represents the gap penalty.
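
As a hedged sketch, the scoring matrix for the two example sequences could be populated as follows. The text does not state how the first row and column are filled; the code assumes the usual convention of cumulative gap penalties (index × -2), and the function name is hypothetical.

```python
def score_matrix(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Fill the global alignment scoring matrix (rows: seq1, columns: seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Assumption: the first row and column hold cumulative gap penalties.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            p = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + p,  # no gap: match or mismatch
                score[i][j - 1] + gap,    # gap in sequence 1
                score[i - 1][j] + gap,    # gap in sequence 2
            )
    return score


matrix = score_matrix("ATGCTT", "ATCTA")
for row in matrix:
    print(row)
```

The value in the lower right hand cell is the score of the best global alignment of the two full sequences.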

Figure 2.1 Scoring Matrix for Aligning Two Sequences

A scoring matrix generated for the two example sequences ATGCTT (sequence 1) and

ATCTA (sequence 2). Yellow indicates the score of the alignment so far by the addition of

the prefix C:C. This depends on one of 3 choices as described in the text. Red is the score

of everything that went before when no gap was introduced into the last position. Blue

indicates the previous score when a gap was introduced into the first sequence while green

indicates the previous score when a gap was introduced into sequence 2.

Once the scoring matrix has been populated it can be used by a recursive

algorithm to find the optimum alignment. The algorithm starts at the lower

right hand corner of the matrix and recursively follows the best path into the

matrix until the start has been reached (Fig. 2.2 - left). When the recursion

has ended, a process referred to as trace-back begins (Fig. 2.2 - right). This

is the return journey through the matrix for each of the previous recursive

calls. At each step a column is added to the alignment. In the addition of a

column one of three options is chosen:

i. No gap being inserted and both growing sequences being extended

by the next nucleotide (resulting in either a direct match or mismatch).

ii. A gap being inserted in sequence 1 while the next nucleotide is

inserted into sequence 2.

iii. A gap being inserted in sequence 2 while the next nucleotide is

inserted into sequence 1.

Once the trace-back is complete both sequences have been optimally

globally pairwise aligned and the relationship between the two sequences

can be assessed.
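
The following sketch combines the matrix fill with the trace-back. It is an illustration only: it replaces the recursive formulation described above with an equivalent iterative loop that walks from the lower right hand corner back to the start, it assumes cumulative gap penalties in the first row and column, and it resolves ties in the order of options i, iii, ii (an arbitrary choice). The function name is hypothetical.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (aligned seq1, aligned seq2, score) for a global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + p,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Walk back from the lower right corner, adding one column per step.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + p:
                # Option i: both sequences extended, no gap.
                row1.append(seq1[i - 1])
                row2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            # Option iii: gap in sequence 2.
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            # Option ii: gap in sequence 1.
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), score[n][m]


aligned1, aligned2, total = needleman_wunsch("ATGCTT", "ATCTA")
print(aligned1)
print(aligned2)
print(total)
```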

Figure 2.2 Recursion and Trace-back

Once recursion has been completed (left – red arrows) trace-back begins (right). During

trace-back the pairwise alignment between the two input sequences is constructed.

Local Alignments

Often only small fragments of sequences are available. To find the

position of such fragments in relation to a larger sequence, or to pairwise

align them to each other, a local pairwise alignment is more desirable. This

is because it is not desirable to spread the smaller sequence out

along the full length of the larger sequence in an attempt to match the

greatest number of characters, as the fragment already forms an intact part of

some gene. During a local alignment an attempt is made to align sequences

in relation to positions with the highest density of matches. The Smith-

Waterman algorithm for locally aligning two sequences is a slight

modification to the Needleman and Wunsch pairwise alignment algorithm

[14]. Firstly when defining the scoring matrix negatively scoring cells are set

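to zero. To do this the scoring formula score(index1, index2) becomes:

\[
\mathrm{score}(\mathit{index1}, \mathit{index2}) = \max
\begin{cases}
p(\mathit{c1}, \mathit{c2}) + \mathrm{score}(\mathit{index1}-1,\ \mathit{index2}-1) \\
-2 + \mathrm{score}(\mathit{index1},\ \mathit{index2}-1) \\
-2 + \mathrm{score}(\mathit{index1}-1,\ \mathit{index2}) \\
0
\end{cases}
\]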

This renders all the possible local alignments visible. A second modification

is then made at the start of the recursion process. Instead of starting at the lower

right hand corner the recursive process starts at the highest scoring cell and

proceeds until a cell of score zero is encountered. The path back from this

recursion will produce the best local alignment. Local alignments form an

integral part of the analysis presented in chapter 6 where many thousands of

short sequences must be indexed against a consensus template sequence.
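
A minimal sketch of the two modifications is given below, again using an iterative trace-back in place of recursion and the simple scoring system from the global alignment example. The function name and the returned start offset are illustrative assumptions; this is not the indexing code used in chapter 6.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Return (aligned seq1, aligned seq2, score, start offset in seq1)."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # Modification 1: negative scores are replaced by zero.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + p,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Modification 2: start at the highest scoring cell, stop at a zero cell.
    row1, row2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        p = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + p:
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), best, i


# Locate the short fragment GCT within the longer sequence ATGCTT.
print(smith_waterman("ATGCTT", "GCT"))
```

The final element of the result is the zero-based position in the longer sequence at which the local alignment begins, which is the kind of information needed when indexing short fragments against a template.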

Measuring Genetic Change

Once an alignment has been constructed the relationship between the

sequences can be derived from their evolutionary distances to each other. A

simple measure of evolutionary distance is a count of the number of sites

between any two sequences that are different, usually expressed as a

proportion of the sites compared. This is known as the observed distance or

p-distance. However the p-distance can underestimate

the number of substitutions that have actually taken place. This is a

result of individual sites potentially undergoing multiple substitutions which

cannot be detected by a simple count of observed differences. Such a

mutation at an individual site may include an A changing to a T and then to a

C. Here two mutations have occurred but the p-distance will only account for

one. There is also the possibility of back mutation where an A may change

to a T and then back to an A. In this case the p-distance will not take any

mutation into account. A general rule is the larger the p-distance the more

room for error [15]. Over a large amount of time and/or when dealing with

rapidly evolving genomes, such as HIV, it is therefore important to attempt to

rectify this shortcoming of p-distance to increase the accuracy of a

phylogenetic inference.
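
The observed distance can be computed directly from two aligned rows. The sketch below expresses it as a proportion of the compared sites and, as an assumption not stated in the text, skips any column containing a gap. The function name is hypothetical.

```python
def p_distance(row1: str, row2: str) -> float:
    """Proportion of compared (ungapped) sites at which two aligned rows differ."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    compared = differing = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


print(p_distance("ATGCTT", "AT-CTA"))  # 0.2
```

Note that a site that changed from A to T and then to C contributes a single difference here, and a site that changed from A to T and back to A contributes none, which is exactly the underestimation described above.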

As a result many distance correction methods have been developed. These

take into account various parameters such as base frequencies and

transversion and transition change rates. The most common models to date

are Jukes-Cantor [16], Kimura 2 Parameter [17], Felsenstein 81 [18], HKY85

[19] and the General Time Reversible model [20]. These models

are all related to each other in a nested manner. Jukes-Cantor was the first

model proposed and assumes that the four bases have equal frequencies

and that all substitutions are equally likely. Kimura 2 Parameter extends the

Jukes-Cantor model by allowing for differences in the rates of transversions

and transitions. Felsenstein 81 extends the Jukes-Cantor model by allowing

for different base frequencies. The HKY85 model combines both the Kimura

2 Parameter and the Felsenstein 81 models to allow for different transition

and transversion rates as well as to allow for varying base frequencies while

the General Time Reversible model assumes that the rate of change between a

pair of nucleotides is the same in both directions, although each pair can have

a different substitution rate from every other pair.
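
To illustrate how a correction is applied, the sketch below implements the standard closed-form distance formulas for the two simplest models, Jukes-Cantor, d = -(3/4) ln(1 - 4p/3), and Kimura 2 Parameter, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where p is the p-distance and P and Q are the proportions of sites differing by a transition and a transversion respectively. The function names are hypothetical, sequences are assumed to contain only A, C, G, T and gaps, and gapped columns are skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from a p-distance."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(P: float, Q: float) -> float:
    """Kimura 2 Parameter distance from transition (P) and transversion (Q) proportions."""
    a, b = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the correction to exist")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


def transition_transversion_proportions(row1: str, row2: str):
    """Return (P, Q) for two aligned rows, skipping gapped columns."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    compared = transitions = transversions = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return transitions / compared, transversions / compared


P, Q = transition_transversion_proportions("ATGCTT", "AT-CTA")
print(P, Q, jc69_distance(P + Q), k2p_distance(P, Q))
```

The nesting of the models can be seen numerically: when transversions are exactly twice as frequent as transitions, as Jukes-Cantor expects, the two corrections agree.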

In these models a simplifying assumption, at the expense of biological

realism, is that each position in the sequence is equally likely to undergo a

substitution. Due to varying functional constraints on different genes and

non coding regions this assumption is known to be false. In an attempt to

overcome the problem models can be modified in order to allow for the rate

of variation of different sites within the alignment e.g. HKY85 + Γ (where Γ

stands for rate variation) [21]. The most common distribution used to

describe this rate variation is the gamma distribution [22]. This distribution

has a parameter, α, that defines its shape. When α is small (<1) most sites

have a very low rate of variation but a few have a very large rate of variation.

This results in an L-shaped distribution. As α becomes larger (>1) the

distribution becomes more bell shaped resulting in sites having a smaller

range in rate variation. Estimates of α from various nuclear and

mitochondrial genes [21] range from 0.16 to 1.37 although when codon

positions are analyzed separately the value for the first and second position

is generally much smaller than that for the third position.
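
The effect of α on the shape of the distribution can be checked numerically. The sketch below assumes the common convention that the gamma distribution of relative rates is scaled to have a mean of 1 (shape α, rate α); this convention and the function name are assumptions for illustration.

```python
import math


def gamma_rate_density(rate: float, alpha: float) -> float:
    """Density at `rate` of a gamma distribution with shape alpha and mean 1."""
    if rate <= 0 or alpha <= 0:
        raise ValueError("rate and alpha must be positive")
    return (alpha ** alpha) * rate ** (alpha - 1) * math.exp(-alpha * rate) / math.gamma(alpha)


# Compare an L-shaped (alpha < 1) and a bell-shaped (alpha > 1) distribution.
for alpha in (0.5, 5.0):
    densities = [round(gamma_rate_density(r, alpha), 3) for r in (0.1, 0.8, 3.0)]
    print(alpha, densities)
```

For α = 0.5 the density falls steadily as the rate increases, so most sites sit near zero with a long tail of fast sites; for α = 5 the density peaks near a relative rate of 0.8 and falls away on either side.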

The larger value of α for the third position of codons suggests that there is

generally a higher rate of variation at these sites. This is because many

nucleotide substitutions at these sites do not affect the translation of the

codon [22], i.e. they are synonymous substitutions, and thus they are largely

free from natural selection [23]. Using the ratio between the rates of

non-synonymous (dn) and synonymous (ds) substitutions within a genome (dn/ds)

regions of positive selection can easily be identified. Once a model of

nucleotide evolution has been selected it can be used in the inference of a

tree with increased accuracy in relation to the genetic distances derived from

the alignment. Such models are generally not used for amino acid

alignments due to the increased complexity in dealing with 20 characters

instead of 4. For amino acids rates of evolution are dependent on the

substitution matrix used.
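
The claim that third codon positions tolerate more substitutions can be illustrated with the standard genetic code. The sketch below is not a dn/ds estimator; it only classifies single nucleotide changes as synonymous or non-synonymous and reports, for each codon position, the fraction of possible changes between sense codons that are synonymous. The function names are hypothetical, and changes to or from stop codons are excluded by assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon1: str, codon2: str) -> bool:
    """True if both codons encode the same amino acid."""
    return CODON_TABLE[codon1] == CODON_TABLE[codon2]


def synonymous_fraction(position: int) -> float:
    """Fraction of single-base changes at a codon position (0, 1 or 2) that are synonymous."""
    synonymous = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            if CODON_TABLE[mutant] == "*":
                continue  # skip changes that create a stop codon
            total += 1
            if is_synonymous(codon, mutant):
                synonymous += 1
    return synonymous / total


print(is_synonymous("GGA", "GGC"))  # True: both encode glycine
print(is_synonymous("ATG", "ATA"))  # False: methionine to isoleucine
for position in range(3):
    print(position + 1, round(synonymous_fraction(position), 2))
```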

Phylogenetic Trees

Once an alignment has been generated and an appropriate model of

sequence evolution has been selected a phylogenetic tree can be inferred.

Trees can be used to graphically depict the relationship among sequences

within the alignment. A tree is a mathematical structure that can be used to

represent relationships between different objects. The tree itself consists

of internal nodes, external nodes and edges (Fig. 2.3, Panel A and B). From

a biological point of view the external nodes (green) are the input

sequences. These can also be called leaf nodes. The internal nodes (blue)

represent the ancestral relationships between these sequences. These

eventually converge to the root of the tree – the estimated most recent

common ancestor. Often a closely related sequence, that is not part of the

sampled dataset (red dotted line), will be added to the alignment so that it is

easier to determine where the true root (red) lies within the group. This is

referred to as an outgroup.

Edge lengths connecting the nodes represent the amount of change that

occurs between each node. In figure 2.3 (panels A and B) because the edge

lengths have been taken into account the tree is referred to as an additive

tree. Edge lengths are calculated directly from the alignment and can vary

depending on the model of evolution that is used. A cladogram is a type of

tree that would only show the relationships between strains relative to each

other without any information on evolutionary distances. Dendrograms are a

special kind of additive tree where all the strains are the same distance from

the root. These can be used to infer a molecular clock by determining the

amount of change that has taken place in relation to time.
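
A tree with edge lengths can be represented with a very small data structure. The sketch below is illustrative only; the class and function names are hypothetical and the example tree is invented. It stores, for each node, the length of the edge joining it to its parent, sums edge lengths from the root to each leaf, and checks whether all leaves lie at the same distance from the root.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str = ""
    length: float = 0.0  # length of the edge joining this node to its parent
    children: List["Node"] = field(default_factory=list)


def leaf_depths(node: Node, depth: float = 0.0) -> Dict[str, float]:
    """Map each leaf name to the summed edge length between it and the root."""
    here = depth + node.length
    if not node.children:
        return {node.name: here}
    depths: Dict[str, float] = {}
    for child in node.children:
        depths.update(leaf_depths(child, here))
    return depths


def all_leaves_equidistant(root: Node, tol: float = 1e-9) -> bool:
    """True if every leaf is the same distance from the root (within tol)."""
    depths = list(leaf_depths(root).values())
    return max(depths) - min(depths) <= tol


# Invented example: internal node joining A and B, with C attached to the root.
tree = Node(children=[
    Node(length=0.1, children=[Node("A", 0.2), Node("B", 0.3)]),
    Node("C", 0.4),
])
print(leaf_depths(tree))
print(all_leaves_equidistant(tree))
```

Dropping the `length` values would leave only the branching order, which is the information a cladogram conveys.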

During the course of this thesis two different tree inference methods,

neighbour joining [24] and maximum likelihood [25], were used. Neighbour

joining falls into a category of tree inference methods referred to as

distance-matrix methods while maximum likelihood comes under the category of

discrete data methods [21].

Figure 2.3 Example of a Phylogenetic Tree

(A) Randomly generated tree consisting of 8 hypothetical strains labelled A – H. Internal

nodes are marked in blue and represent the ancestral relationships between the outer leaf

nodes (green). The edge lengths represent the expected amount of evolutionary change

between each of the nodes. The red dotted line represents a hypothetical outgroup that

could be used to find the root of the tree (red dot). (B) Identical tree but represented using a

different trigonometric pattern. Displaying trees in various patterns is a task of the tree

viewing software and has nothing to do with the actual tree generation method used.

Neighbour Joining

Neighbour joining relies on having the evolutionary distances between all

pairs of sequences within the dataset. These can be simple p-distances or

they can incorporate a model of sequence evolution. Sequences are

heuristically clustered two at a time to eventually produce a tree that

represents the phylogenetic relationships within the input dataset [26]. The

algorithm itself maintains an active node list, designated L, and a growing

tree, T. Initially all the leaf nodes are assigned to L and T. During the first

iteration of the algorithm, if two nodes from L, say i and j, are found to be

closest to each other then a new node k is created that will be the direct

ancestor of both i and j. The distance between k and each of the other nodes

within L is then calculated. Once these distances have been obtained k is

added to the tree. The distances between i and k and between j and k are then

calculated. k is added to L while i and j are removed from L (but not from T).

The process of selecting nearest neighbours i and j from L is then repeated.

The algorithm ends when there are just two nodes left in L. These are joined

together by an edge and added to the complete tree.

One issue that must be accounted for when re-constructing a neighbour

joining tree is that the closest pair of leaves is not necessarily a pair of neighbours

from an evolutionary perspective. When selecting i and j from L the distance

between them is compensated by subtracting the mean distance of each of

them to all other nodes within L. According to a proof presented in [26] this

is guaranteed to select the nearest evolutionary neighbours. Programs and

packages that can be used to calculate neighbour joining trees include the

Phylogenetic Analysis Library (PAL) [27], Geneious [28], Clustal [8-10],

PAUP* [29] and Phylip [30].
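
The procedure can be sketched as follows. This is a minimal illustration and not the code of any of the packages listed above. It uses the widely used form of the selection criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of distances from i to all other active nodes and n is the number of active nodes; this is equivalent to compensating d(i, j) by the averaged distances to the other nodes. The function name, the node naming scheme and the edge-list output format are assumptions.

```python
def neighbour_joining(names, dist):
    """Build a tree from pairwise distances.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of edges (node, node, length).
    """
    d = dict(dist)
    active = list(names)   # the active node list L
    edges = []             # the growing tree T
    next_id = 0
    while len(active) > 2:
        n = len(active)
        r = {a: sum(d[frozenset((a, b))] for b in active if b != a) for a in active}
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = active[x], active[y]
                q = (n - 2) * d[frozenset((a, b))] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, i, j = best
        dij = d[frozenset((i, j))]
        k = f"node{next_id}"  # new ancestor of i and j
        next_id += 1
        # Distances from k to every other active node.
        for other in active:
            if other not in (i, j):
                d[frozenset((k, other))] = (
                    d[frozenset((i, other))] + d[frozenset((j, other))] - dij
                ) / 2
        # Edge lengths from i and j to their new ancestor k.
        length_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((k, i, length_i))
        edges.append((k, j, dij - length_i))
        active = [node for node in active if node not in (i, j)] + [k]
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))  # join the last two nodes
    return edges


# Invented additive distances for four strains A-D.
pairs = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
distances = {frozenset(pair): value for pair, value in pairs.items()}
for edge in neighbour_joining(["A", "B", "C", "D"], distances):
    print(edge)
```

For these invented distances, which fit a tree exactly, the edge lengths returned reproduce the input distances when summed along the paths between leaves.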

The neighbour joining algorithm combines speed and accuracy. The tree

produced reflects the relationships defined in an input distance matrix [4, 31]

– where the distances reflect the expected change since any two paired

sequences diverged. However this makes the tree susceptible to the quality of

these distances and so care must be taken when choosing an appropriate

model of evolution. Because neighbour joining is a clustering algorithm it

does not implicitly optimize the fit between the tree and the data. A side

effect of this is that only one tree is produced and so there is no way of

viewing other potentially reasonable trees [21] nor can any statistical

information about the correctness of branches be generated. Techniques

such as bootstrapping are relied upon in order to infer the reliability of branch

order [21]. Bootstrapping is performed by randomly sampling the columns of

the input multiple alignment with replacement (Fig. 2.4). For each random sample a tree is then

inferred. This sampling-inference process is usually repeated 1000 times.

The branch order on the tree inferred from the original alignment is then compared to the trees from the random samples.

In general the more random samples that support a given branch order on

the tree the more reliable that branch order is. The minimum cut off for

reliability is normally about 70%.
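
The resampling step can be sketched as follows. This is a minimal illustration rather than the procedure implemented in any particular package: the alignment is assumed to be held as a dictionary of equal-length strings, and `infer_clades` is a hypothetical caller-supplied function that builds a tree from an alignment and returns its clades as a set of frozensets of sequence names.

```python
import random


def bootstrap_alignment(alignment, rng=random):
    """Resample the columns of an alignment with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns a pseudo-replicate alignment of the same width.
    """
    width = len(next(iter(alignment.values())))
    cols = [rng.randrange(width) for _ in range(width)]   # sampled column indices
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, replicates=1000, seed=0):
    """Fraction of replicates in which each clade of the original tree reappears."""
    rng = random.Random(seed)
    original = infer_clades(alignment)          # clades of the tree from the real data
    counts = {clade: 0 for clade in original}
    for _ in range(replicates):
        found = infer_clades(bootstrap_alignment(alignment, rng))
        for clade in original & found:
            counts[clade] += 1
    return {clade: hits / replicates for clade, hits in counts.items()}
```

A clade with a returned value of 0.7 or more would meet the cut-off mentioned above.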

Figure 2.4 Bootstrap Analysis

A neighbour joining tree (red box) is inferred from the input alignment (blue box).

Columns on the input alignment are then randomly sampled (green boxes) and a tree

inferred. This sampling-inference process is repeated – usually 1000 times. Branches on

the original tree are compared to branches on the trees from the random samples (circle).

Modified from [32].

Maximum Likelihood Trees

For the maximum likelihood approach to tree inference the tree selected is

the tree that gives the highest probability of producing the input multiple

sequence alignment [22]. This is slower than neighbour joining as many

tree topologies must be examined using an appropriate heuristic. Initially a

model of evolution is chosen. For a single tree the likelihood of producing

each site of the input alignment is then calculated by summing the

probabilities of every possible ancestral state. The likelihood for the full tree

is the product of the likelihood at each site. The Felsenstein algorithm can

be used for calculating these likelihood scores for individual trees [18]. This

is then repeated for all possible tree topologies and the tree with the highest

likelihood score is the tree with the highest probability of producing the input

alignment. However the number of possible tree topologies rapidly

increases with the number of taxa. Generally phylogenetic trees are

bifurcating [22]. This is where the number of edges leading out of each

internal node is two. For a bifurcating tree the number of rooted tree topologies

possible for n taxa is given by (2n – 3)!! = 1 × 3 × 5 × ... × (2n – 3) [26]. Thus for 2, 3, 4, 5, 6, 7, 8, 9 and 10

taxa the number of possible trees that must be scored is 1, 3, 15, 105, 945,

10,395, 135,135, 2,027,025 and 34,459,425 respectively. With these huge

numbers of trees the use of heuristic algorithms to traverse fitness peaks

within tree space is required in order to find maximum likelihood trees [33].

Programs and packages that implement such heuristics in order to calculate

maximum likelihood trees include PAL [27], Geneious [28], PAUP* [29] and

Phylip [30]. Maximum likelihood, unlike neighbour joining, does optimize the

fit between the tree and the data as it searches for the tree topology that

most likely gave rise to the input dataset. As a result the probability of

branches being correct can be assigned to the tree. Maximum likelihood is a

widely used inference method and is considered to produce the most

accurate result [25].
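
To make the likelihood calculation for a single tree concrete, the sketch below sums over every possible ancestral state at each site and multiplies the site likelihoods together (as a sum of logarithms). It assumes the Jukes-Cantor model purely for brevity, and the nested-tuple tree representation and function names are hypothetical choices for this example. The final function reproduces the tree counts listed above, which correspond to the product of odd numbers 1 × 3 × ... × (2n - 3).

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """Probability that base x is replaced by base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, site):
    """Conditional likelihoods at `node` for one alignment column.

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs. `site` maps leaf name -> observed base.
    """
    if isinstance(node, str):
        return {b: 1.0 if b == site[node] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:
        below = partial(child, site)
        for x in BASES:
            # sum over every possible state y at the child node
            result[x] *= sum(jc69_prob(x, y, length) * below[y] for y in BASES)
    return result


def log_likelihood(tree, alignment):
    """Log of the product of the per-site likelihoods for a fixed tree."""
    width = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(width):
        site = {name: seq[i] for name, seq in alignment.items()}
        root = partial(tree, site)
        total += math.log(sum(0.25 * root[b] for b in BASES))  # equal base frequencies
    return total


def rooted_topologies(n):
    """Number of rooted bifurcating topologies for n taxa."""
    count = 1
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count
```

A tree search would call `log_likelihood` for each candidate topology; with 10 taxa an exhaustive search would already need 34,459,425 such calls, which is why heuristics are used in practice.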

Once a tree has been generated the next step is to use tree viewing

software to help view and analyze the tree. Many such programs are

available and the next chapter discusses one program, CTree [34], that was

developed for this project.

References

1. Darwin, C., On the origin of species, ed. L.J. Murray. 1859.

2. Sanger, F. and A.R. Coulson, A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol, 1975. 94(3): p. 441-8.

3. Zuckerkandl, E. and L. Pauling, Molecules as documents of evolutionary history. J Theor Biol, 1965. 8(2): p. 357-66.

4. Felsenstein, J., Inferring Phylogenies. 2003(2): p. 664.

5. Koonin, E.V., Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet, 2005. 39: p. 309-38.

6. Needleman, S.B. and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.

7. Chenna, R., et al., Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res, 2003. 31(13): p. 3497-500.

8. Higgins, D.G., J.D. Thompson, and T.J. Gibson, Using CLUSTAL for multiple sequence alignments. Methods Enzymol, 1996. 266: p. 383-402.

9. Larkin, M.A., et al., Clustal W and Clustal X version 2.0. Bioinformatics, 2007. 23(21): p. 2947-8.

10. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.

11. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.

12. Henikoff, S. and J.G. Henikoff, Automated assembly of protein blocks for database searching. Nucleic Acids Res, 1991. 19(23): p. 6565-72.

13. Dayhoff, M., Survey of new data and computer methods of analysis. Atlas of protein sequence and structure, 1978. 5.

14. Smith, T.F., Waterman, M.S., Comparison of biosequences. Advances in Applied Mathematics, 1981. 2.

15. Salemi, M., Vandamme, A.M., The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. 2004: Cambridge University Press. 406.

16. Jukes, T.H., Cantor, C., Evolution of Protein Molecules. In Mammalian Protein Metabolism. 1969: New York: Academic Press.

17. Kimura, M., A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol, 1980. 16(2): p. 111-20.

18. Felsenstein, J., Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol, 1981. 17(6): p. 368-76.

19. Hasegawa, M., H. Kishino, and T. Yano, Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol, 1985. 22(2): p. 160-74.

20. Rodriguez, F., et al., The general stochastic model of nucleotide substitution. J Theor Biol, 1990. 142(4): p. 485-501.

21. Page, R.D.M. and E.C. Holmes, Molecular Evolution: a Phylogenetic Approach. 1998.

22. Nei, M. and S. Kumar, Molecular Evolution and Phylogenetics. 1333 ed. 2000: Oxford University Press.

23. Miyata, T., T. Yasunaga, and T. Nishida, Nucleotide sequence divergence and functional constraint in mRNA evolution. Proc Natl Acad Sci U S A, 1980. 77(12): p. 7328-32.

24. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.

25. Steel, M. and D. Penny, Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol, 2000. 17(6): p. 839-50.

26. Durbin, R., Eddy, S., Krogh, A., Mitchison, G., Biological sequence analysis: Probabilistic models of proteins and nucleic acids. 1998: Cambridge University Press.

27. Drummond, A. and K. Strimmer, PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 2001. 17(7): p. 662-3.

28. Drummond, A.J., Ashton, B., Cheung, M., Heled, J., Kearse, M., Moir, R., Stones-Havas, S., Thierer, T., Wilson, A., Geneious. 2007. p. http://www.geneious.com/.

29. Swofford, D.L., PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). 2003, Sinauer Associates, Sunderland, Massachusetts.

30. Felsenstein, J., PHYLIP (Phylogeny Inference Package). 2005, Department of Genome Sciences, University of Washington, Seattle.

31. Takahashi, K. and M. Nei, Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol, 2000. 17(8): p. 1251-8.

32. Baldauf, S.L., Phylogeny for the faint of heart: a tutorial. Trends Genet, 2003. 19(6): p. 345-51.

33. Whelan, S., New approaches to phylogenetic tree search and their application to large numbers of protein alignments. Syst Biol, 2007. 56(5): p. 727-40.

34. Archer, J. and D.L. Robertson, CTree: comparison of clusters between phylogenetic trees made easy. Bioinformatics, 2007. 23(21): p. 2952-3.

Chapter 3: CTree - comparison of clusters between phylogenetic trees made easy

Abstract

CTree has been designed for the quantification of clusters within viral

phylogenetic tree topologies. Clusters are stored as individual data

structures from which statistical data, such as the Subtype Diversity Ratio

(SDR), Subtype Diversity Variance (SDV) and pairwise distances can be

extracted. This simplifies the quantification of tree topologies in relation to

inter and intra cluster diversity. Here the novel features incorporated within

CTree, including the implementation of a heuristic algorithm for identifying

clusters, are outlined along with the more usual features found within general

tree viewing software. CTree is available as an executable jar file from:

http://www.manchester.ac.uk/bioinformatics/ctree.

Introduction

There are many programs available for viewing phylogenetic trees. A

comprehensive list of such programs, maintained by Joe Felsenstein, can be

found at http://evolution.genetics.washington.edu/phylip/software.html. Few

however are specialized in the quantification of clustering present on such

trees. Quantifying clusters is biologically important in relation to both

vaccine design and epidemiology [1, 2]. Within CTree (Fig. 3.1) clusters of

strains can be treated as unique data structures – allowing for easy

quantification and comparison. Clusters can be manually populated by the

user or alternatively by use of a novel heuristic clustering algorithm. Use of

such an algorithm removes the subjectivity of the individual when allocating

strains to individual clusters e.g. on random trees or on trees with no

previously identified clusters. CTree also incorporates other useful features

for phylogenetic analysis that are rarely included within current tree viewing

software. The input for CTree is a tree string in NEWICK format (e.g. phylip

output .ph or .phb). Trees can be saved as a pdf, in NEWICK format or as

binary files that will maintain any edits made to them. Implementation was

done using the Java SDK and so CTree is platform independent.

Novel Features

(i) Heuristically Defining Clusters: If the number of taxa present on the tree is

less than 125 a heuristic algorithm can be used to allocate individual strains

to clusters on the tree topology. This feature is useful when no previous

clusters have been published for a particular dataset, as well as for finding

clusters on random trees in a systematic manner (such trees can be used as

a control).

(ii) Manually Defining Clusters: The user can manually assign strains to

clusters. This allows for the comparison between previously identified

clusters on different trees e.g. trees representing HIV-1 groups M and O.

Figure 3.1 Interface for CTree

Screen shot of CTree containing a neighbour joining tree of HIV-1 group M envelope

reference strains. The colours indicate clusters that were manually selected based on the

LANL HIV Sequence database designations (http://hiv-web.lanl.gov). When the heuristic

clustering algorithm is used a similar cluster set for group M is produced. For each cluster

the centre cluster strain (as indicated by the labels) is the strain with the minimum mean

pairwise distance to all other strains within the cluster.

(iii) SDR And SDV: These two statistics are used to quantify the degree of

clustering present on a tree topology. The SDR is defined as the ratio of the

mean intra cluster pairwise distance to the mean inter cluster pairwise

distance [3]. Low intra pairwise distances relative to inter pairwise distances

imply the presence of more defined clusters. The SDV is a measure of the

variation within the ratio of the mean intra- cluster pairwise distance to the

mean inter- cluster pairwise distance calculated for each cluster on the tree

[1]. The lower the SDV the more symmetrical the clusters present. Together

these two statistics quantify the presence of clustering within tree topologies.

(iv) Working with Random Trees: CTree provides a random birth model of

exponential population growth for the generation of random trees. The user

can generate a single random tree or specify a number of random trees to

be generated. Terminal nodes from trees generated can be randomly

sampled. Clustering analysis can be performed on the trees with the end

result being a null distribution for the various statistics (including SDR and

SDV).

(v) Finding the Center of the Tree (COT): COT is the point on a tree with the

smallest average distance from each of the strains on the tree. As well as

having implications for vaccine design [2] it is a useful reference point on a

tree.

The Heuristic Clustering Algorithm

The algorithm (Fig. 3.2) is based on finding the cluster set with the minimum

SDR value [3]. The runtime is of the order of n^3 where n is the total number

of vertices within the tree. There are two phases:

(i) The Explore Phase: Initially m cluster sets for the input tree are created

during m iterations of steps 1– 6: m is dependent on the number of

incrementations used for the threshold distance (t) in step 2.

Step 1: Each strain on the tree is designated as a potential central cluster

strain – each representing a potential cluster.

Step 2: Potential clusters are individually populated by adding strains falling

within t from the central cluster strain.

Step 3: Potential clusters are sorted (largest first). All strains within the

largest potential cluster are removed from all other potential clusters. The

remaining potential clusters are resorted. This step is repeated until there

are no duplicated strains between clusters.

Step 4: Strains belonging to each potential cluster are checked to see

whether or not they are more suited to a different potential cluster. Suitability

is based on an individual member’s proximity to its current potential central

cluster strain and its proximity to the central cluster strains representing other

potential clusters.

Step 5: The SDR value for the remaining populated potential clusters is

calculated and stored.

Step 6: t is incremented by a predefined amount.

(ii) The Selection Phase: In the selection phase the potential cluster set that

produced the lowest SDR value is selected as the cluster set to represent

the input tree.
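
The two phases can be sketched as follows. This is a simplified illustration under stated assumptions, not the CTree implementation: distances are supplied by a hypothetical function `d(a, b)`, the thresholds are passed in as a list rather than generated by a fixed increment, ties between equally large potential clusters are broken by strain name, a potential cluster whose own central strain has already been claimed is discarded, and in Step 4 each strain is simply moved to the nearest surviving central strain.

```python
from itertools import combinations


def sdr(clusters, d):
    """Mean intra-cluster distance over mean inter-cluster distance (None if undefined)."""
    label = {s: i for i, members in enumerate(clusters) for s in members}
    intra, inter = [], []
    for a, b in combinations(sorted(label), 2):
        (intra if label[a] == label[b] else inter).append(d(a, b))
    if not intra or not inter:
        return None
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


def explore(strains, d, t):
    """Steps 1-4 for a single threshold t; returns a list of disjoint clusters."""
    # Steps 1 and 2: every strain seeds a potential cluster of the strains within t of it.
    potential = {c: {s for s in strains if d(c, s) <= t} for c in strains}
    # Step 3: accept the largest potential cluster, remove its members from the
    # remaining potential clusters, and repeat until none are left.
    centres = []
    while potential:
        centre = max(potential, key=lambda c: (len(potential[c]), c))
        members = potential.pop(centre)
        centres.append(centre)
        # Assumption: potential clusters whose centre has been claimed are dropped.
        potential = {c: m - members for c, m in potential.items() if c not in members}
    # Step 4: move each strain to the cluster whose central strain is closest.
    final = {c: set() for c in centres}
    for s in strains:
        final[min(centres, key=lambda c: (d(c, s), c))].add(s)
    return list(final.values())


def heuristic_clusters(strains, d, thresholds):
    """Explore phase over all thresholds, then selection of the lowest-SDR cluster set."""
    best_score, best_clusters = None, None
    for t in thresholds:                      # Step 6: t takes successive values
        clusters = explore(strains, d, t)
        score = sdr(clusters, d)              # Step 5
        if score is not None and (best_score is None or score < best_score):
            best_score, best_clusters = score, clusters
    return best_score, best_clusters
```

Thresholds that place every strain in its own cluster, or all strains in one cluster, leave the SDR undefined and are skipped in this sketch.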

Figure 3.2 Clustering Sequences on a Tree

Diagrammatic representation of the algorithm implemented within CTree for automatically

finding clusters. The upper dashed box corresponds to phase 1 of the algorithm while the

lower dashed box corresponds to phase 2.

Standard Features

CTree provides the user with many standard tree viewing features including:

(i) re-rooting the tree, (ii) obtaining statistical information such as pairwise

distances between strains, (iii) swapping the order of sibling strains, (iv)

manually removing strains from the tree, (v) removing strains randomly/non-

randomly from the tree, (vi) an improved search interface that allows the

user to color strains based on search criteria, (vii) basic coloring of the tree,

(viii) loading multiple trees from a file containing more than one tree and

conveniently scrolling through them, (ix) allowing the user to obtain lists of

strains within a user specified proximity to each other and (x) allowing the

user to define the distance covered by the scale bar associated with the tree.

Typical Usage

With the range of tree viewing features available within CTree the user can

perform all of the operations that would be required to create a

publishable phylogenetic tree. The novel features of CTree do not

complicate this process. A typical usage is illustrated in the study of the

clusters present within HIV-1 group M and O data [1]. Here CTree was used

to manually divide multiple trees from each of the group M and O datasets

into clusters. SDR and SDV values calculated from these clusters were then

compared to each other as well as to heuristically defined clusters present

on 1000 randomly generated trees. The end result was the production of a

definitive model of HIV-1 subtype emergence and the highlighting of the

misleading nature of the group M subtype classification system when used at

the center of the pandemic.

Acknowledgements

We are very grateful to Andy Rambaut for his insightful comments and his

advice concerning the generation of random trees. We would also like to

thank John Pinney for related discussions. JA is supported by a BBSRC

studentship.

References

1. Archer, J. and D.L. Robertson, Understanding the diversification of HIV-1 groups M and O. Aids, 2007. 21(13): p. 1693-700.



Abstract

Objective: To quantify the similarity (or lack of) between the phylogenetic

substructure of HIV-1 groups O and M.

Methods: Two phylogenetic tree statistics, the subtype diversity ratio, SDR,

and the subtype diversity variance, SDV, were used in conjunction with

bootstrap replicates on gag, pol and env sequence alignments of group O

and M strains. Randomly generated phylogenetic trees were used as a

control.

Results: We show that as expected the established global-group M

subtypes have a high degree of phylogenetic symmetry in relation to each

other in terms of inter- and intra-subtype diversification. They are

significantly different from the substructure present amongst the random

trees. To the contrary the group O diversification does not display this highly

symmetrical substructure and is not significantly different from the

substructure present on randomly generated trees. Phylogenies comprised

of group M strains from the epicentre of the HIV/AIDS pandemic, the

Democratic Republic of Congo (DRC), exhibit a substructure more similar to

group O than to global-group M.

Conclusions: The substructure present within groups O and M is

quantifiably different. The well-defined clades, the subtypes that

characterize the group M diversification, are not present in group O or

amongst group M strains from the DRC. The group M subtypes are thus

unique and a signature of pandemic HIV-1.

Introduction

Three different cross species transmission events involving Simian

Immunodeficiency Virus (SIV) from apes have resulted in three distinct

phylogenetic lineages of HIV-1 in humans [1]. These lineages termed

“groups” are labeled M (main), O (Outlier) and N (non-M/O) [2]. Group M is

almost entirely responsible for the global HIV/AIDS pandemic and was

established in the human population as the result of a single cross species

transmission event of SIV from the chimpanzee subspecies Pan troglodytes

troglodytes [3, 4]. Group O has remained endemic to Cameroon and is also

probably the result of a single cross species transmission event involving P.

t. troglodytes. SIVs more closely related to HIV-1 group O have been

recently isolated from the gorilla subspecies Gorilla gorilla gorilla but, as with

humans, it is most likely that they have been acquired from P. t. troglodytes

[1]. HIV-1 group N also most probably originated from P. t. troglodytes [5]

and has been restricted to a few Cameroonians [6].

The HIV-1 strains that comprise group M cluster into distinct and strongly

supported clades in phylogenetic trees. These clusters are referred to as

“subtypes” and have been labeled A to D, F to H, J and K [2]. “E” and “I”

have been relabeled as “Circulating Recombinant Forms” CRF01 and

CRF04, respectively, of which 34 are now designated (LANL HIV Sequence

database, http://hiv-web.lanl.gov). These CRFs are intersubtype

recombinants descended from the same recombination event(s) that infect

multiple individuals and form distinct phylogenetic clusters. With the

exception of subtype K, which is restricted to Central Africa, the subtypes are

distributed within one or more risk groups on a global scale (LANL HIV

Sequence database). Contribution to the group M pandemic by each of the

subtypes varies greatly with C (47%), A (27%) and B (12%) being the most

prevalent [7].

It has been estimated that the most recent common ancestor for all group M

strains existed during the early 1930s [8] and that the DRC was the most

probable location of this strain [9]. Uniquely for a geographical region, strains

related to all the nine subtypes have been identified in the DRC along with a

large number of unclassified strains. A high degree of intra-subtype diversity

has also been observed within the region with many strains falling basal to

the subtype ancestral node [10]. The latter study further showed that group

M strains from the DRC had very little organized substructure compared to

their global counterparts, and surmised that the highly structured global

subtypes were a result of chance exportations of DRC strains to new

susceptible global risk groups.

In contrast to group M, strains belonging to group O have been

geographically restricted to Cameroon where in the mid-nineties the group

was responsible for only 1% of HIV-1 infections in the far north of the

country, while in the capital it was responsible for about 6% [11]. Other

studies have shown that in certain areas of the country the prevalence is

lower. For example in the Northwest less than 1% of HIV infections were

observed to be caused by group O [12]. Very few isolated cases outside of

Cameroon have been documented [13] and all have been epidemiologically

linked to Cameroon. It has been estimated that group O’s most recent

common ancestor was present in the human population at about the same

time as group M’s (the 1930s with some considerable error) based on

comparable sequence divergence [14] and more detailed dating-analysis

[15].

The significance of the HIV-1 subtypes, and whether or not group O can be

subdivided into clusters that are biologically equivalent to group M subtypes,

is unclear [14, 16-18]. Resolution of this issue will have implications for any

future group O vaccine design and possibly for drug based intervention

strategies [19]. In this study we have addressed this question by using

bootstrapping and two phylogenetic tree metrics that quantitatively measure

the extent of strain clustering. We used these metrics to compare trees

constructed from HIV strains from groups M and O. Randomly constructed

phylogenetic trees were used in the analysis as a control. We extend our

comparative study to group M strains from the DRC and generate a model

for HIV-1 diversification. We do not analyze group N as very few strains let

alone sequences are available.

Methods

Tree Metrics. For a phylogenetic tree the subtype diversity ratio (Fig. 4.1, Panel A), SDR, is defined as the ratio of the mean within-cluster (intra)

pairwise distance to the mean between-cluster (inter) pairwise distance [10].

The SDR is therefore a quantitative measure of the extent of clustering found

within a tree. Low intra-cluster pairwise distances relative to inter-cluster

pairwise distances imply more defined clustering in the tree. Thus trees

with lower SDR values are characterized by well defined clusters. An SDR

approaching one would indicate a lack of clustering in the tree. As

the SDR does not take into account the variability that can occur between

individual clusters the subtype diversity variance (Fig. 4.1, Panel B), SDV,

was devised. The SDV statistic is a measure of the variation within the ratio

of the mean intra-cluster pairwise distance to the mean inter-cluster pairwise

distance calculated for each cluster on the tree. The lower the SDV value

the more symmetrical, or equidistant, the clusters in a tree are relative to

each other.
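
The two metrics can be written down compactly. The sketch below is one possible reading of the definitions above and is offered as an assumption rather than as the exact calculation performed by CTree: for the SDV, each cluster's ratio is taken as its mean intra-cluster distance divided by the mean distance between its members and all strains outside it, clusters with a single member are skipped, and the population variance of these ratios is reported. The distance function `d(a, b)` is a hypothetical stand-in for tree path lengths, and at least two clusters, one of them with two or more members, are assumed.

```python
from itertools import combinations
from statistics import mean, pvariance


def sdr_and_sdv(clusters, d):
    """Return (SDR, SDV) for a list of clusters and a distance function d(a, b)."""
    intra_all, inter_all, ratios = [], [], []
    for idx, members in enumerate(clusters):
        # pairwise distances within this cluster
        intra = [d(a, b) for a, b in combinations(sorted(members), 2)]
        # distances from this cluster's members to every strain outside it
        others = [s for j, c in enumerate(clusters) if j != idx for s in c]
        inter = [d(a, b) for a in members for b in others]
        intra_all += intra
        inter_all += inter        # each inter pair is seen twice; the mean is unaffected
        if intra and inter:
            ratios.append(mean(intra) / mean(inter))
    return mean(intra_all) / mean(inter_all), pvariance(ratios)
```

Two tight, equally spaced clusters give a low SDR and an SDV of zero, whereas clusters of unequal spread give per-cluster ratios that differ and hence a larger SDV.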

Figure 4.1 Subtype Diversity Ratio and Subtype Diversity Variance

(A) The relationship between the SDR and the quality of clustering on a phylogenetic tree as

defined in [10]. (B) The relationship between the SDV and the variability of the quality of

clustering.

Datasets. To investigate group O and “global” group M phylogenies,

sequences from the p24 (gag), p32 (pol) and gp160 (env) genomic regions

were analysed. For each region all of the group O sequences available in

the LANL HIV Sequence Database were obtained. Sequences belonging to

each region were aligned separately using CLUSTAL W [20]. Gap-

containing sites in the alignments were excluded. To exclude clones and

sequences likely to be from the same host, sequences that were related by

more than 95% were identified and for any such pairs, or groups, only one

random representative was retained. The final group O datasets contained

54, 46 and 53 sequences, respectively. For the global group M data,

because there were far more sequences available, a sampling process was

implemented. Strains from the center of the pandemic (DRC) were removed

from this dataset. For each region 110 strains were sampled 100 times from

the LANL HIV Sequence Database. Each of these datasets was then

processed in a similar manner to the group O datasets.
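
The two filtering steps applied to each alignment (removal of gap-containing sites and thinning of closely related sequences) could be sketched as below. The source does not state how relatedness was measured or how groups were formed, so the sketch makes two labelled assumptions: relatedness is taken to be the fraction of identical sites, and sequences are grouped by single linkage, that is, two sequences share a group if they are connected by a chain of pairs above the threshold. All function names are hypothetical.

```python
import random


def strip_gap_columns(alignment, gap="-"):
    """Drop every alignment column that contains a gap in any sequence."""
    names = list(alignment)
    width = len(alignment[names[0]])
    keep = [i for i in range(width) if all(alignment[n][i] != gap for n in names)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def thin_similar(alignment, threshold=0.95, seed=0):
    """Keep one random representative from each group of closely related sequences."""
    rng = random.Random(seed)
    names = list(alignment)
    group = {n: n for n in names}            # union-find parent pointers

    def find(n):
        while group[n] != n:
            n = group[n]
        return n

    # Assumption: single linkage on pairwise identity above the threshold.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(alignment[a], alignment[b]) > threshold:
                group[find(a)] = find(b)

    members = {}
    for n in names:
        members.setdefault(find(n), []).append(n)
    kept = [rng.choice(g) for g in members.values()]
    return {n: alignment[n] for n in kept}
```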

For the group M DRC data, 195 partial env sequences (V3-V5 region)

sampled in 1997 [9] along with 56 sequences sampled during the mid 1980’s

[21] were obtained from the LANL HIV Sequence Database. After

designated recombinants were removed, leaving a total of 230 sequences,

these two datasets were aligned. Columns containing gaps were stripped

from the alignment. A phylogenetic tree was inferred as described below.

The global gp160 sequences from the group M representative tree (see next

paragraph), 288 sequences sampled in 2002 [22] and two sequences

described in [23] were then aligned with the mid 1980’s and 1997 sequences

so that a phylogenetic tree that represented all of the available group M

diversity could be inferred.

Phylogenetic Analysis. The Phylogenetic Analysis Library [24] was used to

create neighbor joining trees for each of the datasets described. The HKY85

model of nucleotide substitution was used with a transition/transversion ratio

of 2. Each tree was divided into clusters from which SDR and SDV values

were calculated. The software implemented to do this analysis (CTree) is

available at http://www.manchester.ac.uk/bioinformatics/resources. The

identification of group O clusters was based on previous analysis [18], while

the group M subtypes were based on the designations that were present in

the LANL HIV sequence database. For each of the group M p24, p32 and

gp160 regions the tree with the SDR value closest to the overall mean SDR

for a given region was chosen as a representative tree for that region.

Bootstrap analysis using PAUP* [25] was performed and the number of

bootstrap replicates out of 1,000 supporting each of the subtypes in these

representative trees recorded. For the three group O trees bootstrap values

for the defined clusters were obtained in a similar way.

Random Trees. 1,000 random trees each containing 1,000 terminal nodes

were generated using a random birth model of exponential population

growth. In order to simulate the random sampling of HIV strains from larger

host populations 50 strains were randomly selected from each of the trees.

The trees were divided into clusters using a heuristic algorithm that

minimized the SDR – a similar process to the one described in [10]. SDR

and SDV values were then calculated for each of the trees. CTree was

also used to generate the trees, perform the sampling and divide the trees

into clusters.
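
The generation and sampling of random trees can be illustrated with a pure-birth (Yule) process, in which every lineage splits at a constant rate so that the expected number of lineages grows exponentially. Whether this matches the birth model used in CTree in every detail is an assumption; the data layout (dictionaries of parents and branch lengths), the birth rate of 1.0 and the function names are likewise hypothetical.

```python
import random


def yule_tree(n_tips, birth_rate=1.0, rng=random):
    """Grow a pure-birth tree until it has n_tips tips.

    Returns (parent, length, tips): parent[v] is the node above v (None for
    the root), length[v] is the branch leading to v, tips lists the leaves.
    """
    parent, start, end = {0: None}, {0: 0.0}, {}
    tips, now, next_id = [0], 0.0, 1
    while len(tips) < n_tips:
        # With k lineages each splitting at rate birth_rate, the waiting time
        # to the next split is exponential with rate k * birth_rate.
        now += rng.expovariate(birth_rate * len(tips))
        mother = tips.pop(rng.randrange(len(tips)))
        end[mother] = now
        for _ in range(2):
            parent[next_id], start[next_id] = mother, now
            tips.append(next_id)
            next_id += 1
    now += rng.expovariate(birth_rate * len(tips))   # let the newest tips grow
    for tip in tips:
        end[tip] = now
    return parent, {v: end[v] - start[v] for v in parent}, tips


def tip_distance(a, b, parent, length):
    """Path length between two nodes through their most recent common ancestor."""
    depth, node, total = {}, a, 0.0
    while node is not None:          # record the distance from a to each ancestor
        depth[node] = total
        total += length[node]
        node = parent[node]
    node, total = b, 0.0
    while node not in depth:         # climb from b until a shared ancestor is met
        total += length[node]
        node = parent[node]
    return total + depth[node]
```

Under these assumptions, one replicate of the procedure described above would call `yule_tree(1000)`, draw 50 tips with `rng.sample(tips, 50)` and pass the pairwise `tip_distance` values of the sampled tips to the clustering step.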

Results

The representative tree (Fig. 4.2, Panel A) for the group M envelope region

displays the characteristic ‘starburst’ structure that apparently defines this

group’s phylogenetic diversification. This phylogenetic substructure is

characterized by a double-star phylogeny, i.e., a tendency for long branch

lengths within subtypes that coalesce near the ancestral node of the subtype

and long pre-subtype branches that coalesce near the root of the entire tree

[26]. As a result strains within any given subtype are always more closely

related to each other than they are to strains belonging to a different

subtype. Subtypes that were supported by a bootstrap value of 90% or

greater have been marked with an ‘*’. The trees for the p24 and p32 regions

(Appendix I, Panels A and B) also exhibit this starburst tendency and high

bootstrap support for the designated subtypes. The mean SDR values for

the group M p24, p32 and gp160 regions were 0.47 (±0.004), 0.50 (±0.003)

and 0.49 (±0.002), respectively. The global-group M SDR distribution is

significantly lower than the distribution produced by the randomly generated

trees (Fig. 4.3), indicating that the global-group M subtypes are significantly

better defined than would be expected under a random process. The group

M SDV values for the same regions were 0.014 (±0.001), 0.012 (±0.0005)

and 0.017 (±0.0002). These low variances reflect the highly symmetrical or

equidistant nature of the group M phylogenetic substructure.

Figure 4.2 Phylogenetic History of HIV-1 Group M and Group O

Inference of the evolutionary histories (phylogenetic trees) of HIV-1 groups M and O. Panel

A displays a tree that was re-constructed using global-group M envelope gp160 sequences.

In panel B the tree was re-constructed using group O gp160 sequences. In panel C the tree

was re-constructed using group M envelope V3-V5 sequences isolated from the DRC region

in 1997 [10]. The ‘*’ indicates bootstrap support greater than 90%. The scale bar

corresponds to nucleotide substitutions per site. In panels A and B the bold lettering

corresponds to the subtype designations from the LANL HIV Sequence Database. In panel

C the bold numbers represent the clusters that correspond to previously proposed clusters

[18].

Figure 4.3 Subtype Diversity Ratio Distributions

Frequency distribution of the SDR statistic for global group M and random phylogenies, and

individual group O and DRC phylogenies. The dark grey bars represent the distribution of

SDR values obtained from the global group M data across the p24, p32 and gp160 regions

of the genome. The light grey bars represent the distribution of SDR values that were

obtained from the randomly generated phylogenetic trees. “O” is the location of the mean

SDR value obtained from group O phylogenies for each of the p24, p32 and gp160 regions

of genome. “M_DRC” is the SDR value obtained from a group M phylogeny from the DRC

region in 1997 for the V3-V5 region of the genome.

In the group O phylogenetic tree for the gp160 region (Fig. 4.2, Panel B) the

highly symmetrical starburst tendency is absent. The topology consists of

weakly defined clusters with the cluster ancestral nodes falling deep within

the tree. This weak clustering is maintained across the p24 and p32 regions

of the genome (Appendix II, Panels A and B). The SDR values for the p24,

p32 and gp160 regions were 0.58, 0.55 and 0.58 respectively. In each case

these are significantly larger than the corresponding group M values (t-test, p

< 0.001) confirming the weak clustering. Furthermore the mean group O

SDR value of 0.57 (labeled by “O”) falls inside the SDR distribution produced

by the random trees (Fig. 4.3). For group O the p24, p32 and gp160 regions

produced SDV values of 0.07, 0.03 and 0.03, respectively, indicating a lack

of symmetry between the defined clusters. These values are all significantly

larger than the corresponding group M values implying that the group O

clusters are quantifiably less symmetrical than their group M counterparts.

As a result of the weakly defined group O clusters it is not possible to be

confident that any two individual strains from a given cluster are more closely

related to each other than they are to strains from a different cluster. For

example strains O.CM.98.98CMA123 and O.CM.96.96CMABB009 both

belong to cluster III and have an evolutionary distance between them of

0.188 nucleotide substitutions per site. However the distance between

O.CM.98.98CMA123 and O.CM.98.98CMA307, which belongs to cluster II,

is 0.185 nucleotide substitutions per site. Bootstrap support for the group O

cluster ancestral node is also not as consistent as for the group M trees. For

example on the tree representing the p32 region only two clusters have a

bootstrap support of more than 90%.

We also performed the SDR/SDV analysis on phylogenies inferred from the

V3-V5 region of env. 100 truncated group M sequences (corresponding to

the strains in Fig. 4.2, Panel A) and all 97 group O sequences available in

this region were used. The SDR for group O was 0.632 while the SDV was

0.033, and for group M the values were 0.54 and 0.01, respectively. The

group O SDR value still falls within the random tree SDR distribution

presented in Fig. 4.3, while the group M SDR value falls outside 95% of the

random values. In the case of the group M tree (not shown), clusters

representing subtypes were still clearly defined (as in Fig. 4.2, Panel A).

This was not the case for the group O trees (not shown) where the topology

was similar to Fig. 4.2, Panel B. Note, we do not analyze complete genome

group O sequences as there are only 21 available. These are not a

representative sample of group O diversity and this number is also too low

for reliable testing.

Fig. 4.2, Panel C shows the tree re-constructed from the group M data for

the V3-V5 region sampled from the DRC [9, 21]. The characteristic well

defined starburst shape that defines the global-group M tree is absent. This

is due to the shorter evolutionary branch lengths that define the subtypes A,

D, J, G, F and K being less prominent. In Fig. 4.3 the SDR value of 0.68

(labeled “M_DRC”), obtained from the 1980’s and 1997 DRC strains, falls

inside the SDR distribution produced by the random trees and significantly

outside the SDR distribution produced by the global-group M, consistent with

Rambaut et al.’s previous study [10]. However, the SDV value for the group

M DRC data (0.0045) was significantly lower than the SDV produced by the

global-group M data indicating the presence of older more established

endemic lineages. In a similar way to the group O clusters and unlike the

global-group M subtypes, two individual strains belonging to the same

subtype are not necessarily more closely related to each other than they are

to a strain belonging to a different subtype. For example strains

A.CD.97.MBS26 and A.CD.97.KP28 both labeled as subtype A have an

evolutionary distance between them of 0.193 nucleotide substitutions per

site. However the distance between A.CD.97.KP28 and D.CD.97.KS2,

which is labeled as subtype D is less (0.169 nucleotide substitutions per

site). In Fig. 4.4, the strains that make up Fig. 4.2, Panels A and C, are

combined with the 288 DRC sequences sampled in 2002 [22] as well as the

2 strains described in [23]. The SDR value for this tree was 0.67 while the

SDV was 0.009. Strains that have been isolated from within the DRC region

(black) can be observed to fall deep within the tree in relation to the clusters

of globally isolated strains (red). Indeed the diversity within the subtypes

after the inclusion of the strains from the DRC has increased markedly.

Figure 4.4 The Center of the Group M Pandemic

Inference of the evolutionary history of global group M and DRC HIV-1 strains. The black

branches correspond to DRC data from [9, 21-23]. The red branches correspond to group

M global data seen in Fig. 4.2, Panel A.

Discussion

In this study we have demonstrated significant differences between strains

falling into HIV-1 groups M and O in relation to their pattern of clustering in

phylogenetic trees. In contrast to group M the significantly larger SDR

values (Fig. 4.3) produced by the previously proposed [17, 18] group O

clusters (Fig. 4.2, Panel B), defined for the gag, pol and env regions of the

genome, confirm and quantify their weaker structure relative to the distinct

group M subtypes (Fig. 4.2, Panel A). The lack of equidistant group O

clusters or lack of a ‘starburst’ structure is observed in Fig. 4.2, Panel B, and

is quantified by the significantly higher SDV values when compared to group

M. It would seem that the group O phylogeny reflects an epidemiology that

is dependent on host transmissions on a highly localized (endemic) scale

[26]. As a result the branches leading to the group O clusters tend to be

deeper within the trees resulting in shorter branch lengths leading to the

individual clusters. This also results in weaker bootstrap support for the

clusters.

In contrast, the group M global subtypes have emerged during the

course of the complicated epidemiological history arising from the HIV/AIDS

pandemic. The nature of the global subtype emergence is quantified by the

significant difference obtained between the group’s SDR distribution and the

SDR distribution produced from tree topologies generated under a random

model of exponential population growth (Fig. 4.3). The latter represents

phylogenetic topologies that would be expected within an exponentially

expanding endemic viral population where founder effects are absent [10].

The distinctness of the global-group M subtypes is thus a direct result of the

history of the pandemic and the spread of HIV/AIDS across the world. The

absence of strong founder effects followed by relatively isolated

diversification in relation to group O has dramatically reduced the extent of

strain clustering within this group’s phylogeny. As a result the group O

cluster SDR values fall significantly inside the values produced by clusters

defined on randomly generated trees (Fig. 4.3).

In light of these quantifiable differences that exist between the phylogenies

of HIV-1 groups M and O, we propose in agreement with earlier work [14]

that, despite both groups being in existence for a similar length of time, it is

not appropriate to draw direct parallels between the proposed clusters that

exist within the group O phylogeny and the well established global subtypes

that exist within the group M phylogeny. Group O clusters do exist but they

have evolved on a highly localized scale and are less distinct and less

epidemiologically informative than group M subtypes.

In a similar manner to group O, group M phylogenies including sequences

from Central Africa, the epicentre of the pandemic (reviewed in [27]), do not

exhibit the strong effect of strain exportation (founder effects). This has

resulted in the presence of weaker clusters within phylogenetic trees re-

constructed from strains sampled within the region and is observed in Fig.

4.2, Panel C where subtype ancestral nodes fall deep within the tree. This

weaker clustering of strains is quantified by the SDR value calculated from

the tree, which, like the values for group O, falls significantly within the SDR

distribution from the random trees (Fig. 4.3). When global-group M strains

are added to a phylogenetic tree, re-constructed from HIV strains from the

DRC region, they continue to form tight clusters amongst the more diverse

DRC strains (Fig. 4.4). Many of the DRC strains fall on the pre-subtype

branches that previously distinguished the global subtypes and thus the

clear distinction between the subtypes is lost. As a consequence within the

DRC region the current global subtype classification system is not

particularly meaningful.

Fig. 4.5 depicts our proposed model of how the current global group M

subtypes diversified. The importance of different globally isolated host sub-

populations (circled in grey) as well as recombination is emphasized. In the

model after the initial successful cross species transmission event, a huge

diversification occurred within the relatively isolated DRC region resulting in

loosely defined divergent lineages that reflect localized viral epidemiology.

The initial founder effects caused by the random exportation of strains out of

the epicentre into previously unexposed globally isolated host populations,

followed by subsequent rapid diversification within such populations, has

resulted in the very well defined global subtypes with a large genetic

distance between each. Note that at no point is recombination absent,

rather it occurs across a continuum of sequence divergence – between

identical strains, strains of the same subtype and from different subtypes –

with varying degrees of probability related to the sequence identity and other

factors [28-33]. Occasionally recombinants will form epidemiologically

significant clusters, for example the CRFs [2]. These can be the result of

relatively recent recombination such as the CRF10_CD, which is found in

Tanzania or as a result of much older recombinant events that can be traced

back to the early days of the pandemic, for example CRF02_AG, which can

be found in Central Africa. In such a case being a CRF is indistinguishable

from being a subtype, in that they merely represent an exportation event

from the DRC that involved a recombinant that happened to be relatively

more closely related to the strain(s) that formed subtypes A and G.

According to our model the global classification system cannot reflect the

extent of diversity present within the epicentre of the pandemic.

Figure 4.5 Model of HIV-1 Group M Subtype Emergence

Diagrammatic model representing the evolution and diversification of HIV-1 group M. Each

circle represents a defined phylogenetic grouping. The central circle represents the

epicentre of the group M pandemic focused on the DRC region. Note this is merely a

representation as the epicentre will not follow country boundaries. The grey circles

represent random strains that were exported from the epicentre. These strains are linked to

the founder effects in the new host populations (represented by the outer circles) in different

geographic regions. The bottleneck event that occurred during the initial colonization of

these new populations resulted in the apparently long evolutionary branches (dotted lines)

that lead to each of the current group M global subtypes. Inclusion of sequences from

Central African infections (particularly the DRC) blurs this distinction because so many strains

fall on the pre-subtype branches. Within each isolated host population the evolutionary

factors driving diversification will be similar and include the frequent occurrence of viral

mutation, replication recombination [28-33], and positive/diversifying selection [34, 35].

The subtype classification system has been devised through the sampling of

strains globally and presently strains from the DRC region are classified

according to the global subtype that they fall closest to. However the

distinctness of these “globally”-defined “pre-DRC” data subtypes is due to

individual strains being exported from the DRC region. They do not

accurately represent the diversity that is present within the centre of the

pandemic. Classifying strains from the DRC region according to the current

global classification system misrepresents this extensive diversity and is

potentially misleading. For example, a proposal for putative new subtypes

within the DRC region, for strains 83CD003 and 90CD121E12 [23], should

be treated with caution and not labeled as a subtype [22] at this time. Such

divergent strains (Fig. 4.4) are a consequence of the diversity of HIV-1 in

Central Africa and as such are not epidemiologically significant in the same

way as the pandemic-associated clusters. The HIV-1 diversity in the DRC

represents a continuum of genetic variation, with the previous distinctness of

the subtypes an artifact of both (1) biased sampling, and (2) strong founder

events as viruses were exported outside of the DRC region, seeding the

HIV/AIDS pandemic (Fig. 4.5). In this context, the identification of 83CD003

and 90CD121E12 as a “new clade” (and as a potential subtype if a related

strain is identified) is of little consequence. The point is not that these DRC

infections are unimportant but that any lineage, such as the one identified by

Mokili et al. [23], should be recognized as a small part of the overall diversity

present in Central Africa, even when it forms an apparently unique lineage.

Crucially, conclusions concerning global-group M subtypes in relation to

vaccine programs using subtype consensus sequences and geographical

knowledge, cannot be directly transferred to the weakly supported group O

clusters or the poorly defined clusters present in the DRC region. When

choosing strains for the generation of a potential consensus vaccine [19] it

will be important to focus on sequences that are relevant to a particular

geographic region that are circulating at the present time, and not bias the

inference by the inclusion of highly divergent strains. Indeed subtypes given

their historical diversity will often be of limited relevance to the strains that

are likely to be circulating in the future and against which a hypothetical vaccine

will have to elicit immunity. A more predictive approach needs to be taken

if rational vaccine design is to become a reality.

In conclusion, the significance of the HIV-1 group M subtypes has been a

puzzle for many years. Here we present a model that is a definitive

representation of HIV diversification and evolution. Any subtype specific

differences if they exist will have most probably arisen outside of the DRC

and will be the result of passaging of the virus in association with different

risk groups or the result of frequent passaging in association with relatively

large infected populations. When analysing HIV particularly in the case of

vaccine design it is crucial that the consideration of its diversity be grounded

in an explicit evolutionary framework.

Acknowledgements

We are very grateful to Andy Rambaut for his insightful comments and his

advice concerning the generation of random trees. We also wish to thank

Michael Worobey and John Pinney for helpful comments and discussion. JA

is supported by a BBSRC studentship.

References

1. Van Heuverswyn, F., et al., Human immunodeficiency viruses: SIV infection in wild gorillas. Nature, 2006. 444(7116): p. 164.


3. Bibollet-Ruche, F., et al., Complete genome analysis of one of the earliest SIVcpzPtt strains from Gabon (SIVcpzGAB2). AIDS Res Hum Retroviruses, 2004. 20(12): p. 1377-81.

4. Gao, F., et al., Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature, 1999. 397(6718): p. 436-41.

5. Keele, B.F., et al., Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science, 2006. 313(5786): p. 523-6.

6. Yamaguchi, J., et al., Identification of HIV type 1 group N infections in a husband and wife in Cameroon: viral genome sequences provide evidence for horizontal transmission. AIDS Res Hum Retroviruses, 2006. 22(1): p. 83-92.

7. Takebe, Y., S. Kusagawa, and K. Motomura, Molecular epidemiology of HIV: tracking AIDS pandemic. Pediatr Int, 2004. 46(2): p. 236-44.

8. Korber, B., et al., Timing the ancestor of the HIV-1 pandemic strains. Science, 2000. 288(5472): p. 1789-96.



11. Mauclere, P., et al., Serological and virological characterization of HIV-1 group O infection in Cameroon. AIDS, 1997. 11(4): p. 445-53.

12. Yamaguchi, J., et al., HIV infections in northwestern Cameroon: identification of HIV type 1 group O and dual HIV type 1 group M and group O infections. AIDS Res Hum Retroviruses, 2004. 20(9): p. 944-57.

13. Quinones-Mateu, M.E., S.C. Ball, and E.J. Arts, Role of Human Immunodeficiency Virus Type 1 Group O in the AIDS Pandemic. AIDS Rev, 2000. 2: p. 190-202.


15. Lemey, P., et al., The molecular population genetics of HIV-1 group O. Genetics, 2004. 167(3): p. 1059-68.

16. Loussert-Ajaka, I., et al., Variability of human immunodeficiency virus type 1 group O strains isolated from Cameroonian patients living in France. J Virol, 1995. 69(9): p. 5640-9.

17. Yamaguchi, J., et al., Near full-length genomes of 15 HIV type 1 group O isolates. AIDS Res Hum Retroviruses, 2003. 19(11): p. 979-88.

18. Yamaguchi, J., et al., Evaluation of HIV type 1 group O isolates: identification of five phylogenetic clusters. AIDS Res Hum Retroviruses, 2002. 18(4): p. 269-82.

19. Heeney, J.L., A.G. Dalgleish, and R.A. Weiss, Origins of HIV and the evolution of resistance to AIDS. Science, 2006. 313(5786): p. 462-6.

20. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.

21. Kalish, M.L., et al., Recombinant viruses and early global HIV-1 epidemic. Emerg Infect Dis, 2004. 10(7): p. 1227-34.

22. Vidal, N., et al., Distribution of HIV-1 variants in the Democratic Republic of Congo suggests increase of subtype C in Kinshasa between 1997 and 2002. J Acquir Immune Defic Syndr, 2005. 40(4): p. 456-62.

23. Mokili, J.L., et al., Identification of a novel clade of human immunodeficiency virus type 1 in Democratic Republic of Congo. AIDS Res Hum Retroviruses, 2002. 18(11): p. 817-23.

24. Drummond, A. and K. Strimmer, PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 2001. 17(7): p. 662-3.

25. Swofford, D., PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). 2003, Sinauer Associates, Sunderland, Massachusetts.

26. Grenfell, B.T., et al., Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 2004. 303(5656): p. 327-32.

27. Worobey, M., Global HIV/AIDS Medicine. The Origins and Diversification of HIV (2007), ed. M.L. Sande, J. Volberding, P. Greene, WC. 2006: Elsevier, Philadelphia.



30. Pernas, M., et al., A dual superinfection and recombination within HIV-1 subtype B 12 years after primoinfection. J Acquir Immune Defic Syndr, 2006. 42(1): p. 12-8.


32. Shriner, D., et al., Pervasive genomic recombination of HIV-1 in vivo. Genetics, 2004. 167(4): p. 1573-83.


Chapter 5: Prediction of HIV-1 Recombination Breakpoint Location

Abstract

In retroviruses recombination occurs during negative strand DNA synthesis

as a result of switching between genomic RNA templates in a process

known as copy choice. By analyzing 149 previously described recombinant

sequences [1] it was observed that the frequency of breakpoint location was

2.7 times lower than the random expectation within regions of poor

sequence identity. We use this relationship to define a probabilistic model of

in vitro HIV-1 breakpoint prediction. Using previously described parental

sequences, breakpoint locations predicted by the model for the gp120 region

were not significantly different from the in vitro generated breakpoint

locations. The model was then used to generate a probabilistic expectation

for the distribution of breakpoints across the entire HIV-1 genome. It was

observed, with the exception of the regions either side of the envelope gene

and short stretches within the gag and pol genes, that the distribution of

breakpoints did not depart significantly from the distribution obtained from

recombinant strains isolated from individuals during the course of the global

HIV-1 group M pandemic. In genomic regions where the predicted and

observed distributions did not significantly depart from each other, we

propose that a purely probabilistic process is sufficient to explain breakpoint

distribution. Furthermore we propose that the key recombination events in

terms of HIV persistence are those that result in the envelope region being

recombined between strains in a viral population.

Introduction

Owing to a number of biological factors including a high mutation rate [2],

rapid viral turnover [3] and a high recombination rate [4], the diversity seen

between strains of HIV-1 group M (Los Alamos HIV Sequence database) is

astounding when compared to other rapidly evolving viral genomes such as

influenza [5]. Despite this extensive diversity, when represented on a

phylogenetic tree, HIV-1 group M can be divided into well defined clusters.

Nine of these clusters, termed subtypes - labelled A to D, F to H, J and K [6],

are consistent in their phylogenetic topology in relation to each other

regardless of the section of their genome being compared [7]. The

remaining clusters differ in their phylogenetic relationships in a manner that

is dependent on the region of the genome being examined [8]. This is due to

strains falling into these clusters, termed Circulating Recombinant Forms

(CRF), being comprised of mosaic genomes that originated from different

‘parent’ clusters. Such recombinant forms, along with Unique Recombinant

Forms (URFs), arise following dual infection by two (or more) divergent

strains of the virus [9-13] during (-) strand DNA synthesis in a process known

as copy choice recombination [4, 14, 15].

To date there are 34 CRFs published in the Los Alamos HIV Sequence

database. This number is steadily increasing, however, with newly emerging

CRFs frequently being discovered. It has been estimated that up to 15% of

viral genomes within an individual who has not been dually infected have

been altered by recombination events that occurred after the initial exposure

to the virus [16]. From the number of recombinant forms contributing to the

global pandemic at both an intra- and inter- host level it is clear that

recombination contributes a great deal to the phylogeny of HIV-1 group M

and thus is a major factor influencing diversity that must be considered when

developing any future vaccine.

We have previously observed that breakpoint positions were distributed

across the gp120 region non-randomly following copy choice recombination

within a single cycle system [1]. This non random distribution of breakpoints

has also been observed in vivo across the entire genome [17, 18].

Positioning of breakpoints is believed to be influenced by a number of

mechanistic factors including high sequence identity [1, 15], secondary RNA

structure [19, 20] and the location of runs of identical nucleotides referred to

as homopolymeric stretches (HPS’s) [1, 21]. The latter two are believed to

increase the probability of a breakpoint occurring by stalling the reverse

transcriptase complex during DNA synthesis [22-24], which in turn promotes

the induction of copy choice within regions of high sequence identity.

However, not all recombinant strains are viable.

Preliminary findings [25] suggest that in an in vitro multiple cycle system over

75% of HIV-1 recombinants generated in a dual infection are not able to

replicate. Additionally, if copy choice in vivo does result in a recombinant

genome with the ability to replicate, that mosaic genome must survive

selective pressure from the host’s immune response. Understanding this

selective pressure is vital to the understanding of HIV’s evolution in relation

to drug resistance [26, 27], generation of escape mutants [28, 29], disease

progression [30] and diversity. The ability to compare the locations of

breakpoints present within viable recombinant sequences derived from

established strains contributing to the global pandemic, to the mechanistic

probability of the distribution of breakpoints in the absence of natural

selection, would allow an insight into the selection pressures placed on the

HIV genome.

Here we create a probabilistic model that takes into account sequence

identity between two parental sequences in order to create an expected in

vitro distribution of breakpoints. Sequence identity is accounted for by

disallowing breakpoints to be created directly on mismatches and by

reducing the probability of a breakpoint occurring within a window of

constant size anchored to the left of each mismatch. Windows are anchored

on the left hand side of mismatches as reverse transcriptase traverses the

RNA from the 3’ end to the 5’ end during –ve strand DNA synthesis. An

overview of the model is found in figure 5.1 while a detailed description is

presented in the methods section. Breakpoint distributions generated by the

model, using partial gp120 gene sequences from subtypes A and D as

parentals, are compared to the in vitro breakpoint distributions that were

previously generated [1]. No significant difference between the model’s

predicted distribution and the in vitro distribution was found, confirming that

sequence identity is the major influencing factor in relation to the positioning

of breakpoints.

Known breakpoint locations occurring within full length HIV-1 group M

sequences that have been sampled from the group M pandemic were then

compared to the distributions produced by the model using different

combinations of pairs of HIV-1 group M subtype reference strains as

parental sequences. Across much of the HIV genome, it was observed that

the model predicted distributions did not differ significantly from the in vivo

breakpoint distribution frequencies. In the regions where the distributions did

differ, two areas were identified on either side of the envelope gene where

more than the expected numbers of breakpoints were present within the

global data, suggesting a positive selection for breakpoints within these

regions.

Figure 5.1 Prediction of Recombinant Breakpoints

As the RT complex moves from the 3’ end of the donor RNA to the 5’ end it reverse

transcribes the RNA sequence. The RNAse activity of the RT complex is indicated by the

light grey nucleotides on the donor RNA. The nascent –ve DNA strand can be observed

tailing the RT complex. The acceptor RNA strand is aligned alongside the donor strand and

ready for a potential crossover event (known as copy choice) – indicated by the dotted

arrowed line. The dotted boxes indicate windows, of decreased probability of crossover,

that have been anchored on the mismatch at the 3’ end of the potential breakpoint zone.

The probability at each base on the acceptor strand of a crossover occurring is indicated by

p1-3. p1 is the probability of a crossover occurring on a mismatch, p2 is the probability of a

crossover occurring on a base within a window and p3 is the probability of a crossover

occurring away from a mismatch and outside of a window. p1 is set to 0. p2 and p3 depend

on the relationship between the frequency of breakpoints occurring within breakpoint zones

of size 5 or less and the frequency of breakpoints occurring within breakpoint zones greater

than size 5. The plot along the bottom is a representation of each of the probability values

along the sequence. For this stretch of 16 nucleotides the total probability of a crossover

occurring is given by the equation shown.

Results

Model Parameters: In figure 5.2, the normalized distribution of breakpoints

falling into breakpoint zones ranging in size from 0 to 25 is represented

(vertical grey bars). A breakpoint zone is the identical region within an

alignment between any two consecutive mismatches. It can be observed

that as the identity between parental sequences increases (larger zone

sizes) so too does the frequency of breakpoint occurrence. As the in vitro

data was limited to 149 recombinant sequences, there are zones where

breakpoints are under represented – most notably zones of size 12, 16, 22

and 25. In zones of size 19 no in vitro breakpoints were observed. The

expected (random) distribution of breakpoints within different zone sizes has

been added to the plot (black horizontal bars). In zone sizes 3, 4 and 5 the

in vitro breakpoint frequencies fall outside 1.645 standard deviations (90%)

of the expected distribution implying, in agreement with [1], that the lack of

sequence identity around these small zones is reducing the occurrence of

breakpoints. The first significant increase in the in vitro breakpoint

frequencies occurs between zones of size 5 and 6. From zone size 6 and

upwards, where there is more identity present between the parental

sequences the majority of in vitro breakpoint frequencies fall within or above

the expected frequencies. The exceptions to this are zones of size 10, 16,

19, 22 and 25 where the lack of data most probably disrupts the general

trend. No breakpoints were observed directly on mismatches in the in vitro

data.
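The zone definition used above can be made concrete with a short sketch. The Python below is illustrative only and is not the code used in this study: the function names are hypothetical, and the choice to count only zones bounded by a mismatch on both sides is an assumption made for the example.

```python
from collections import Counter


def breakpoint_zones(seq_a, seq_b):
    """Return (start, size) for each identical run between consecutive mismatches.

    seq_a and seq_b are two aligned parental sequences of equal length.
    Adjacent mismatches give a zone of size 0.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Alignment columns where the two parental sequences differ
    mismatches = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    zones = []
    # Assumption: only runs flanked by a mismatch on both sides count as zones
    for left, right in zip(mismatches, mismatches[1:]):
        zones.append((left + 1, right - left - 1))
    return zones


def zone_size_counts(seq_a, seq_b):
    """Tally how many zones of each size occur in the alignment."""
    return Counter(size for _, size in breakpoint_zones(seq_a, seq_b))
```

Dividing the number of observed breakpoints in zones of a given size by the total number of sites in zones of that size would give a per nucleotide frequency of the kind compared in the inset of Fig. 5.2.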

When the per nucleotide frequency of both the in vitro breakpoints and the

random expectations are organized into groups of size 5 and compared (Fig.

5.2 inset), the significant decrease in the frequency of in vitro breakpoints

(vertical bars) within zones of smaller size (1 - 5) in relation to the random

expectations is confirmed. For the next two groups representing zone sizes

from 6 to 15 no significant difference exists between the in vitro breakpoints

and the random expectations. In group 16 to 20 the in vitro breakpoints are

significantly below the random expectations but here this is as a result of the

sparsity of the in vitro data within these zones. In the last group the in vitro

breakpoint frequencies are significantly higher than the expected breakpoint

frequencies. This is possibly due to a combination of factors including

increased sequence identity, HPS positioning (the larger the zone size the

more chance of HPS’s being present), secondary RNA structure and the

lower frequency of in vitro breakpoints within smaller zones.

Since the first significant increase in frequency occurs between zone sizes of

5 and 6 the window size parameter for our model is set to 5. To estimate the

reduction in the probability of a breakpoint occurring within windows of size 5 a

line of best fit was drawn through the in vitro breakpoint frequency data

falling within zones of size 5 or less. A second line of best fit was drawn

through the in vitro frequency data falling within zones of greater than size 5.

The ratio between the slopes of the two lines indicated that the gradient for

zones of size 5 or less was 0.37 times that for zones of size greater than 5

(Appendix III). This ratio was used to estimate the reduced

probability of a breakpoint occurring within windows of size 5.
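One way in which these two parameters could be turned into per-site crossover probabilities is sketched below. This is a minimal illustration and not the implementation used here. It assumes that p2 = 0.37 x p3, that the probabilities are normalized to sum to 1 across the alignment, and that the left of a mismatch corresponds to lower alignment indices; the function name is hypothetical.

```python
WINDOW = 5           # window size chosen from the in vitro data
WINDOW_RATIO = 0.37  # slope ratio reported above, used here as p2 / p3


def crossover_probabilities(seq_a, seq_b, window=WINDOW, ratio=WINDOW_RATIO):
    """Return one crossover probability per alignment column.

    Mismatch columns get probability 0 (p1), columns inside a window to the
    left of a mismatch get a reduced weight (p2), all others get full weight
    (p3). Weights are then normalized to sum to 1 (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatch = [a != b for a, b in zip(seq_a, seq_b)]
    weights = [1.0] * len(mismatch)  # relative weight for p3
    # Reduce the weight inside the window to the left of each mismatch
    for i, is_mismatch in enumerate(mismatch):
        if is_mismatch:
            for j in range(max(0, i - window), i):
                weights[j] = ratio
    # Mismatch columns themselves can never host a breakpoint (p1 = 0)
    for i, is_mismatch in enumerate(mismatch):
        if is_mismatch:
            weights[i] = 0.0
    total = sum(weights)
    if total == 0:
        raise ValueError("no column can host a breakpoint")
    return [w / total for w in weights]
```

Predicted breakpoint positions could then be drawn from this distribution repeatedly and binned along the sequence for comparison with observed frequencies.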

Figure 5.2 Effects of Sequence Identity on Breakpoint Location

The main plot displays the normalized distribution of in vitro breakpoints falling within zones

ranging from size 0 to 25 (grey vertical bars). The horizontal bars indicate the expected

random distribution of breakpoints within each individual zone. The inset plot shows the per

nucleotide frequency of both the in vitro breakpoints and randomly generated breakpoints

for zones up to size 25 (arranged in groups of 5).

In Vitro Breakpoint Predictions: Figure 5.3a compares the in vitro generated

breakpoint distributions (vertical grey bars) to the random expected

distribution (horizontal black bars). The random probabilities display a

flat distribution across the lengths of the parental sequences. A chi squared

test indicates that there is a significant difference (p<0.001) between the

randomly expected distributions and the in vitro distributions confirming that

the in vitro breakpoints are not randomly positioned. In figure 5.3b the

probability distributions obtained when taking only mismatches into account

(i.e. the probability of a breakpoint occurring on a mismatch is 0) are

shown. The in vitro distributions can be observed to fall within 1.645

standard deviations for 5 out of the 10 predicted frequencies. However

statistically there is still a significant difference between the expected

frequencies within each grouping and the in vitro frequencies (p<0.01). In

figure 5.3c, where the probability distributions produced by the full model are

displayed, 8 out of the 10 real breakpoint frequencies fall within 1.645

standard deviations from the predicted values. There is no significant

difference (p>0.05) between the expected frequencies and the in vitro

frequencies. This suggests that our model works well for predicting in vitro

breakpoint positions. In the regions (200 – 300 and 800 – 900) where the in

vitro data is just within reach of the model predictions other factors

influencing breakpoint positioning that our model does not take into account

may have a stronger influence.
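
The two comparisons used above (a chi-squared test on grouped frequencies, and a check of whether each group lies within 1.645 standard deviations of its prediction) can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 5.3: the counts are hypothetical, and the standard deviation is assumed here to be the binomial one, which the text does not specify.

```python
import math

def chi_square_statistic(observed, expected):
    """Pearson chi-squared statistic for binned counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def within_band(observed, expected, total, z=1.645):
    """Flag groups whose observed count lies within z standard deviations
    of the expected count (binomial SD is an assumption of this sketch)."""
    flags = []
    for o, e in zip(observed, expected):
        p = e / total
        sd = math.sqrt(total * p * (1.0 - p))
        flags.append(abs(o - e) <= z * sd)
    return flags

# Hypothetical counts for ten groups of 100 sites (not data from this study).
observed = [13, 30, 25, 8, 20, 15, 22, 18, 14, 10]
total = sum(observed)
expected = [total / 10.0] * 10  # flat (random) expectation

stat = chi_square_statistic(observed, expected)
flags = within_band(observed, expected, total)

# 16.919 is the standard 5% critical value of chi-squared with 9 degrees
# of freedom (ten groups minus one, assumed here).
print(f"chi-squared = {stat:.3f}, significant at 5%: {stat > 16.919}")
print(f"groups within 1.645 SD: {sum(flags)} of {len(flags)}")
```

Replacing the flat expectation with the grouped predictions of another model gives the corresponding comparison for that model.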


Figure 5.3 Predicted Breakpoint Distributions for gp120

In panel A the predicted locations of random breakpoints are displayed as horizontal black

bars. The in vitro distributions are represented by the vertical grey bars. A significant

difference (p<0.001) existed between the in vitro breakpoints and the randomly

generated ones. In panel B the predicted distributions obtained by the simulation taking

only mismatch locations into account are represented by the horizontal black bars. Once

again a significant difference (p<0.01) existed between the predicted frequencies and the in

vitro data. The horizontal black bars in panel C represent the predicted frequency values

produced when using the full model. No significant difference existed between these values

and the observed frequencies.

Global (in vivo) Breakpoints: In figure 5.4, model predicted breakpoint

distributions for the entire HIV-1 genome are compared to the known in vivo

global breakpoint distribution frequencies. Because of the limited number

of persistent global breakpoints (324) available, as well as the length of the

HIV genome (approx. 10,000 bp before gap stripping), the data is arranged

into windows of size 400. The relative region of the genome can be seen

along the x-axis. It can be observed that with the exception of short regions

within gag and pol genes as well as within env all the global frequencies fall

within 1.645 standard deviations from the predicted values. The global

breakpoint frequencies at the end of the gag and pol genes as well as within

the centre of the env gene are significantly below the model predictions

indicating that there is a suppression of breakpoint persistence within these

areas of the genome. At both the start and end of the env gene the global

breakpoint frequencies are above the model predicted frequencies,

suggesting an increased rate of breakpoint persistence.

Figure 5.4 Predicted Breakpoint Distributions for the Entire Genome

Model predicted breakpoint locations using full length HIV-1 genomic sequences, as

described in the materials and methods section, are displayed by the horizontal black bars.

The distribution of in vivo breakpoints (as published in [17]) is represented by the vertical

grey bars. Dark grey indicates where the global data falls significantly below the predicted

distribution for the region in question, light grey indicates where the global data falls within

the predicted distributions while white indicates where the global data falls above the model

predicted values. The frequency data has been divided up into window sizes of 400. Along

the x-axis the various genomic regions of the HIV-1 genome are displayed.

Discussion

Previously we observed that the in vitro mechanics of HIV-1 recombination

appear to be determined by quantifiable mechanical characteristics of the

parental sequences involved [1]. Here we demonstrate that sequence

identity is a major influencing factor on in vitro breakpoint positioning. We

use this to define a probabilistic model that can accurately predict the

location of in vitro recombinant breakpoints. From our initial data (Fig. 5.2)

we calculated that within breakpoint zones of size 5 nucleotides or less there

were 2.7 times fewer breakpoints than within all other zone sizes. Using this

ratio within our model, frequency predictions were obtained that show no

significant differences (p>0.05) to in vitro generated breakpoint distributions

across the partial gp120 genomic region (Fig. 5.3c). These predictions could

be further improved by incorporating other known influences on breakpoint

positioning, such as RNA secondary structure and HPS location, into the

mechanics of the model presented. This is a viable option for future

research as, like that of sequence identity, the influence of RNA secondary

structure and HPS’s on the positioning of breakpoints in vitro is mechanistic

[31].

Our model works well for predicting the locations of in vitro recombinant

breakpoints. However it cannot be directly applied to the group M global

pandemic. This is because, within an individual host, a newly generated


mosaic genome will not necessarily be viable [25]. Natural selection will

further result in many recombinant sequences not being able to contribute to

the pandemic in any significant way thus influencing the persistence of

observed in vivo breakpoints. The production of ‘fit’ recombinant sequences

in vivo goes beyond simple mechanical factors determined by the parental

sequences. For both in vitro breakpoints and model predictions, such

‘fitness’ is not tested. For our probabilistic model to be applied to the global

pandemic, regions where natural selection has a strong influence on

breakpoint persistence must be identified and accounted for.

Conversely our model can be very useful in identifying such regions. With a

plausible method of predicting recombinant breakpoints in the absence of

natural selection, a comparison can be made with breakpoints present within

the globally sampled data. Areas that significantly deviate from the model

predicted breakpoint distributions are potentially those where the effects of

selection pressures due to the host immune response are strongest. This

can be observed in figure 5.4, a representation of both real and predicted

breakpoints across the entire HIV-1 genome where, with the exception of the

env region and a short stretch in each of the gag and pol regions, all the

group M in vivo breakpoint location frequencies fall inside 1.645 standard

deviations from the predicted breakpoint frequencies.

The exceptions, with particular emphasis on the env gene, fall within regions

known to be under high selective pressure [32]. By comparison to our model

predictions it can be observed that on either side of the env gene there is a

severe under-prediction of in vivo breakpoints. This indicates a tendency for

the in vivo shuttling of the entire envelope region as an intact unit between

strains. Whether this is a result of coincidental or sequential recombination

events, it indicates that selection is promoting env’s transfer from one

genetic background into another effectively as an integral cassette. This

must be directly related to the envelope protein’s functional significance in


relation to viral fitness determinants, its propensity to be subject to high

levels of positive selection, and the importance of the action of the immune

response on HIV’s envelope gene [33, 34].

The paucity of recombination breakpoints within the envelope gene itself (but

also in parts of gag and pol) indicates that selection is suppressing

recombination within these regions, which is presumably due to inter-

dependencies within gene regions, for example, for the maintenance of

structural and functional integrity in the context of high viral diversity. Within

the envelope region, such inter-dependencies could be partially related to

the involvement in coreceptor binding of the V1, V2, V3 and C4 regions [35,

36].

That the frequency of recombination in other genomic regions does

not depart significantly from the model predicted expectation (based simply

on the sequence identity within the parental sequences) indicates that purely

mechanistic processes are sufficient to explain breakpoint locations within

these regions. More importantly, in such regions the role of recombination in

relation to immune evasion seems to be limited. Our results emphasise that

detailed mapping of individual recombinant structures, while important,

should be considered in the context of a probabilistic expectation generated

by purely mechanistic processes. The recombination breakpoints that are of

most importance to understanding the HIV/AIDS pandemic are those that

are over-represented against this expectation.

Here we have presented how sequence identity can be used to create a

model that accurately simulates recombination in the absence of natural

selection. We have discussed the usefulness of such a model and how it

needs to be modified before it can be directly applied to the group M

pandemic. These modifications are very tractable and the result would be a

model that could robustly predict the dynamic future of the emerging group

M recombinant strains. This would have far-reaching implications for

future prophylactic strategies aimed at anticipating and preventing the

continuous emergence of virulent clusters.

Methods

In Vitro Recombinant Breakpoints

The in vitro recombinant gp120 sequences used are described in [1]. In

summary five parental sequences were used. Two belonged to subtype A

(115A and 120A), while the other three belonged to subtype D (89D, 122D

and 126D). From these isolates the seven pairs of parentals were:

126D/120A, 115A/126D, 115A/89D, 120A/89D, 89D/120A, 126D/122D and

115A/120A. The number of recombinant sequences generated for each

pair, using a single cycle infection system, was 39, 23, 25, 21, 22, 23 and 22

respectively.

In Vitro Breakpoint Distributions

Location Distributions: The frequency of breakpoints occurring at different

locations across the recombinant sequences was previously obtained [1].

For each of the parental pairs these location frequencies were organized into

groups of 100 (e.g. 1 – 100, 101 – 200 etc). Corresponding locations were

pooled across each of the parental pairs.

Distribution within Breakpoint Zones of Different Sizes: The data from each

parental pair was pooled together and the frequency of breakpoints falling

within breakpoint zones of particular sizes was calculated. Normalization

was done for a zone of particular size by dividing the frequency of breakpoint

occurrence by the total number of potential breakpoint zones of that size.

The maximum zone size used was 25 as for larger zone sizes the limited

number and length of sequences meant that data was sparse. The per


nucleotide frequency of breakpoints occurring within zones of each size up

to size 25 was calculated.

Predicted Recombinant Breakpoints

Three independent methods were used to generate predicted breakpoint

frequencies for the parentals described. These were (i) random breakpoint

prediction, (ii) breakpoint prediction based on mismatch locations only and

(iii) breakpoint prediction using our full model.

Random breakpoint prediction: The probability of creating a breakpoint, pb,

on any site within the parental alignment is given by

$$ p_b = \frac{1}{n}, \qquad (1) $$

where n is the length of the alignment.

Breakpoint Prediction Based on Mismatch Locations: To take sequence

identity into account, the probability of a breakpoint being located on a

mismatch is reduced to zero. The probability of creating a breakpoint on any

site that is not a mismatch, pb', becomes

$$ p_b' = \frac{1}{n - m}, \qquad (2) $$

where m is the number of mismatches in the alignment.
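
Equations (1) and (2) can be illustrated with a short sketch. The two ten-nucleotide sequences are made up for the example, and the function names are hypothetical rather than taken from any software used in this work.

```python
def mismatch_sites(seq_a, seq_b):
    """Return a list of booleans, True where the aligned parents differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("parental sequences must be aligned to equal length")
    return [a != b for a, b in zip(seq_a, seq_b)]

def random_probabilities(n):
    """Equation (1): every site has probability 1/n."""
    return [1.0 / n] * n

def mismatch_only_probabilities(is_mismatch):
    """Equation (2): 0 on a mismatch, 1/(n - m) on every other site."""
    n = len(is_mismatch)
    m = sum(is_mismatch)
    if n == m:
        raise ValueError("alignment contains no identical sites")
    return [0.0 if mm else 1.0 / (n - m) for mm in is_mismatch]

# Made-up aligned parental fragments with mismatches at indices 3 and 8.
parent_a = "ACGTACGTAC"
parent_b = "ACGAACGTTC"

mism = mismatch_sites(parent_a, parent_b)
p_random = random_probabilities(len(mism))
p_mismatch = mismatch_only_probabilities(mism)
print(p_random[0], p_mismatch[0], p_mismatch[3])
```

In both cases the per-site probabilities sum to 1 across the alignment, which is the property the full model also preserves.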

Breakpoint prediction based on full model: In the full model (Fig. 5.1) there

are three different categories of site. These are: (i) sites located on

mismatches, (ii) sites located within windows of size five nucleotides

downstream of a mismatch and (iii) sites located neither on a mismatch nor

within a window. At each type of site, the probability of a breakpoint

occurring is given by p1, p2 or p3 respectively. Across the full alignment, the

sum of probabilities over all sites is 1. The model can therefore be

summarized as

$$ m p_1 + w p_2 + (n - (m + w)) p_3 = 1, \qquad (3) $$

where w is the number of nucleotides falling within a window. Since

breakpoints are not allowed on mismatches, p1 is set to zero. We further

define the ratio

$$ \alpha = \frac{p_2}{p_3} \qquad (4) $$

to represent the factor by which the probability of recombination is reduced

within a window. From equation (3), the model parameters can therefore be

expressed as

$$ p_2 = \frac{\alpha}{n - m - w(1 - \alpha)} \qquad (5) $$

and

$$ p_3 = \frac{1}{n - m - w(1 - \alpha)}. \qquad (6) $$
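
The full model can be sketched as below. Setting p1 = 0 and p2 = α·p3 in equation (3) gives p3 = 1/(n - m - w(1 - α)) and p2 = α·p3, which is what the code evaluates. Two points are assumptions of this sketch rather than statements from the text: "downstream" is taken to mean increasing alignment position, and a site that is itself a mismatch is never counted as a window site. The function names are hypothetical.

```python
def site_categories(is_mismatch, window=5):
    """Label each site 'mismatch', 'window' or 'free'."""
    n = len(is_mismatch)
    cats = ["free"] * n
    # Mark the `window` sites following each mismatch.
    for i, mm in enumerate(is_mismatch):
        if mm:
            for j in range(i + 1, min(i + 1 + window, n)):
                cats[j] = "window"
    # Mismatches take precedence over window membership.
    for i, mm in enumerate(is_mismatch):
        if mm:
            cats[i] = "mismatch"
    return cats

def full_model_probabilities(is_mismatch, alpha=0.37, window=5):
    """Per-site breakpoint probabilities: p1 = 0, p2 and p3 from the model."""
    cats = site_categories(is_mismatch, window)
    n = len(cats)
    m = cats.count("mismatch")
    w = cats.count("window")
    denom = n - m - w * (1.0 - alpha)
    p3 = 1.0 / denom    # sites outside any window
    p2 = alpha / denom  # sites inside a window
    lookup = {"mismatch": 0.0, "window": p2, "free": p3}
    return [lookup[c] for c in cats]

# Made-up alignment of 20 sites with mismatches at indices 2 and 12.
is_mismatch = [i in (2, 12) for i in range(20)]
probs = full_model_probabilities(is_mismatch)
print(round(sum(probs), 6))
```

For this toy alignment n = 20, m = 2 and w = 10, so the denominator is 11.7 and the probabilities sum to 1 as required by equation (3).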

To estimate the value of α, a line of best fit was drawn through the

normalised in vitro breakpoint frequency data for zones of size 5 or less. A

second line of best fit was drawn through the data for zones of greater than

size 5. Since the gradient of such a line corresponds to the average

recombination frequency associated with a single nucleotide falling in a

specific category (window / non-window), the ratio of the gradients can be

used to give a value of α = 0.37 (Appendix III).
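
A minimal sketch of the slope-ratio estimate follows. The data points are synthetic and were constructed so that the ratio reproduces the reported value; they are not the frequencies plotted in Figure 5.2. An ordinary least-squares fit with an intercept is assumed, since the fitting procedure is not specified here.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic normalised frequencies (illustrative only).
small_sizes = [1, 2, 3, 4, 5]
small_freq = [0.37 * s for s in small_sizes]
large_sizes = [6, 10, 15, 20, 25]
large_freq = [1.0 * s - 3.0 for s in large_sizes]

# alpha is the gradient for zones of size 5 or less divided by the
# gradient for zones of size greater than 5.
alpha = ols_slope(small_sizes, small_freq) / ols_slope(large_sizes, large_freq)
print(round(alpha, 2))
```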

Random Breakpoint Distributions

For each site in a parental alignment the random probability of a breakpoint

occurring was given by equation (1). The parentals used were the same as

those described for the in vitro data.

Location Distributions: For each parental pair sites were organized into

groups of 100 for which the individual probabilities were summed (e.g. 1 –

100, 101 – 200 etc). These summed probabilities were weighted according

to the number of in vitro breakpoints that were present for the same parental


pair. Corresponding groups across each of the parentals were then summed

to give the expected distribution.
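
The grouping, weighting and pooling steps described above can be sketched as follows. The per-site probabilities are flat and the alignments are shortened to 300 sites for brevity. The weights 39 and 23 are the recombinant counts reported for the first two parental pairs, used here only as stand-ins for the number of breakpoints per pair. The function names are hypothetical.

```python
def bin_probabilities(site_probs, group_size=100):
    """Sum per-site probabilities within consecutive groups of sites."""
    return [sum(site_probs[i:i + group_size])
            for i in range(0, len(site_probs), group_size)]

def expected_distribution(per_pair_site_probs, per_pair_breakpoints,
                          group_size=100):
    """Weight each pair's grouped probabilities by its breakpoint count,
    then sum corresponding groups across pairs."""
    pooled = []
    for probs, count in zip(per_pair_site_probs, per_pair_breakpoints):
        for k, value in enumerate(bin_probabilities(probs, group_size)):
            if k == len(pooled):
                pooled.append(0.0)
            pooled[k] += value * count
    return pooled

# Two toy parental pairs, each with a flat distribution over 300 sites.
pair_probs = [[1.0 / 300] * 300, [1.0 / 300] * 300]
pair_counts = [39, 23]
expected = expected_distribution(pair_probs, pair_counts)
print([round(e, 2) for e in expected])
```

Because each pair's probabilities sum to 1, the pooled expectation sums to the total number of breakpoints, so it can be compared directly with the observed counts per group.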

Distribution within Breakpoint Zones of Different Sizes: The total number of

sites falling into breakpoint zones of a particular size was calculated by

multiplying the size by the total number of occurrences of the zone. The

random probability of a breakpoint falling on a site within any zone was then

calculated by dividing 1 by the total number of sites within all potential zones.

For potential zones of a particular size the expected probability of a

breakpoint occurring was calculated by multiplying the probability of a

breakpoint falling on a site within any zone by the length of the potential

zone. The expected probability was then weighted accordingly by

multiplying by the total number of in vitro breakpoints observed within that

zone size.
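
The steps in this paragraph can be followed literally in a short sketch. It assumes that a breakpoint zone is the run of identical sites between two consecutive mismatches, so that adjacent mismatches give a zone of size 0, and it ignores the stretches before the first and after the last mismatch. Both choices are assumptions of the sketch, as are the function names and the breakpoint counts.

```python
from collections import Counter

def zone_sizes(is_mismatch):
    """Sizes of the runs of identical sites between consecutive mismatches."""
    positions = [i for i, mm in enumerate(is_mismatch) if mm]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def expected_zone_probabilities(sizes):
    """Random probability of a breakpoint falling in one zone of each size."""
    occurrences = Counter(sizes)
    # Total sites = size multiplied by the number of zones of that size.
    total_sites = sum(size * count for size, count in occurrences.items())
    if total_sites == 0:
        raise ValueError("no sites fall within any zone")
    per_site = 1.0 / total_sites
    return {size: size * per_site for size in sorted(occurrences)}

def weight_by_breakpoints(zone_probs, breakpoints_per_size):
    """Multiply each probability by the breakpoints observed at that size."""
    return {size: p * breakpoints_per_size.get(size, 0)
            for size, p in zone_probs.items()}

# Toy alignment of 14 sites with mismatches at the listed indices.
mismatch_positions = {0, 3, 4, 10, 13}
is_mismatch = [i in mismatch_positions for i in range(14)]

sizes = zone_sizes(is_mismatch)
zone_probs = expected_zone_probabilities(sizes)
weighted = weight_by_breakpoints(zone_probs, {2: 3, 5: 4})  # hypothetical counts
print(sizes, zone_probs)
```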

Breakpoint Distributions incorporating Mismatch Influence

This is similar to the “Location Distributions” from the previous section except

that the probability of a breakpoint occurring on a mismatch was 0. The

probabilities for the remaining sites were calculated using equation (2).

Model Breakpoint distributions

As above, the probability of a breakpoint occurring on a mismatch

was 0. Here however equations (5) and (6) were used to calculate the

probabilities across the remaining sites.

Global CRFs

In Vivo Breakpoint distributions: 324 breakpoints from circulating CRFs, over

the full length of the HIV-1 genome, with known locations and parental

subtypes [17] were used to obtain the distribution of breakpoints across the

genome in groups of 400.


Global CRF Model Expectations: The probabilities of breakpoints occurring

at individual sites were calculated from equations (5) and (6). The

probability of a breakpoint occurring on a site where there was a mismatch

between the two parental sequences was 0. The parentals used were the

global group M reference strains that represented the various subtypes seen

to be present within the CRFs [17]. The accession numbers of the parental

strains used were – AF069670 (subtype A), K03455 (subtype B), AF067155

(subtype C), U88824 (subtype D), AF005494 (subtype F), AF061641

(subtype G), AF190128 (subtype H), AF082394 (subtype J), and AJ249235

(subtype K). For each parental pair sites were organized into groups of 400

for which the individual probabilities were summed (e.g. 200 – 600, 601 –

1001 etc). These summed probabilities were weighted according to the

number of in vivo breakpoints that were observed for the same parental pair.

These numbers were: GH (5), HJ (1), JG (10), GK (6), AB (2), DF (7), BF

(66), JA (8), AG (35), DA (83), DC (19), CG (5), CA (50), AK (4), FK (7), CB

(14) and GB (2). Corresponding groups across each of the parentals were

then summed to give the expected distribution that could be directly

compared to the in vivo data.


References


2. Malim, M.H. and M. Emerman, HIV-1 sequence variation: drift, shift, and attenuation. Cell, 2001. 104(4): p. 469-72.







9. Boisier, P., et al., Nationwide HIV prevalence survey in general population in Niger. Trop Med Int Health, 2004. 9(11): p. 1161-6.

10. Diaz, R.S., et al., Dual human immunodeficiency virus type 1 infection and recombination in a dually exposed transfusion recipient. The Transfusion Safety Study Group. J Virol, 1995. 69(6): p. 3273-81.


12. Salminen, M.O., et al., Evolution and probable transmission of intersubtype recombinant human immunodeficiency virus type 1 in a Zambian couple. J Virol, 1997. 71(4): p. 2647-55.


13. Yerly, S., et al., HIV-1 co/super-infection in intravenous drug users. Aids, 2004. 18(10): p. 1413-21.

14. Yu, H., et al., The nature of human immunodeficiency virus type 1 strand transfers. J Biol Chem, 1998. 273(43): p. 28384-91.

15. Zhang, J. and H.M. Temin, Retrovirus recombination depends on the length of sequence identity and is not error prone. J Virol, 1994. 68(4): p. 2409-14.


17. Fan, J., M. Negroni, and D.L. Robertson, The distribution of HIV-1 recombination breakpoints. Infect Genet Evol, 2007. 7(6): p. 717-23.

18. Magiorkinis, G., et al., In vivo characteristics of human immunodeficiency virus type 1 intersubtype recombination: determination of hot spots and correlation with sequence similarity. J Gen Virol, 2003. 84(Pt 10): p. 2715-22.

19. Galetto, R., et al., Dissection of a circumscribed recombination hot spot in HIV-1 after a single infectious cycle. J Biol Chem, 2006. 281(5): p. 2711-20.

20. Moumen, A., et al., The HIV-1 repeated sequence R as a robust hot-spot for copy-choice recombination. Nucleic Acids Res, 2001. 29(18): p. 3814-21.

21. Klarmann, G.J., C.A. Schauber, and B.D. Preston, Template-directed pausing of DNA synthesis by HIV-1 reverse transcriptase during polymerization of HIV-1 sequences in vitro. J Biol Chem, 1993. 268(13): p. 9793-802.

22. Derebail, S.S. and J.J. DeStefano, Mechanistic analysis of pause site-dependent and -independent recombinogenic strand transfer from structurally diverse regions of the HIV genome. J Biol Chem, 2004. 279(46): p. 47446-54.


23. Lanciault, C. and J.J. Champoux, Pausing during reverse transcription increases the rate of retroviral recombination. J Virol, 2006. 80(5): p. 2483-94.

24. Roda, R.H., et al., Strand transfer occurs in retroviruses by a pause-initiated two-step mechanism. J Biol Chem, 2002. 277(49): p. 46900-11.

25. Baird, H.A., et al., Influence of sequence identity and unique breakpoints on the frequency of intersubtype HIV-1 recombination. Retrovirology, 2006. 3: p. 91.

26. Margot, N.A., J.M. Waters, and M.D. Miller, In vitro human immunodeficiency virus type 1 resistance selections with combinations of tenofovir and emtricitabine or abacavir and lamivudine. Antimicrob Agents Chemother, 2006. 50(12): p. 4087-95.

27. Perno, C.F., V. Svicher, and F. Ceccherini-Silberstein, Novel drug resistance mutations in HIV: recognition and clinical relevance. AIDS Rev, 2006. 8(4): p. 179-90.

28. Guillon, C., et al., Evidence for CTL-mediated selection of Tat and Rev mutants after the onset of the asymptomatic period during HIV type 1 infection. AIDS Res Hum Retroviruses, 2006. 22(12): p. 1283-92.

29. Pillay, T., et al., Unique acquisition of cytotoxic T-lymphocyte escape mutants in infant human immunodeficiency virus type 1 infection. J Virol, 2005. 79(18): p. 12100-5.


31. Konstantinova, P., et al., Hairpin-induced tRNA-mediated (HITME) recombination in HIV-1. Nucleic Acids Res, 2006. 34(8): p. 2206-18.

32. Seibert, S.A., et al., Natural selection on the gag, pol, and env genes of human immunodeficiency virus 1 (HIV-1). Mol Biol Evol, 1995. 12(5): p. 803-13.



34. Marozsan, A.J., et al., Differences in the fitness of two diverse wild-type human immunodeficiency virus type 1 isolates are related to the efficiency of cell binding and entry. J Virol, 2005. 79(11): p. 7121-34.


36. Hoffman, N.G., et al., Variability in the human immunodeficiency virus type 1 gp120 Env protein linked to phenotype-associated changes in the V3 loop. J Virol, 2002. 76(8): p. 3852-64.


Chapter 6: A Strategy for Identifying Significant HIV Sequence Diversity Using Structural Constraints

Abstract

We present a strategy for identifying meaningful sequence diversity to aid

HIV vaccine design. The approach is based on the identification of likely

amino acid replacements in the context of physical constraints imposed by

three-dimensional protein structure. We apply our model to HIV-1 Gag and

demonstrate that prediction of amino acids present in 1,100 group M

sequences is statistically significant. The structure-based model is, thus,

accurately identifying the important diversity at specific sites. Our approach

has a tendency to under-predict residues, mainly due to the presence of

improbable amino acids in sequence data; residues in the observed data

that are of limited relevance since they are either present in non-viable

viruses or result from sequencing artifacts. These residues can be identified

because they lack shape complementarity with the rest of the protein. Were

the virus to synthesize proteins with these replacements, an unfolded non-

functional protein would likely result. These improbable replacements can,

thus, be considered functionally and evolutionarily irrelevant and so do not

need to be considered in vaccine design. To assess the effect of removing

these unlikely residues we use a novel algorithm to quantify the number of

optimised sequence-constructs required to “cover” observed HIV-1 P17

diversity. Applying this strategy to two large HIV-1 data sets (>1,000

sequences) we achieve a marked improvement in mean coverage estimates.

For example, the number of constructs required for 90% coverage of P17

alignments decreases from 133 to 71 for group M and for subtype B from 60

to 30. Thus, removal of improbable amino acids is a rational approach to

reducing the diversity present in HIV sequence alignments. Incorporation of

further functional and immunological constraints will permit prediction of viral

evolution that is both significant to the immune response and increasingly

accurate.


Introduction

The greatest impediment to producing an HIV vaccine is the high rate of viral

evolution. This is a consequence of HIV’s high rate of mutation [1],

propensity to recombine [2, 3], high rate of viral turnover [4, 5] and the

actions of the immune response promoting positive selection [6-8]. The

ability of HIV to change rapidly leads to high levels of sequence diversity

both within and between infected individuals, enabling a persistent infection

to be maintained despite the actions of the immune response [9-11]. To be

effective in such a situation, a preventative vaccine needs to be protective in

the face of the many divergent viral variants present in the infected human

population. Additionally a therapeutic vaccine needs to protect against the

generation of new variants in the context of one individual’s on-going

infection. For this reason extreme sequence diversity severely complicates

the choice of candidates for a potential HIV vaccine.

A significant body of work exists on the appropriate consideration of HIV

variation in vaccine design strategies and various methods have been

discussed. These include explicit consideration of HIV-1 group M’s

evolutionary history to choose “central” candidates, for example, ancestral

and consensus sequences, and “centre of tree” sequences based on single

or multiple subtypes [12-15]. These approaches assume that all of the

diversity observed in the sequence data sampled from the pandemic is

meaningful. However, a subset of HIV’s diversity will be of limited

importance and so does not need to be considered in vaccine preparations,

be it a preventative or therapeutic approach. This “irrelevant” variation will

either be present in non-viable viruses or result from sequencing artefacts.

Here we present a strategy for identifying the non-significant changes.

These residues can be identified because they lack shape complementarity

with the rest of the protein. When we model these replacements, there are

substantial unfeasible overlaps between atoms (see Fig. 6.1A for an


example). Were the virus to synthesize proteins with these replacements an

unfolded and non-functional protein would likely result. These residues will

be present in real sequence data but predicted to be highly improbable by

our model. If real they would produce non-functional viruses, and so will not

contribute to escape from the immune response. Thus, they are

evolutionarily irrelevant.

Figure 6.1 Predicting HIV Evolution at Individual Sites

(A) Matrix protein P17’s structure and illustration of the “goodness-of-fit” aspect of the

model. Each site in P17 has every amino acid substituted computationally (panel B). In the

inset, position 85 (leucine) is replaced with methionine (dark blue). In this conformation the

methionine has substantial van der Waals overlaps, as illustrated by the pink and yellow

spikes, and so would be considered an improbable replacement. (B) The algorithm used for

the predictive model. All low energy side chain conformations (“rotamers”) are tested, and

the molecular interactions between this residue and the rest of the protein are calculated

with PROBE to determine the “goodness-of-fit” at a specific site. Residues that pass this

test are subsequently filtered using a substitution matrix to determine whether the

substitution/replacement is likely. See methods for further details.

Our predictive strategy can be combined with “coverage” approaches which

involve the identification of multiple candidates for inclusion in artificial

sequence-constructs and a potential polyvalent (or “cocktail”) vaccine [16,


17]. The objective is to build mosaic sequences that will optimally cover the

diversity of HIV sequences encountered by the immune system allowing

recognition of the whole diverse set of viruses, or at the very least viruses

corresponding to a particular subtype of epidemiological significance. There

are, however, practical problems with including too many sequence-

constructs in the cocktail [15]. The challenge, therefore, is to describe the

maximum amount of observed HIV sequence diversity with the fewest

number of artificial sequence-constructs. This involves the identification of

multiple vaccine antigen-constructs that are optimised to maximise coverage

of circulating variants and that include high-frequency epitopes [18, 19].

As a proof of concept we apply our strategy to HIV-1’s Matrix protein, P17.

P17 is processed from the Gag polyprotein, is associated with the inner viral

membrane, and has a number of functions essential for viral assembly and

entry into the infected cell [20]. We have chosen this protein primarily

because of its known immunogenic importance; for example see Iversen et

al. [21]. We demonstrate that (i) amino acid usage at individual sites is

significantly constrained, (ii) our model accurately predicts real viral diversity

and (iii) removal of improbable amino acids leads to a marked increase in

coverage estimates. To increase the accuracy of the predictions of viral

evolution further structural and functional information can be incorporated

into the model.

Results

Predicting viral evolution

We first use our model (Fig. 6.1B) to predict the probable and improbable

replacements at each site in HIV-1’s Matrix protein, P17 (Fig. 6.2A). The

improbable amino acids are of two types.


Figure 6.2 True Positives, False Positives and True Negatives

(A) Venn diagram showing the relationship between observed and predicted amino acids at

individual sites in a sequence alignment. The intersect between the predicted and observed

amino acids corresponds to our correct predictions (true positives, TP; blue). Improbable

amino acids are those that have not been observed to be used by HIV (true negatives, TN;

green) or have been observed in sequence data but are not predicted to be important (false

negatives, FN; red). “Other constraints” accounts for those predictions not observed in the

sequence data (false positives, FP; yellow); if these constraints were included in the model

this over-predicting would not occur. Note, the areas are not to scale and are for illustrative

purposes only. (B) The number of residues predicted by the model at each site in P17 (total

bar height). The number of these that are observed in the alignment (TPs) and the number

predicted but not observed (FPs) are shown. The black squares denote sites that are

significantly different from random (P<0.05). (C) Bar chart depicting the frequencies of

residues predicted (black) and observed (light grey) in P17. The number observed per site

is taken from the alignment of 1,091 unique HIV-1 group M sequences. (D) The number of

residues observed (total bar height). The proportion of these that are predicted (TPs) and

the proportion observed but not predicted (FNs) are shown. Stars indicate sites inferred to

be under the influence of positive selection [7]. The black squares denote sites that are

significantly different from random (P<0.05). (E) The number of residues not observed in

the alignment (total bar height). The number of these not predicted and not observed (TNs),

and the number predicted but not observed (FPs) are shown. The black squares denote

sites that are significantly different from random (P<0.05).

Firstly, residues never observed to have been used by HIV. Secondly, the

residues observed in the available sequence data that are evolutionarily

irrelevant because they give rise to non-functional viruses. In all we find 110

positions out of 115 are predicted to change, i.e., exhibit more than one

probable residue per site (Appendix IV and total bar height in Fig. 6.2B).

The distribution of these predicted replacements is shown in Fig. 6.2C (black

bars). This strikingly demonstrates the high degree of constraint at individual

sites such that the mean number of predicted amino acids per site is only

five, with the maximum being nine. Note, the consensus residue was not

predicted for only six sites (Fig. 6.3). These rare erroneous predictions are

presumably the result of the P17 structure having been solved for one


sequence only, a problem that would be resolved with the availability of more

P17 structures.

Figure 6.3 Amino Acid Frequencies within P17

The amino acid frequencies of all true positives (blue) versus all false negatives (red) at

each site in P17.

To determine the accuracy of the predictions we compare them to the

number of residues observed at each site in the group M P17 sequence

alignment (indicated by total bar height in Fig. 6.2D). Of the 115 sites in the

alignment, only one site was 100% conserved within all the sequences. The

mean number of amino acids per site was observed to be seven while the

maximum was 15. Comparing the distribution of these observed residues to

those predicted (Fig. 6.2B) we find they are not significantly different

(P=0.052, Mann-Whitney test). The result is marginal as there is a clear

tendency for the real data to include more amino acids than predicted (Fig.

6.2C and D). If this under-prediction reflects the presence of mostly

irrelevant amino acids and not the failing of the model, then the residues not

predicted (FNs) should occur at significantly lower frequencies than those

that are predicted (TPs). We find this is the case with FNs occurring at

significantly lower frequencies, 2% on average versus 25% for TPs (Fig. 6.3;

P < 0.0001, Wilcoxon rank-sum test). Of the 207 amino acids that only occur

once in the group M data, 70% are FNs. In addition, six of the eight

positively selected sites that have been identified in the P17 region [7]

correspond significantly to sites with high numbers of observed amino acids

(Fig. 6.2D, P < 0.05, Wilcoxon rank-sum test). We under-predict at these

sites due, presumably, to co-variation effects not accounted for by the model

(see Discussion).

To investigate the statistical significance of our predictions we calculate

positive predictive values, sensitivity and specificity (Appendix V), and

compare these to 10,000 random simulations. A positive predictive value

quantifies how often our predictions are correct, i.e., present in the real data

(TP), out of all the predicted residues at each site (blue bars relative to total

bar height in Fig. 6.2B) compared with random predictions of the same

number of amino acids at each site. The mean of each random simulation

was never greater than the mean positive predictive value (0.71; P < 0.0001)

and these values were statistically significant at 58% of the sites (P < 0.05,

Appendix V and indicated with black squares in Fig. 6.2B).

In terms of the sensitivity of our method, i.e., whether we predict all the

amino acids at a site, we have established the model under-predicts at the

majority of sites, which will lead to lowered sensitivity. This is certainly the

case as indicated by blue bars relative to total bar height in Fig. 6.2D.

Nonetheless the mean of each random simulation was never greater than

the mean sensitivity (0.61; P < 0.0001) and these values were statistically

significant at 63% of sites (P < 0.05, Appendix V and indicated with black

squares in Fig. 6.2E). In terms of specificity, i.e., how often we predict a

residue not to be present, the results were also significant (mean = 0.89; P <

0.0001) with 65% of sites being statistically significant (P < 0.05, Appendix V

and indicated with black squares in Fig. 6.2E). This confirms that our

predictions of probable amino acid replacements are highly accurate as the

model is making very few incorrect predictions.
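
The three measures above can be illustrated for a single site. The following is a minimal sketch, assuming the standard definitions (positive predictive value = TP/(TP+FP), sensitivity = TP/(TP+FN), specificity = TN/(TN+FP)) taken over the 20 standard amino acids; the exact formulation in Appendix V may differ, and the names `site_metrics` and `random_ppv_p_value` are hypothetical.

```python
import random

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def site_metrics(predicted, observed):
    """Return (ppv, sensitivity, specificity) for one alignment site."""
    predicted, observed = set(predicted), set(observed)
    tp = len(predicted & observed)   # predicted and seen in the alignment
    fp = len(predicted - observed)   # predicted but never seen
    fn = len(observed - predicted)   # seen but not predicted
    tn = len(AMINO_ACIDS - predicted - observed)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return ppv, sensitivity, specificity


def random_ppv_p_value(predicted, observed, n_sim=10000, seed=0):
    """Fraction of random predictions (same size) with PPV >= the real PPV."""
    rng = random.Random(seed)
    real_ppv = site_metrics(predicted, observed)[0]
    pool = sorted(AMINO_ACIDS)
    size = len(set(predicted))
    hits = 0
    for _ in range(n_sim):
        random_prediction = rng.sample(pool, size)
        if site_metrics(random_prediction, observed)[0] >= real_ppv:
            hits += 1
    return hits / n_sim
```

The P value here is simply the fraction of random predictions that score at least as well as the model at that site, which is one plausible reading of the comparison to random simulations described above.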

Using predictions to inform vaccine design

In order to quantify coverage requirements, we first determine the

distribution of the number of unique nine-mers present in HIV-1 sequences

across P17 for the global group M and for subtype B data using large

sequence alignments: 1,091 and 1,405 unique sequences, respectively (Fig.

6.4A and B, respectively). We focus on nine-residue fragments because

there has been much discussion of the possibility of a T-cell based vaccine

[22]. A potential coverage-approach is, thus, to produce a polyvalent T-cell

based vaccine comprised of optimised nine-residue peptide fragments

(“nine-mers”; the average length of relevant epitopes) that occur in the

observed HIV-1 sequences [16, 17]. The graphs (Fig. 6.4A and B) show the

highly heterogeneous nature of all nine-mers across P17. Next we exclude

the improbable amino acid residues (specifically FNs) from the P17

alignments, as these do not need to be considered in coverage estimates.

This dramatically reduces the numbers of nine-mer combinations (Fig. 6.4A

and B) with the average falling from 142 to 98 for group M and from 109 to

73 for subtype B. To be conservative, in all analyses all residues were

retained at the five sites for which the consensus residue was not predicted

and at the sites identified as being positively selected, because the model is

clearly not predicting accurately at these positions.
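
The counting step described in this paragraph can be sketched as follows. This is a minimal illustration, assuming the alignment is a list of equal-length amino acid strings and that "excluding" an improbable residue means discarding any nine-mer window that contains one; that interpretation, the function name `unique_kmers_per_position` and the `probable` argument (one set of allowed residues per site) are assumptions made for the example.

```python
def unique_kmers_per_position(alignment, k=9, probable=None):
    """Count distinct k-mers starting at each alignment position.

    If `probable` is given, windows containing a residue outside the
    allowed set for its site are ignored.
    """
    length = len(alignment[0])
    counts = []
    for start in range(length - k + 1):
        windows = set()
        for seq in alignment:
            window = seq[start:start + k]
            if probable is not None and any(
                residue not in probable[start + offset]
                for offset, residue in enumerate(window)
            ):
                continue  # window contains an improbable residue
            windows.add(window)
        counts.append(len(windows))
    return counts
```

Averaging the returned list gives a figure comparable in spirit to the mean numbers of nine-mers quoted above, before and after improbable residues are excluded.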

To determine the impact of removing improbable amino acids in this way, a

novel algorithm (Fig. 6.5) that selects high-frequency nine-mers in a

sequence alignment was developed. The algorithm optimizes the process of

sequence construction such that the initial constructs provide the most

coverage. Unlike previous approaches, a given number of sequence-

constructs guarantees a certain percentage of cover. For example, to

achieve 75% coverage of the group M sequences, 26 sequence constructs

are required (Fig. 6.4C).

Figure 6.4 Distribution of Nine-mers in Relation to Coverage

The distribution of nine-mers across P17 alignments for (A) 1,091 unique HIV-1 group M

and (B) 1,405 unique subtype B sequences for all data (blue) and after removal of

improbable amino acids (orange). The coverage provided by sequence-constructs for the

same (C) HIV-1 group M and (D) subtype B alignments for: all data/no model (blue), after

removal of improbable amino acids as determined by our model for both all of P17 (orange)

and, for subtype B only, an optimal epitope region [23] from sites 11 to 44 (green). Note, the

latter CTL rich region corresponds to sites 8 to 41 in our alignment (transparent green box).

Figure 6.5 Generation of Optimised Sequence Constructs

Schematic representation of the algorithm used to make the optimised sequence-constructs.

Straight arrows represent transitions between the steps of the algorithm. Circular arrows

indicate where multiple iterations of a step occur. Boxes represent list objects.

When we apply our model to exclude improbable residues, only 16 sequence

constructs are required for 75% coverage, while for 90% coverage the number of sequences

falls from 133 to 71. Focusing just on subtype B (Fig. 6.4D), the number of

optimised sequence-constructs is even lower still: 30 as opposed to 60 to

give 90% coverage with and without the model, respectively. Applying our

model to a CTL dense region in P17 (Fig. 6.4B), for 75% and 90% coverage

only 3 and 18 sequence-constructs are required (Fig. 6.4D), while for 95% and

98% coverage 41 and 95 sequence-constructs are required, respectively.

Although these numbers of sequence-constructs are still high for current

vaccine technology, these results demonstrate how HIV’s seemingly

enormous diversity can be reduced to potentially manageable proportions by

focusing on the prediction of probable versus improbable evolution at

individual sites.

Application of the replacement model, thus, leads to a dramatic decrease in

the number of sequence constructs required at all levels of coverage and

performs markedly better than consensus sequences (dashed lines in Fig.

6.4C and D). The reduction in the number of nine-mers required for a given

level of coverage, after use of the model, is approximately equivalent to

removing those amino acid residues that occur at a frequency of 1% and

0.5% for the group M and subtype B alignments, respectively (Fig. 6.6). A

straightforward amino acid frequency cut-off will inevitably lead to a

reduction in the number of nine-mers required because it artificially reduces

the sequence diversity from that actually observed in the HIV population.

However, although there is a significant tendency for the identified

improbable residues (FNs) to be at low frequencies a minority are not (19%

exceed a frequency of 1% in the group M alignment), while a high proportion

of probable residues (TPs) are at low frequencies (47% are at a frequency

lower than 1% in the group M alignment). Thus, although in terms of

coverage the model and specific percentage threshold are similar, the model

reduces diversity in a justifiable manner.
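
For comparison, the frequency cut-off alternative can be sketched in a few lines. This is an illustrative sketch only: it assumes per-site frequencies are computed over non-gap residues, and the function name `residues_above_cutoff` is hypothetical.

```python
from collections import Counter


def residues_above_cutoff(alignment, cutoff=0.01, gap="-"):
    """Return, for each site, the residues whose frequency is >= cutoff."""
    kept = []
    for col in range(len(alignment[0])):
        # Tally residues in this column, ignoring gap characters.
        counts = Counter(seq[col] for seq in alignment if seq[col] != gap)
        total = sum(counts.values())
        kept.append({aa for aa, n in counts.items() if n / total >= cutoff})
    return kept
```

Unlike the structural model, this filter uses frequency alone, so it would discard a probable residue that happens to be rare and retain an improbable one that happens to be common.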

Figure 6.6 Sequence Coverage

The coverage provided by sequence-constructs for (A) the HIV-1 group M and (B) subtype B

alignments for: all data/no model (dark blue), after removal of improbable amino acids as

determined by our model (orange), removal of amino acids occurring at a frequency of less

than 0.5% (green), 1% (purple) and 2% (light blue). The mean coverage for a random

control is shown (black); the vertical error bars include 95% of the coverage scores. The

dashed black line indicates coverage provided by the alignment’s consensus sequence.

Note, these are the same distributions as in Fig. 6.4C and D with the inclusion of percentage

cut-offs.

Discussion

Our model, despite being relatively simple, accurately predicts the subset of

amino acids that are present in the majority of P17’s variable sites (Fig. 6.2).

The approach works because there are strong structural constraints acting

on viral proteins such that the amino acids that can occur at individual sites

are restricted. This link between sequence and structure imposes physical

restrictions both on which sites will change and the nature of the change that

can occur; consequently structure affects the distribution of amino acids

observed. Inclusion of additional constraints at functional sites [20], the

requirement to maintain binding interfaces [24], and intra- and inter-

molecular interactions [25] will place further limits on change. The model

can also be improved by the introduction of structural plasticity [26, 27]. The

quantification of these additional constraints on viral evolution will permit,

with even greater accuracy than achieved here, the prediction of the

evolutionary trajectory of HIV at specific sites.

Protein structural and functional constraints determine whether an amino

acid altering change is neutral (of little or no fitness-cost), prone to purifying

selection (a high cost-of-fitness to the virus in terms of structure and/or

function) or positively selected (immediately beneficial to viral fitness). As a

consequence the so-called high “cost-of-fitness” residue replacements

associated with escape mutations [28] are likely to be improbable according

to our model (Fig. 6.2A). Thus, at these highly specific sites our results are

very probably underestimates; although some of these may correspond to

the positively selected sites for which all residues were retained in our

analysis. Nonetheless our approach based on prediction of probable amino

acid replacements, in the context of protein structure, is the first attempt to

understand the limits of viral evolution. Crucially, the high cost-of-fitness

mutations associated with immune escape require compensatory mutation(s)

[29-31]. Indeed it is the occurrence of compensatory mutation that changes

an amino acid’s replacement from improbable to probable. Thus,

understanding in greater detail the nature of compensatory co-variation both

in terms of intra- and inter-molecular interactions will permit the design of

increasingly sophisticated and representative sequence-constructs.

Ultimately a detailed understanding of the limits of HIV’s ability to change will

permit the accurate deduction of the extent of HIV variation that any vaccine

must protect against. The key to success in realising a viable vaccine will be

the identification of low numbers of optimised (and immunogenic) sequence-

constructs, while simultaneously increasing percentage coverage. The

difficulty in achieving this is dramatically highlighted by the non-linear

relationship between increasing percentage coverage and numbers of

sequence-constructs (Fig. 6.4C and D). This means that coverage of 75% of

the diversity in a large subtype B sequence alignment can be provided by

just six optimised sequence-constructs but further increases in coverage

require disproportionately more artificial sequences, such that 50 constructs

are required for 93% coverage and 100 for 97% coverage. This is because

achievement of very high levels must provide coverage of the more variable

regions of P17 (Fig. 6.4A and B).

In addition to dealing with HIV’s diversity, the model-based sequence-constructs

will have to be further improved, in what is probably the most difficult task in

vaccine design, by manipulating them to be immunogenic [14, 18, 19, 32,

33]. Notwithstanding complexities such as immunodominance and cross-

reactivity [22, 33, 34], this will require the identification of a subset of mosaic-

construct sequences that are optimised for processing and immune

recognition. The importance of our approach is that it is a strategy for

reducing the amount of HIV variation that will need to be considered in this

endeavor. Previous approaches, because they ignore structural and

functional constraints, will have over-estimated the diversity that the immune

response has to deal with.

In conclusion, we can make rational predictions about significant viral

variation based on what are possible, probable, and importantly improbable

amino acid replacements at specific locations. Thus, a predictive strategy is

a more effective paradigm for vaccine design than trying to cope with all

global HIV-1 sequence diversity. This is primarily because we are focussing

on the requirement of both infecting strains and escape mutants to be viable,

as opposed to the often error-laden and functionally, let alone evolutionarily,

irrelevant sequence data that has been sampled from the HIV/AIDS

pandemic. Rational vaccine design, if it is to succeed, must pay greater

attention to quantifying the subset of HIV variation that contributes to on-

going evolution. It is this diversity that will include the variants of

immunological significance in the future.

Methods

The model

The protein structure 1HIW was selected from the Protein Data Bank [35] as

being representative of HIV-1’s Matrix protein, P17 [36]. Hydrogen atoms

were added to the structure in optimum positions using REDUCE [37]. Each

position within the protein structure was analyzed by replacement of the

existing side-chain with all other amino acid side-chains using PREKIN [38].

The goodness-of-fit of these substituted amino acids was then assessed

[39]. The side-chains of each amino acid were rotated in steps of 5° around

each of their χ angles to known rotamer boundaries using a rotamer library

[40] in order to optimise efficiency and minimize computational time. After

each rotation step, all-atom contact measurements were carried out using

PROBE [41] to assess how well the residue would fit within the local

structure of the molecule in the given conformation. PROBE uses the rolling

probe algorithm [42] to recognize regions where steric clashes between

atoms occur. Once this had been done for all amino acids at a particular

position, those residues with a PROBE score > -1 were selected. These were

amino acids with at least one rotameric conformation that did not cause

steric clashes within the existing local protein structure. We predict that

replacements of these residues into the structure would not cause local

structural disruption and are, therefore, more likely to occur than

replacements requiring the local structural environment to shift in order to

accommodate them.

Following the goodness-of-fit analysis, replacement tables were used to

further filter the predicted amino acids at each position in the structure.

Replacement tables were used in order to account for propensity issues

such as the likelihood of buried charges and exposed hydrophobic side-

chains, which would not be taken into account in predictions using

goodness-of-fit alone. The PAM10 [43] replacement matrix was selected to

account for the degree of sequence diversity within the protein family. The

likelihood of each of the amino acids occurring at this position was

assessed using the replacement table and an empirically derived cut-off of

0.005 was used to reduce the predicted set of amino acids further.

Amino acids which exceeded the cut-offs for goodness-of-fit and

replacement likelihood formed our predicted set of “probable” residues for

each site in the protein structure (Fig. 6.1B). These are amino acids which, if

substituted into the protein, would cause minimal structural disruption.

Amino acids not predicted by the model are those that can be discounted

from inclusion in coverage analysis. We do not consider false positive

predictions in coverage estimates (amino acids predicted but not observed),

as these are presumed to be residues that are absent from the real data as a

result of other constraints not quantified by the model (Fig. 6.2A).
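
The two-stage filter described above can be summarised in code. This is a minimal sketch under stated assumptions: `best_probe_score` is taken to hold, for each candidate residue at one site, the best PROBE score found over all rotamers, and `replacement_likelihood` the corresponding value from the replacement table; both dictionaries, the function name and the numbers in the usage example are hypothetical. Only the two cut-offs (-1 and 0.005) come from the text.

```python
PROBE_CUTOFF = -1.0          # goodness-of-fit threshold from the text
REPLACEMENT_CUTOFF = 0.005   # replacement-likelihood threshold from the text


def probable_residues(best_probe_score, replacement_likelihood):
    """Return residues passing both the steric and the replacement filter."""
    return {
        aa
        for aa, score in best_probe_score.items()
        if score > PROBE_CUTOFF
        and replacement_likelihood.get(aa, 0.0) > REPLACEMENT_CUTOFF
    }


# Made-up values for one site, for illustration only.
example_scores = {"A": 0.0, "V": -0.5, "W": -7.2, "G": -0.2}
example_likelihoods = {"A": 0.9, "V": 0.02, "W": 0.3, "G": 0.001}
example_probable = probable_residues(example_scores, example_likelihoods)
```

In this made-up example tryptophan fails the steric filter and glycine fails the replacement filter, leaving alanine and valine as the probable set for the site.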

Data sets

All HIV-1 sequences for which a near full-length genome is available were

downloaded from the LANL HIV Sequence Database (www.hiv.lanl.gov).

This resulted in a data set of 1,179 aligned sequences after strains with

identical names were removed. This data set included 1,091 unique strains.

Sequences corresponding to P17 were extracted, translated to amino acids

and aligned using Muscle [44]. For subtype B all available P17 sequences

were downloaded from the HIV Sequence Database. This resulted in a data

set of 1,405 unique strains. These were translated to amino acids and

aligned using Muscle. Sites that included more than 95% gaps were

removed from all amino acid sequence alignments.
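
The gap-filtering step can be sketched as below, assuming the alignment is a list of equal-length strings with `-` as the gap character; the function name `drop_gappy_columns` is hypothetical.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.95, gap="-"):
    """Remove columns in which more than max_gap_fraction of rows are gaps."""
    n_rows = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(seq[col] == gap for seq in alignment) / n_rows <= max_gap_fraction
    ]
    # Rebuild each sequence from the retained columns only.
    return ["".join(seq[col] for col in keep) for seq in alignment]
```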

The coverage algorithm

Our algorithm maximizes the coverage across an alignment by finding the

best nine-mers that can be pieced together to form multiple sequence-

constructs (see Fig. 6.5 for further details). Nine-mers that provide the

highest coverage at specific locations are chosen preferentially. The first

sequence-construct generated will, thus, provide the highest coverage.

Subsequent constructs each provide less coverage, as lower frequency

nine-mers are included. Combined, the different constructs provide maximal

coverage across HIV’s sequence diversity.

The local coverage provided by an individual nine-mer within a sequence-

construct is the coverage that is provided across all of the sequences within

an alignment at the location where the nine-mer occurs. The mean

coverage provided by a sequence-construct is the average of all the local

covering scores provided by each nine-mer across the sequence. Note, this

approach is not independent of location and so is distinct from a previous

definition of covering score [17] and discussed by Fischer et al. [45].
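
The two definitions above translate directly into code. The following is a minimal sketch, assuming a gap-free alignment of equal-length strings and exact string matching of nine-mers at the same start position; the function names are hypothetical.

```python
def local_coverage(alignment, ninemer, start):
    """Fraction of aligned sequences carrying `ninemer` at position `start`."""
    k = len(ninemer)
    matches = sum(seq[start:start + k] == ninemer for seq in alignment)
    return matches / len(alignment)


def mean_coverage(alignment, construct, k=9):
    """Average local coverage over every k-mer window of a construct."""
    starts = range(len(construct) - k + 1)
    scores = [local_coverage(alignment, construct[s:s + k], s) for s in starts]
    return sum(scores) / len(scores)
```

Because the match is tested only at the position where the nine-mer occurs, the score depends on location, which is the distinction from the earlier covering score drawn above.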

When constructing sequences our algorithm optimises the mean coverage

score within each construct. It also maximizes the number of unique nine-

mers in the sequence-construct set. Maximizing local coverage scores

ensures the inclusion of nine-mers that provide a high level of coverage of

the diversity present among HIV sequences. Nine-mers are added to the

construct in the order of their local coverage score at their optimal location

(Fig. 6.5). The optimal location is the location within the alignment where the

individual nine-mer provides the optimal achievable coverage. After a nine-

mer is included within a construct it is excluded from any future selection at

that location. Once a full sequence has been constructed overlapping nine-

mers that have not previously been removed are excluded from future

selection. This maximises the coverage by ensuring that we are not

generating redundancy due to nine-mer repetition between the different

sequence-constructs.
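
The general idea can be illustrated with a much-simplified greedy procedure. This is not the algorithm of Fig. 6.5: it is a sketch that builds each construct left to right, at every position choosing the most frequent nine-mer that agrees with the residues already placed and preferring nine-mers not yet used at that position. The function names and the combined coverage measure (`set_coverage`) are assumptions made for the example.

```python
from collections import Counter


def build_constructs(alignment, n_constructs, k=9):
    """Greedily assemble constructs from overlapping k-mers, left to right."""
    n_pos = len(alignment[0]) - k + 1
    # Frequency of every k-mer at every start position.
    counts = [Counter(seq[s:s + k] for seq in alignment) for s in range(n_pos)]
    used = [set() for _ in range(n_pos)]  # k-mers already placed per position
    constructs = []
    for _ in range(n_constructs):
        construct = ""
        for s in range(n_pos):
            # Candidates must agree with the k-1 residues already laid down.
            fits = [km for km in counts[s] if km.startswith(construct[s:])]
            fresh = [km for km in fits if km not in used[s]]
            # Prefer unused k-mers; ties are broken by the k-mer string itself.
            best = max(fresh or fits, key=lambda km: (counts[s][km], km))
            used[s].add(best)
            construct = construct[:s] + best
        constructs.append(construct)
    return constructs


def set_coverage(alignment, constructs, k=9):
    """Mean, over positions, of the fraction of sequences matched by any construct."""
    n_pos = len(alignment[0]) - k + 1
    total = 0.0
    for s in range(n_pos):
        available = {c[s:s + k] for c in constructs}
        total += sum(seq[s:s + k] in available for seq in alignment) / len(alignment)
    return total / n_pos
```

Even in this simplified form the first construct captures the most common nine-mers, and later constructs add progressively rarer ones, so each additional construct contributes less coverage than the one before.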

Random controls for Fig. 6.4 were generated after each iteration of the

algorithm by randomly sampling a number of sequences from the input data

set 500 times. For each sample the mean coverage provided was

calculated. The number of sequences sampled corresponded to the number

of sequence-constructs present at that point during the algorithm’s progress.
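
A random control of this kind can be sketched as follows. The sketch assumes that the coverage of a random sample is scored in the same position-wise way as for constructs (the fraction of aligned sequences whose nine-mer at each position is present in the sample, averaged over positions) and that the reported interval is the central 95% of sampled scores; both are assumptions, as are the function names.

```python
import random


def set_coverage(alignment, constructs, k=9):
    """Mean, over positions, of the fraction of sequences matched by any construct."""
    n_pos = len(alignment[0]) - k + 1
    total = 0.0
    for s in range(n_pos):
        available = {c[s:s + k] for c in constructs}
        total += sum(seq[s:s + k] in available for seq in alignment) / len(alignment)
    return total / n_pos


def random_control(alignment, n_constructs, n_samples=500, seed=0):
    """Return (mean, lower, upper) coverage of randomly sampled sequences."""
    rng = random.Random(seed)
    scores = sorted(
        set_coverage(alignment, rng.sample(alignment, n_constructs))
        for _ in range(n_samples)
    )
    mean = sum(scores) / n_samples
    lower = scores[int(0.025 * n_samples)]       # start of the central 95%
    upper = scores[int(0.975 * n_samples) - 1]   # end of the central 95%
    return mean, lower, upper
```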

Acknowledgements

We thank Astrid Iversen and Andrew McMichael for helpful comments and

discussion. We also thank Fred Bibollet-Ruche, Kathryn Else, Kathryn

Hentges and Katie Finegan for critical reading of the manuscript. JA and

SGW were supported by BBSRC and EPSRC studentships, respectively.

References

1. Gao, F., et al., Unselected mutations in the human immunodeficiency virus type 1 genome are mostly nonsynonymous and often deleterious. J Virol, 2004. 78(5): p. 2426-33.

2. Robertson, D.L., B.H. Hahn, and P.M. Sharp, Recombination in AIDS viruses. J Mol Evol, 1995. 40(3): p. 249-59.


4. Ho, D.D., et al., Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 1995. 373(6510): p. 123-6.


6. Wolinsky, S.M., et al., Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science, 1996. 272(5261): p. 537-42.



9. Ciurea, A., et al., CD4+ T-cell-epitope escape mutant virus selected in vivo. Nat Med, 2001. 7(7): p. 795-800.

10. Draenert, R., et al., Immune selection for altered antigen processing leads to cytotoxic T lymphocyte escape in chronic HIV-1 infection. J Exp Med, 2004. 199(7): p. 905-15.

11. Li, Y., et al., Broad HIV-1 neutralization mediated by CD4-binding site antibodies. Nat Med, 2007. 13(9): p. 1032-4.



14. Letourneau, S., et al., Design and Pre-Clinical Evaluation of a Universal HIV-1 Vaccine. PLoS ONE, 2007. 2(10): p. e984.

15. Catanzaro, A.T., et al., Phase 1 safety and immunogenicity evaluation of a multiclade HIV-1 candidate vaccine delivered by a replication-defective recombinant adenovirus vector. J Infect Dis, 2006. 194(12): p. 1638-49.



18. De Groot, A.S., et al., HIV vaccine development by computer assisted design: the GAIA vaccine. Vaccine, 2005. 23(17-18): p. 2136-48.

19. De Groot, A.S., et al., Engineering immunogenic consensus T helper epitopes for a cross-clade HIV vaccine. Methods, 2004. 34(4): p. 476-87.

20. Fiorentini, S., et al., Functions of the HIV-1 matrix protein p17. New Microbiol, 2006. 29(1): p. 1-10.


22. McMichael, A.J., HIV vaccines. Annu Rev Immunol, 2006. 24: p. 227-55.

23. Frahm N, L.C., Brander C, Identification of HIV-Derived, HLA Class I Restricted CTL Epitopes: Insights into TCR Repertoire, CTL Escape and Viral Fitness, in HIV Sequence Compendium, F.B. Thomas Leitner, Hahn B, Marx P, McCutchan F, Mellors J, Wolinsky S, Korber B, Editor. 2007, Theoretical Biology and Biophysics Group, Los Alamos National Laboratory: Los Alamos.

24. Chelliah, V., et al., Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol, 2004. 342(5): p. 1487-504.

25. Overington, J., et al., Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc R Soc Lond B Biol Sci, 1990. 241(1301): p. 132-45.

26. DePristo, M.A., et al., Ab initio construction of polypeptide fragments: efficient generation of accurate, representative ensembles. Proteins, 2003. 51(1): p. 41-55.

27. de Bakker, P.I., et al., Conformer generation under restraints. Curr Opin Struct Biol, 2006. 16(2): p. 160-5.

28. Kent, S.J., et al., Reversion of immune escape HIV variants upon transmission: insights into effective viral immunity. Trends Microbiol, 2005. 13(6): p. 243-6.

29. Allen, T.M., et al., Selection, transmission, and reversion of an antigen-processing cytotoxic T-lymphocyte escape mutation in human immunodeficiency virus type 1 infection. J Virol, 2004. 78(13): p. 7069-78.

30. Peyerl, F.W., et al., Fitness costs limit viral escape from cytotoxic T lymphocytes at a structurally constrained epitope. J Virol, 2004. 78(24): p. 13901-10.

31. Crawford, H., et al., Compensatory mutation partially restores fitness and delays reversion of escape mutation within the immunodominant HLA-B*5703-restricted Gag epitope in chronic human immunodeficiency virus type 1 infection. J Virol, 2007. 81(15): p. 8346-51.

32. Frahm, N., et al., Increased sequence diversity coverage improves detection of HIV-specific T cell responses. J Immunol, 2007. 179(10): p. 6638-50.

33. Rolland, M., D.C. Nickle, and J.I. Mullins, HIV-1 group M conserved elements vaccine. PLoS Pathog, 2007. 3(11): p. e157.

34. Welsh, R.M. and R.S. Fujinami, Pathogenic epitopes, heterologous immunity and vaccine design. Nat Rev Microbiol, 2007. 5(7): p. 555-63.

35. Berman, H.M., et al., The Protein Data Bank. Nucleic Acids Research, 2000. 28(1): p. 235-242.

36. Hill, C.P., et al., Crystal structures of the trimeric human immunodeficiency virus type 1 matrix protein: implications for membrane association and assembly. Proc Natl Acad Sci U S A, 1996. 93(7): p. 3099-104.

37. Word, J.M., et al., Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J Mol Biol, 1999. 285: p. 1735-1747.

38. Richardson, D.C. and J.S. Richardson, The kinemage: a tool for scientific illustration. Protein Science, 1992. 1: p. 3-9.

39. Word, J.M., et al., Exploring steric constraints on protein mutations using MAGE/PROBE. Protein Science, 2000. 9: p. 2251-2259.

40. Lovell, S.C., et al., The penultimate rotamer library. Proteins: Structure, Function and Genetics, 2000. 40: p. 389-408.

41. Word, J.M., et al., Visualizing and quantifying molecular goodness of fit: small-probe contact dots with explicit hydrogen atoms. J Mol Biol, 1999. 285: p. 1711-1733.

42. Connolly, M.L., Solvent-Accessible Surfaces of Proteins and Nucleic-Acids. Science, 1983. 221: p. 709-713.

43. Dayhoff, M.O., W.C. Barker, and L.T. Hunt, Establishing homologies in protein sequences. Methods Enzymol, 1983. 91: p. 524-45.


45. Fischer, W., et al., Coping with viral diversity in HIV vaccine design: a response to Nickle et al. PLoS Comput Biol, 2008. 4(1): p. e15; author reply e25.

Chapter 7: Detection of Low Frequency CXCR4-Using HIV-1 with Ultra-deep Pyrosequencing

Abstract

Pyrosequencing produces unprecedented quantities of genomic sequence

data that can be used for ultra-deep monitoring of HIV’s intra-patient viral

population. Identification of low-frequency variants is of particular

importance for detecting potential pre-existing drug resistance.

Here we address the detection of potential resistance to the CCR5

antagonist maraviroc due to the pre-treatment presence of CXCR4-using

virus. To do this we present a novel protocol, implemented using the Java

programming language, for the management of pyrosequencing data and

genotypic-identification of probable CXCR4-using virus. This includes

alignment without the use of an external reference sequence (critically

important due to the high levels of variation in the variable regions),

translation of reads while maintaining inter-read alignment, determination of

per site diversity and detection of the specific amino acids in V3 associated

with phenotypic variation. We apply the protocol to two extremely large 454

data sets containing 105,000 (pre-treatment, day 1) and 192,000 (post-

treatment, day 11) reads from HIV-1’s envelope [1] in order to determine the

phenotypes present within the intra-patient population. We use the charge

rule, PSSM and geno2pheno to determine phenotype based on sequence

variation. CXCR4-using virus can be detected using all tests at a very low

frequency prior to maraviroc treatment. We then re-construct phylogenetic

trees from reads that span the V3 region. This permits a detailed

visualisation of ultra-deep intra-patient evolution through time involving 9,000

sequences (corresponding to 1,800 unique V3 variants). This phylogeny

confirms the CXCR4-using viruses at day 11 most probably emerged from

pre-existing day 1 viruses, and have not evolved directly from CCR5-using

virus.

Introduction

During the early nineteen nineties it was observed that HIV-1 viruses could

be characterized into two phenotypes referred to as syncytium inducing (SI)

and non-syncytium inducing (NSI) [2]. These phenotypes have different

cellular tropisms due to differences in co-receptor usage [3] and appear

during different stages of infection [4]. The macrophage tropic NSI

phenotype requires the CCR5 co-receptor [5] (referred to as R5) and is

predominant during the early stages of infection [6], while the T cell tropic SI

phenotype uses the CXCR4 co-receptor (referred to as X4) [7] and often

emerges later, around the time of progression to AIDS [8]. The

early dominance of the CCR5-using phenotype may be due to a number of

factors including selection at the point of transmission [9-11], a higher

cytopathicity of SI variants resulting in host cells with a shorter life span [12]

as well as differing fitness levels of the two variants at different stages of

disease progression [13].

Co-receptor usage can be tested for with an experimental assay [14] or

detected by computational analysis based on specific amino acid changes

within the V3 loop of the gp120 gene [15-18]. Amino acid sequence

variations giving rise to a more positive charge within the gp120 gene at

sites 306 and 320 (sites 11 and 25 of the V3 loop) have been strongly linked

with the more virulent CXCR4-using viruses [15-18]. Site 319 has also been

observed to contribute significantly [19]. These sites oppose each other

within the beta-hairpin structure and when all negatively charged residues

are present the electrostatic interaction with the CCR5 co-receptor is thought

to be stabilized [19, 20]. These sites can be used to predict viral phenotype

with up to 94% accuracy [19]. There are other less well understood

interactions between sites within the V3 loop where charge contributes

significantly to the presence of a particular phenotype (Figure 7.1). Both

Web PSSM [21] and geno2pheno [22] are successful genotyping algorithms

that attempt to account for more subtle changes that can take place [23].
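
The simplest of these genotypic tests, the charge rule at V3 sites 11 and 25, can be sketched as follows. The sketch assumes an ungapped V3 amino acid sequence in which string positions 11 and 25 correspond to the sites named above, and treats lysine (K) and arginine (R) as the positively charged residues; the function name and the example sequences in the usage note are illustrative assumptions, and the sketch does not reproduce PSSM or geno2pheno.

```python
POSITIVE_RESIDUES = set("KR")  # assumed definition of "positively charged"


def charge_rule_11_25(v3):
    """Predict co-receptor usage from V3 sites 11 and 25 (1-based)."""
    if len(v3) < 25:
        raise ValueError("V3 sequence is too short to contain site 25")
    site_11, site_25 = v3[10], v3[24]
    if site_11 in POSITIVE_RESIDUES or site_25 in POSITIVE_RESIDUES:
        return "CXCR4"
    return "CCR5"
```

For example, `charge_rule_11_25("CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC")` returns `"CCR5"`, whereas replacing the serine at site 11 with arginine gives `"CXCR4"`.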

Figure 7.1 Sequence logos of the CCR5 and CXCR4-using viruses

All V3 sequences were extracted from the HIV Los Alamos Sequence database and aligned

using ClustalW. Full arrowed lines represent sites used in the charge rule where positively

charged residues are only found within the CXCR4-using phenotype. Dashed arrowed lines

represent additional sites where positively charged residues were observed with the

CXCR4-using phenotype but not in the CCR5-using phenotype.

Recently Pfizer has developed a small-molecule drug, maraviroc, that binds

to the CCR5 receptor making it unavailable for HIV-1 cell entry [24, 25]. Co-

receptor blockage is a novel way of attempting to control the progression of

HIV within the host (Figure 7.2). Targeting the CCR5 co-receptor as with the

natural polymorphism (CCR5Δ32) has few deleterious effects [26].

Individuals that are heterozygous for this mutation were found to have a

slower disease progression to AIDS while individuals that are homozygous

for the mutation show strong resistance to HIV-1 [26-30].

Treatment with maraviroc effectively mimics the CCR5Δ32 phenotype within

an individual. However tests indicate that in patients with treatment failure

the presence of CXCR4-using variants prior to treatment is predictive of this

outcome [1], as such variants will be strongly positively selected for. It is

therefore vitally important to detect low frequency CXCR4-using variants

before treatment. The detection of low frequency variants is also a more

generic problem that is associated with determining the success of other

retroviral drug treatment regimes [31, 32]. With traditional Sanger

sequencing the reliable detection of such variants is not practical [33].

Figure 7.2 Cell Entry Inhibition

A co-receptor antagonist such as maraviroc (red circle) blocking HIV cell entry.

“Ultra-deep” [34, 35] sequencing permits the quantification of the range of

sequence variants present within an HIV-1 sample [36]. Pyrosequencing

technology produces very high numbers of short sequence fragments that

can increase the sensitivity of the detection of low frequency variants but

which introduce unique problems for the computational analysis of

sequence data [37]. The process differs fundamentally from the now

traditional Sanger sequencing chemistry. For example in the Roche (454)

GS FLX system [34, 35] individual bases are not read directly. Instead the

lengths of homopolymeric runs are determined for each base at each

position of the sequence in a predetermined cyclic order. This produces four

flowgrams, one corresponding to each base, that provide information about

which base is present at individual sites (Figure 7.3). The signal intensity

produced at individual positions on each of the four flowgrams is then

rounded to the nearest integer and used to determine how many bases of a

particular type must be inserted.

Figure 7.3 Generation of a Flowgram

Within an individual well bases are supplied in a specific flow order. Four flows,

corresponding to the four bases, are referred to as a complete cycle. During a single flow if

the specific base type is incorporated into the growing sequence a detectable light signal is

emitted as described in the text. Multiple cycles are run with the end result being a

flowgram that can be interpreted as a nucleotide sequence.

For example the results of two complete cycles could be T:1.8, A:0.1, C:0.9,

G:0.1, T:1.6, A:0.0, C:0.4, G:1.0 [36, 38] (Figure 7.3). This would

correspond to the sequence TTCTTG. Ambiguities in signal intensity

however can lead to errors [39]. The closer the signal is to the half way

mark between two integers (e.g. T:1.6 is rounded to TT but if it was T:1.4 it

would just be T) the more likely a read error will occur in the form of either an

insertion or deletion (indel). As a result the majority of errors are due to

overcalls and undercalls [36, 38]. Noise that reduces the quality of the

emitted signal which leads to an increase in such errors can be caused by

signal contamination from nearby wells, multiple templates on a bead and a

loss of synchrony [36, 38]. So far the error rates within the data produced by

the Roche (454) GS FLX system are poorly characterized.
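
The rounding step described above can be illustrated with a minimal sketch. It assumes the flow order T, A, C, G used in the worked example and rounds each signal to the nearest integer; the function name `decode_flowgram` is hypothetical and is not part of any 454 software.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert flow signal intensities into a nucleotide sequence.

    Each signal is rounded to the nearest integer, which is taken as the
    length of the homopolymeric run for the base supplied in that flow.
    The flow order is an assumption taken from the example in the text.
    """
    bases = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # round half up for non-negative signals
        bases.append(base * run_length)
    return "".join(bases)


# Two complete cycles from the example in the text.
signals = [1.8, 0.1, 0.9, 0.1, 1.6, 0.0, 0.4, 1.0]
print(decode_flowgram(signals))  # TTCTTG
```

Changing the fifth signal from 1.6 to 1.4 gives a single T in place of TT, which is the kind of undercall discussed in the text.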

In a recent study by Wang et al., [36] where 6827 HIV-1 reads were

analyzed (8% were removed due to low sequence identity with the

templates) a mean error rate of 0.98% was observed within pyrosequenced

data produced according to [35]. This was subdivided into insertion rates

(0.73%), deletion rates (0.16%) and mismatches (0.12%). It was further

observed that the error rate within homopolymeric regions of size three or

greater was 6.2 times higher than outside of these regions (0.07%). Due to

the large amounts of data produced however such error rates can potentially

have an effect on the results of an analysis and further efforts are required in

order to incorporate them into the processing of the 454 data.

Pyrosequencing technology can be used to track the emergence or

presence of low frequency viral variants early on during an HIV infection [40,

41]. Automated protocols are necessary due to the huge volume of data

generated. Despite this there are currently no user friendly desktop

applications available for handling and analyzing the large numbers of short

reads produced [42]. Here we present a prototype of a protocol for the generic

analysis of viral population pyrosequenced data (Fig. 7.4). We demonstrate

the use of the prototype by applying it to two extremely large 454 data sets

containing 105,000 (pre-treatment, day 1) and 192,000 (post-treatment, day

11) reads from HIV-1’s envelope region [24] and use it in conjunction with

the charge rule, PSSM and geno2pheno to identify the phenotypes present

within a single patient at the two individual time points. We also compare the

indel rates within our data to the indel errors calculated by Wang et al., [36].

This unprecedented dataset provides a snapshot of intra patient sequence

diversity and, in particular, permits the identification of low frequency variants

related to co-receptor usage.

Figure 7.4 Protocol for Handling Pyrosequenced Data

Protocol for the handling of data produced by the 454 Life Science pyrosequencer. Green

circles represent steps that can be automated while orange circles represent steps that

potentially require external programs such as alignment tools and phenotype tests. The red

line represents the consensus template sequence generated from the individual templates

which are constructed from the input reads. In step 5 the yellow dots represent gaps

inserted into the reads during the alignment process while the purple dots represent

insertions that have been removed from the reads during processing. The sequence logo

on the bottom left hand side is a representation of the residue composition of sites 11, 24

and 25 for sequences representing each of the CCR5 and CXCR4-using phenotypes (taken

from Fig. 7.1). The webpage is the interface for the PSSM test.

Methods

Datasets

Pyrosequencing data: Two pyrosequencing datasets from patient A [24]

were provided by 454 Life Sciences. gp160 amplicons of the full-length

envelope gene from two plasma samples (days 1 and 11) were randomly

fragmented and sequenced using a 200 nucleotide average read length

protocol to a depth of greater than 10,000 reads. In total 104,628 and

191,637 nucleotide sequence reads were generated for the day 1 and day

11 sample, respectively. Patient A was infected with a variant classified as

subtype B and was one of two patients in whom CXCR4-using viruses

became dominant post maraviroc treatment. 11 clones from day 1 and 12

from day 11 were also available [24].

Database gp160 sequences: All near full-length gp160 subtype B amino acid

sequences were downloaded from the Los Alamos HIV sequence database.

After removal of sequences with identical titles this left 1,986 for analysis.

These were aligned using Muscle [43] following which columns containing

more than 90% gaps were removed.

Day 1 and Day 11 datasets

A general overview of the protocol developed to handle the pyrosequenced

data is presented in Fig. 7.4. The steps of the protocol are:

Template Construction: Due to the high level of divergence between HIV-1

sequences, particularly in the variable regions of the envelope, a data-

specific reference sequence was constructed from each of the two

pyrosequenced datasets (Figure 7.4). Use of a consensus template

minimized the number of insertions removed from the reads during the

subsequent pairwise alignment steps. The construction of a data-specific

reference sequence involved the generation of a population consensus

sequence based on 100 template sequences constructed from concatenated

reads. These templates were each constructed from overlapping high-

identity reads. The parameters used were: maximum overlap (200

nucleotides), minimum overlap (100 nucleotides), mismatches (4) and final

sequence length (2,500 nucleotides). The mismatches are the number of

non-identical sites in the overlapping regions; permitting up to 4 ensured that

the concatenated sequences spanned most of gp160. The overlap and length parameters were based on

the average read length and approximate length of gp160 (Figure 7.5).
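
The concatenation of overlapping reads can be sketched as a greedy extension. This is only an illustration under stated assumptions: the text does not describe how the seed read or the next read were chosen, so the seed, the read order and the function names `overlap_length` and `build_template` are hypothetical. The default parameter values are those listed above.

```python
def overlap_length(left, right, min_overlap=100, max_overlap=200, max_mismatches=4):
    """Longest suffix/prefix overlap between left and right within the limits.

    Returns 0 when no overlap of at least min_overlap nucleotides has
    max_mismatches or fewer non-identical sites.
    """
    longest = min(max_overlap, len(left), len(right))
    for k in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(left[-k:], right[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def build_template(seed, reads, final_length=2500, **overlap_args):
    """Greedily extend a seed read with overlapping reads (assumed strategy)."""
    template = seed
    unused = list(reads)
    while len(template) < final_length:
        for read in unused:
            k = overlap_length(template, read, **overlap_args)
            if k and len(read) > k:
                template += read[k:]   # append only the non-overlapping part
                unused.remove(read)
                break
        else:
            break  # no remaining read extends the template
    return template[:final_length]
```

With toy settings such as `min_overlap=5, max_overlap=8, max_mismatches=0`, the seed `AAACCCGGG` is extended by `CCCGGGTTT` and then `GGTTTACGT` to give `AAACCCGGGTTTACGT`.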

In total 150 concatenated template sequences were constructed for each of

the two datasets. As the reads are in both directions, the concatenated

sequences not in the direction 5’ to 3’ (74 and 86 for days 1 and 11,

respectively) were complemented using HXB2 as a reference. The

consensus template sequence was constructed from the 100 of these that

maximised the coverage of gp160. These were aligned with Muscle [43].

Sites containing >90% gaps were removed from the alignment and a

majority-rule consensus sequence constructed.
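
A minimal sketch of the gap filtering and majority-rule consensus step follows. It assumes the aligned templates are equal-length strings with `-` for gaps, and that the majority is taken among non-gap characters; the text does not state how gaps or ties were handled, so both are assumptions, and `majority_consensus` is a hypothetical name.

```python
from collections import Counter


def majority_consensus(alignment, max_gap_fraction=0.9):
    """Majority-rule consensus of an alignment, dropping gap-rich columns."""
    n = len(alignment)
    consensus = []
    for column in zip(*alignment):
        if column.count("-") / n > max_gap_fraction:
            continue  # column contains more than 90% gaps: remove it
        counts = Counter(c for c in column if c != "-")
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Ten templates with a gap at the third column and one with an insertion.
alignment = ["AC-T"] * 10 + ["AGGT"]
print(majority_consensus(alignment))  # ACT
```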

Pairwise Alignments: All reads were then aligned in a pairwise manner to the

consensus template sequence using the Smith-Waterman algorithm [44] with

the parameters: gap opening penalty (-4), gap extension penalty (-2), match

(+2), transition (-1) and transversion (-2). To maintain site compatibility

between reads and ensure the removal of all pyrosequencing based errors

that cause deleterious frame shifts in relation to the consensus template

sequence any insertions within the reads in relation to the template

sequence were removed. A more precise method of dealing with these

insertions is discussed in the discussion section. The length and

frequencies of the insertions (along with any deletions) were recorded.

During the pairwise alignment any reads producing an exceptionally poor

alignment (Figure 7.5) were removed from the dataset.
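
The removal of insertions relative to the template, and the recording of indel lengths, can be sketched as below. The input is assumed to be one pairwise alignment given as two equal-length strings with `-` for gaps, as would be produced by any Smith-Waterman implementation; `strip_insertions` is a hypothetical name and not the code used in this work.

```python
from collections import Counter


def strip_insertions(template_aln, read_aln):
    """Remove read insertions relative to the template and tally indel lengths.

    Returns the read in template coordinates (deletions kept as '-'),
    plus Counters mapping insertion and deletion lengths to their counts.
    """
    kept = []
    insertions, deletions = Counter(), Counter()
    ins_run = del_run = 0
    for t, r in zip(template_aln, read_aln):
        if t == "-":
            ins_run += 1            # base in the read that the template lacks
        else:
            if ins_run:
                insertions[ins_run] += 1
                ins_run = 0
            kept.append(r)          # a '-' here marks a deletion in the read
        if r == "-":
            del_run += 1
        elif del_run:
            deletions[del_run] += 1
            del_run = 0
    if ins_run:
        insertions[ins_run] += 1
    if del_run:
        deletions[del_run] += 1
    return "".join(kept), insertions, deletions


read, ins, dels = strip_insertions("ACG-TACGT", "ACGGTA--T")
print(read, dict(ins), dict(dels))  # ACGTA--T {1: 1} {2: 1}
```

Because every column with a template gap is dropped, the returned read always has the same length as the ungapped template region, which is what maintains site compatibility between reads.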

Translations: Each of the aligned reads was then translated into all three

reading frames and the highest scoring translation after realignment with the

corresponding region on the translated template sequence selected.
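
As an illustration of the frame selection, the sketch below translates a read in all three forward frames and keeps the frame most similar to the translated template. The similarity score used here (`difflib.SequenceMatcher.ratio`) is a stand-in chosen for brevity and is an assumption; the protocol itself realigns each translation to the template. The names `translate` and `best_frame` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nucleotides, frame=0):
    """Translate from the given frame; unknown or gapped codons become 'X'."""
    protein = []
    for i in range(frame, len(nucleotides) - 2, 3):
        protein.append(CODON_TABLE.get(nucleotides[i:i + 3].upper(), "X"))
    return "".join(protein)


def best_frame(read, template_protein):
    """Return (score, frame, protein) for the best of the three frames."""
    scored = []
    for frame in range(3):
        protein = translate(read, frame)
        score = SequenceMatcher(None, protein, template_protein, autojunk=False).ratio()
        scored.append((score, frame, protein))
    return max(scored, key=lambda item: item[0])


score, frame, protein = best_frame("GATGAAAGCGTGGTTT", "MKAWF")
print(frame, protein)  # 1 MKAWF
```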

Detection of CXCR4-using phenotype: Genotypic detection of CXCR4-using

virus in the V3 region was carried out on all pyrosequenced reads for the two

datasets:

(1) Partial V3 reads: Sites 11, 24 and 25 are predictive of CXCR4-using

variants [19, 20] so were extracted from the datasets maintaining site

compatibility between reads and used for the charge rule test. In addition,

sites 7, 8, 22 and 30 were used as these are also predictive of co-receptor

usage (Figure 7.1). For CXCR4-using strains designation a positive charge

had to be present in at least one of these sites; for the CCR5-using strains

designation no positive charge could be present. Thus, if only two sites were

present and found to be negatively charged the phenotype could still not be

guaranteed as being CCR5-using as the remaining unrepresented sites

could contain a positively charged residue. For consistency the charge at all

sites had to be determined.

(2) Complete V3 reads: Translated reads that span the entire V3 region were

also extracted from the datasets. The phenotype of the reads was predicted

using charge rule as with the partial fragments, PSSM [21] and geno2pheno

[22]. Fewer sequences could be tested in this manner due to the

requirement of full length V3 regions as opposed to the individual site

requirement above.
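
The charge rule as described in (1) can be sketched as a small classifier. It assumes that a read is supplied as a mapping from V3 site number to amino acid, and that lysine (K) and arginine (R) are the positively charged residues; the Discussion also lists histidine, which can be included through the `positive` argument. The function name and return labels are hypothetical.

```python
def charge_rule(residues, sites=(11, 24, 25), positive="KR", require_all_sites=True):
    """Classify a read by the charge at selected V3 sites.

    residues maps site number -> amino acid; absent, '-' or 'X' entries
    are treated as missing. With require_all_sites=True every site must be
    present before any call is made. With require_all_sites=False a single
    positive charge is enough for a CXCR4-using call, but a CCR5-using call
    still needs every site to be present.
    """
    observed = [residues.get(site) for site in sites]
    missing = any(aa is None or aa in "-X" for aa in observed)
    if require_all_sites and missing:
        return "undetermined"
    if any(aa is not None and aa in positive for aa in observed):
        return "CXCR4-using"
    return "undetermined" if missing else "CCR5-using"


print(charge_rule({11: "R", 24: "G", 25: "E"}))               # CXCR4-using
print(charge_rule({11: "S", 24: "G", 25: "D"}))               # CCR5-using
print(charge_rule({11: "S", 24: "G"}))                        # undetermined
print(charge_rule({11: "K"}, require_all_sites=False))        # CXCR4-using
```

Passing `sites=(7, 8, 11, 22, 24, 25, 30)` gives the extended test that includes the additional sites.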

Phylogenetic analysis: The corresponding nucleotide sequences for the

complete V3 regions were aligned using Muscle. PhyML [45] was used to

re-construct maximum likelihood trees using the HKY model of nucleotide

substitution with the ratio between transversions and transitions being set to 2.

Database entropy and amino acid usage

The per site Shannon entropy was calculated using the Shannon entropy

formulae as described on the Los Alamos HIV database website from the

sequences that were extracted from the database. Site compatibility was

maintained with the templates constructed for the day 1 and day 11 data by

aligning the consensus template sequences with the database data and

only maintaining columns that did not have gaps for these sequences.
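
A minimal sketch of the per site Shannon entropy, H = -Σ p log p over the residues observed in a column, is given below. The use of the natural logarithm and the exclusion of gap characters from the frequencies are assumptions; the exact conventions of the Los Alamos tool are not stated in the text.

```python
import math
from collections import Counter


def site_entropy(column, ignore="-"):
    """Shannon entropy (natural log) of one alignment column."""
    residues = [c for c in column if c not in ignore]
    if not residues:
        return 0.0
    total = len(residues)
    # p * log(1/p) is used so that a fully conserved site returns 0.0
    return sum((n / total) * math.log(total / n) for n in Counter(residues).values())


def alignment_entropy(alignment):
    """Entropy at every column of an alignment of equal-length sequences."""
    return [site_entropy(column) for column in zip(*alignment)]


print(alignment_entropy(["AC", "AG"]))  # [0.0, 0.693...]
```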

Results

The majority of pyrosequenced reads aligned to the template sequence with

high alignment scores, >0.8, for both datasets (Figure 7.5 B and D). The

peak of low quality pairwise alignments, <0.2, mainly corresponds to longer

sequence fragments (>250 nucleotides). Low identity reads, <0.6, were

excluded from further analysis leaving 94,029 and 177,459 reads,

corresponding to 18,944,164 and 35,724,117 bases, for days 1 and 11

datasets respectively. This was a removal of 10.1 and 7.4% of the data

respectively. These numbers of low identity reads are similar to those

observed in Wang et al., [36] where a first generation pyrosequencer (454

Life Science GS20) was used. From Figure 7.5, Panels A and C, it can be

observed that the majority of reads within these lower peaks are greater than

250 nucleotides in length.
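
The filtering threshold and the removal percentages quoted above can be checked with a short sketch. It assumes that each read carries an alignment score normalised to the range 0 to 1, as plotted in Figure 7.5; the function names are hypothetical.

```python
def filter_reads(scored_reads, min_score=0.6):
    """Keep reads whose normalised alignment score is at least min_score."""
    return [read for read, score in scored_reads if score >= min_score]


def percent_removed(total_reads, retained_reads):
    """Percentage of reads discarded by the filter."""
    return 100.0 * (total_reads - retained_reads) / total_reads


# Read counts before and after filtering, taken from the text.
for day, total, retained in [(1, 104_628, 94_029), (11, 191_637, 177_459)]:
    print(f"day {day}: {percent_removed(total, retained):.1f}% removed")
```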

Figure 7.5 Segment Length, Alignment Score and Frequency

(A and C) The relationship between read length and score for day 1 and day 11 data

respectively, (B and D) the relationship between the score and frequency of occurrence for

both datasets respectively. The blue shaded area represents reads that showed good

sequence identity with localized regions of the template. The red shaded area represents

reads that had very little or no sequence identity with the template sequence.

The coverage obtained across the gp160 region for the day 1 and day 11

data is displayed in Figure 7.6. All regions of gp160 are well represented.

Note, the coverage peak towards the 3’ end of gp160 is an artefact of the

sequencing process that has been previously observed [46].

Figure 7.6 Nucleotide Coverage Across gp160

Coverage provided across the gp160 region by the day 1 (blue) and day 11 (red) datasets

after sequences of low identity have been removed. Genomic regions are depicted along

the x-axis. Conserved regions are shaded black while the variable regions are shaded grey.

The inset displays the frequency of occurrence of individual read lengths.

Indel frequencies in relation to the consensus template are described in Table

7.1. These are the frequencies after low identity sequences have been

removed. The mean insertion frequency for the day 1 data set was

0.8549% while the mean deletion frequency was 0.4796%. For the day 11

dataset these numbers were 1.2107 and 0.7423% respectively. The

removal of indels using the reference sequence as a guide conservatively

maintains the correct reading frames within individual reads. However

potentially viable indels that occur will also be removed during this process

thus a small proportion of the data is lost. If viable these would be in

multiples of 3 nucleotides corresponding to codon lengths. For insertions

and deletions these constituted 0.0054 and 0.00057% for day 1 and 0.0995

and 0.0763% for day 11 respectively (highlighted green in the table). When these

are removed from the analysis the potential error frequencies become

0.8495 and 0.4790% for day 1 and 1.1112 and 0.6659% for day 11

respectively.

Table 7.1 Rates of Insertion and Deletion

Frequencies of the length of indel regions stored during the pairwise alignment step of the

protocol displayed in figure 7.4.

In order to display per site diversity, Shannon entropy was calculated for

each site across gp160 for the two datasets and compared to subtype B

sequences from the Los Alamos HIV sequence database (Figure 7.7). On

day 1 the mean entropy across the gp120 region is 0.2 with the mean for the

variable regions being 0.32 and for the conserved regions being 0.18. For

gp41 the mean entropy is 0.13. On day 11 the mean entropy across gp120

is 0.18 with the mean for the variable regions being 0.26 and for the

conserved regions being 0.15. The gp41 has entropy of 0.14. The entropy

at individual sites 7, 8, 11, 22, 24, 25, and 30 within the V3 loop are

highlighted in the figure (panels B and C).

Figure 7.7 Shannon Entropy Across gp160

Entropy plot for (A) the subtype B sequences extracted from the HIV Los Alamos sequence

database, (B) the day 1 dataset, (C) the day 11 dataset and (D) the difference between the

day 1 and day 11 datasets. In (D) the shaded grey region covers 90% of the data. Various

locations within the gp160 gene are displayed along the x-axis.

The outcomes of the phenotype tests used on the day 1 dataset are displayed

in Table 7.2 Panel A. The charge rule only required the presence of sites

11, 24 and 25 within the V3 and so many more partial V3 reads could be

tested (5986) when compared to the PSSM test and geno2pheno (3384) as

both required complete V3 regions. Of the 5986 reads only 18 reads

(0.30%) of the CXCR4-using phenotype were detected using the complete

charge rule.

Table 7.2 HIV-1 Phenotype Counts

Phenotype counts for day 1 (A) and day 11 (B) data using the three tests described in the

text.

Of the 3384 full-length V3 reads in the day 1 data set, CCR5-using strains

were the predominant phenotype. Only 7 reads were classified according to

PSSM as being CXCR4-using. Geno2pheno classified 33 reads as CXCR4-

using. 11 CXCR4-using reads were observed using the charge rule on

these full-length V3 reads. In this case the number of potentially falsely

predicted CXCR4-using reads using the charge rule drops to 4 (fewer total).

For the day 11 dataset 12,086 V3 fragments were available for testing using

the charge rule, while 6687 complete V3 reads were available for

geno2pheno and PSSM (Table 7.2 Panel B). Of the full-length V3 reads, 5383

were classified as CXCR4-using according to PSSM while 5479 were

classified as CXCR4-using according to geno2pheno. 5,423 of the complete

V3 reads were classified as CXCR4-using according to the charge rule.

When the charge rule was run on the 12,086 reads containing both full and partial

V3 fragments 9780 could be classified as CXCR4-using. Individually all

three tests displayed very little difference between the percentages of the

CXCR4-using phenotype constituting the dataset.
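
The agreement between the tests on the day 11 data can be made explicit by converting the counts above into percentages. This is a simple worked calculation using only the counts quoted in the text; the dictionary layout and the `percent` helper are illustrative.

```python
def percent(part, whole):
    """Percentage of whole represented by part."""
    return 100.0 * part / whole


# (CXCR4-using reads, reads tested) for the day 11 dataset.
day11 = {
    "PSSM (complete V3)": (5383, 6687),
    "geno2pheno (complete V3)": (5479, 6687),
    "charge rule (complete V3)": (5423, 6687),
    "charge rule (complete and partial V3)": (9780, 12086),
}

for test, (x4_reads, tested) in day11.items():
    print(f"{test}: {percent(x4_reads, tested):.1f}% CXCR4-using")
```

All four values fall between roughly 80% and 82%.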

For both datasets when the charge rule is performed with the inclusion of

sites 7, 8, 22 and 30 very little alteration of the ratios between the CXCR4-

using and CCR5-using viruses was observed (Table 7.2 Panels A and B).

The extra positive charge detected at these sites within the CXCR4-using

viruses did not seem to play a significant role in determining CXCR4-usage.

In relation to the charge rule for individual sites, and for complete V3 regions,

the presence of a residue at each of the sites was a requirement. This is

necessary in order to reduce the number of reads being falsely classified as

CCR5-using. However in relation to the detection of the CXCR4-using

phenotype every residue does not need to be present. This is because as

long as a single positive charge exists at any of the sites the CXCR4-using

viruses can be determined.

Table 7.3 Panels A and B is an updated version of table 7.2 where the

CXCR4-using phenotype is determined based on a positive charge being

present at any one of the sites 11, 24 and 25 regardless of whether one or

more residues are missing. The loss of the condition that a residue has to

be present at each of the sites tested allows for a greater number of reads to

be used.

Table 7.3 Alternate Phenotype Counts

CXCR4-using phenotype counts for day 1 (A) and Day 11 (B) data when a single positive

charge is present at individual sites regardless of whether residues are present at all other

sites or not.

Figure 7.8 displays a maximum likelihood tree consisting of all the V3

segment data from days 1 and 11. In total there are 1799 unique fragments

representing 10,071 variants (3384 from day 1 and 6687 from day 11). The 23

cloned sequences described in Westby et al., [1] from both days have been

truncated and added to the alignment (not shown). It can be observed that

the cloned sequences make up only a small proportion of the diversity that is

present on the tree. Clusters of the day 1 CCR5-using (green), day 11

CXCR4-using (orange) and day 11 CCR5-using (blue) V3 reads appear to

be strong with a couple of exceptions.

Figure 7.8 Phylogenetic Tree of Day 1 and Day 11 V3 Segment Data

(A) Maximum likelihood tree containing all the full length V3 nucleotide reads isolated from

both the day 1 and day 11 datasets. Phenotype classifications (based on both PSSM and

the charge rule) are represented according to the colours described in the figure key. (B) Identical tree to that of (A). Colours describe the number of variants present at the tips

instead of phenotype.

Day 1 CCR5-using strains are not present within any of the day 11 clusters.

In contrast within the main day 11 CXCR4-using cluster three day 1 CXCR4-

using variants can be observed to be located near the centre of the tree.

Considering that there were a maximum of 11 day 1 CXCR4-using variants

in total this is a substantial number of day 1 CXCR4-using variants to be

located within the day 11 CXCR4-using cluster. This indicates that the day

11 CXCR4-using variants may have arisen, following drug treatment, from

CXCR4-using variants that were present at very low frequency within the

population at an earlier time point.

Discussion

We have demonstrated the use of a novel protocol (Fig. 7.4) for the accurate

handling of the large amounts of short pyrosequenced data generated by the

Roche (454) GS FLX system [34, 35]. Using our protocol, in conjunction

with three standard phenotyping methods on 3384 full length V3 reads, 11

(charge rule), 7 (PSSM) and 33 (geno2pheno) CXCR4-using variants were

detected within the day 1 dataset that had been previously screened for the

absence of the CXCR4-using viruses. When 5986 partial and complete V3

reads were tested using the charge rule the number of CXCR4-using

variants detected increased to 18.

For the day 1 dataset the 11 (from the 3,384 complete V3 sequences) and

18 (from the 5,986 complete and partial sequences) CXCR4-using variants

that were detected using the charge rule are unlikely to be due to

substitution sequencing error. When the generic 0.1% pyrosequencing error

rate (for substitutions) [36] is applied to the nucleotides determining the

residues present at sites 11 and 24, assuming two non-synonymous

positions in each codon, the maximum number of expected false predictions

is less than 4 and 6 respectively. The ratio of positively charged codons to

negatively charged codons when determining this error rate is taken to be

8/53 as there are 8 codons resulting in a histidine, lysine or arginine. Site 25

was observed not to contribute to the phenotype switch within the day 1

dataset and so was omitted from the calculation.

When additional sites 7, 8, 22 and 30 were used for an extended charge rule

test (Table 7.3), on 10,204 full and partial V3 fragments, the number of reads

predicted to be of the CXCR4-using variants rose to 33 while the potential

error rate increased to <28 due to the addition of the extra sites. These extra

sites, where additional positive charge was observed within the CXCR4-

using variants (Figure 7.1), do not appear to have a major influence on

increasing the number of CXCR4-using variants detected. This suggests

that they may have changed amino acid residues after the phenotype switch

occurred and may not be of primary importance during the switch itself.

Detailed analysis of V3 loop data is thus required in order to determine the

exact evolutionary relationships between the sites within the region and the

role that each site plays in determining phenotype. Knowing that sites 11, 24

and 25 have the strongest influence [19, 20] what co-evolutionary

relationships exist between these sites and other sites within the region? For

example, could positive charge at other sites be solely responsible for co-

receptor switching or could more positively charged residues occur within the

region to increase the affinity for the CXCR4 co-receptor after the initial

switch has occurred based on the charge present at sites 11, 24 and 25? Our

results would indicate the importance of sites outside of 11, 24 and 25 could

be limited in relation to the initial phenotype switch.

Within the day 1 and day 11 dataset indels accounted for 1.3345 and

1.953% of the data respectively. This is the percentage of the total number

of bases sequenced on each day. For the day 1 dataset 64% of the data

consisted of insertions while the remaining 36% was made up of deletions.

For the day 11 dataset these numbers were 62% and 38% respectively.

This ratio reflects those observed by Wang et al., on a smaller dataset [36]

where insertions were observed to be more frequent than the deletions. The

overall error rate calculated by Wang et al., was 0.98% of the dataset. The

difference between this error rate and our indel rates is presumably due to

the larger size of our data set.
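
The totals and the insertion to deletion split quoted above follow directly from the mean frequencies reported in the Results. The short calculation below reproduces them; the data structure and the `indel_summary` helper are illustrative only.

```python
# Mean insertion and deletion frequencies (% of bases), from the Results.
indel_percent = {
    "day 1": {"insertions": 0.8549, "deletions": 0.4796},
    "day 11": {"insertions": 1.2107, "deletions": 0.7423},
}


def indel_summary(rates):
    """Return (total indel %, insertion share %, deletion share %)."""
    total = rates["insertions"] + rates["deletions"]
    insertion_share = 100.0 * rates["insertions"] / total
    return total, insertion_share, 100.0 - insertion_share


for day, rates in indel_percent.items():
    total, ins_share, del_share = indel_summary(rates)
    print(f"{day}: total {total:.4f}%, insertions {ins_share:.0f}%, deletions {del_share:.0f}%")
```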

The percentage of codon (triplet) insertions within the day 1 dataset was

observed to be 0.0054% (Table 7.1). For the charge rule three codons are

used to determine the CXCR4-using viruses. The chance that a complete

codon insertion will occur at any one of these sites within an individual read thus

potentially disrupting the charge rule result is 0.0054%. Within an individual

read the chance of having an altered false prediction based on a codon

insertion is thus 0.0162%. With 5,986 reads being tested the number of

expected erroneous predictions based on such insertions is 0.9697. This

is a very small increase in the error rate and when this is combined with the

expected error due to mismatches (discussed above) the number of CXCR4-using

variants detected on day 1 is still above the expected error, confirming the

presence of CXCR4-using variants within the day 1 dataset.
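
The expected number of erroneous predictions caused by codon insertions can be reproduced as a worked calculation. As in the text, the per-read probability is approximated as the per-site rate multiplied by the three sites tested; the `expected_false_calls` helper is a hypothetical name.

```python
codon_insertion_percent = 0.0054  # % codon (triplet) insertions, day 1
sites_tested = 3                  # V3 sites 11, 24 and 25
reads_tested = 5986               # complete and partial V3 reads, day 1


def expected_false_calls(per_site_percent, n_sites, n_reads):
    """Expected number of reads affected, using the linear approximation."""
    per_read_probability = n_sites * per_site_percent / 100.0
    return per_read_probability * n_reads


print(round(expected_false_calls(codon_insertion_percent, sites_tested, reads_tested), 4))
# 0.9697
```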

An analysis of the day 1 and day 11 phylogeny combined (Fig. 7.8) reveals

that at the centre of the main day 11 CXCR4-using cluster three day 1

CXCR4-using variants (red) can be identified. These are located closer to

the root of the cluster than the day 11 CCR5-using variants (blue) that can

be identified within the cluster. It is most probable that this cluster has

emerged through the selection of day 1 CXCR4-using variants following

maraviroc therapy as no day 1 CCR5-using variants are located within the

vicinity of the day 11 CXCR4-using variants.

The mean amino acid entropy values for the various regions within the

gp160 gene are displayed in Figure 7.7. The entropy values for the global

subtype B data are higher than that of the day 1 and day 11 intra patient

data. More of HIV-1’s viable sequence space is being explored at an inter

host level due to the genetics of different hosts rather than within a single

individual host at a single time point in the immune history [47]. It is

interesting to note in each of the datasets the higher entropies at the

exposed surface loops of the envelope protein when compared to the more

protected internal regions.

Both the day 1 and day 11 datasets provided complete coverage across the

gp160 region even after sequences of low identity were removed (Fig. 7.6).

When the read lengths were plotted against the pairwise alignment score

with the template it was observed that for reads of 250 nucleotides or fewer there

was no correlation between the length and the alignment score (Fig. 7.5,

Panels A and C). However reads larger than 250 nucleotides were often

associated with a poorer score. This emphasises that the current

pyrosequencing technology is not reliable for producing reads of longer

length [35]. This is possibly due to a number of cumulative factors

contributing to noise within individual wells during sequencing [36, 38].

The short nature of pyrosequencing data imposes some implicit problems,

such as template construction, storage, alignment, translation and isolation

of individual sites that our protocol overcomes. For the first time we have

provided a high resolution snapshot of intra patient data pre and post

treatment with the CCR5 inhibiting compound maraviroc. We have also

provided the ability to repeat this analysis on any dataset. Using this

snapshot we have presented evidence supporting the theory that the rapid

emergence of the CXCR4-using viruses following treatment of Patient A with

maraviroc was due to the pre treatment presence of CXCR4-using virus.

This presence was previously undetectable. Thus usage of maraviroc is

indeed a feasible option in the development of novel treatment regimes for

HIV when used in the absence of CXCR4-using viruses [1].

Future Developments

The protocol developed for managing the data output from the 454

sequencer played a central role in the analysis. It is intuitive, user friendly

and straightforward to implement. Its assumptions based on the treatment of

errors were discussed in the previous section. Although the effects of these

errors are small in relation to how the protocol conservatively dealt with

them, as discussed, the amount of data available makes it important to be as

precise as possible. Distinctions between pyrosequencing error and

potential naturally occurring error must be defined. The differences in these

error rates are not well understood [36, 38] and will have an effect (although

minimized within our protocol) on the accuracy of using pyrosequenced data

in relation to detecting low frequency variants within populations.

The observed error rates (Table 7.1 and [36]) confirm that preprocessing of

the 454 data is an important requirement in order to increase the accuracy

and reliability of results in relation to the detection of low frequency variants.

A proposed modified protocol that is based closely on Figure 7.4 is

presented in figure 7.9. The modified version accounts for insertions within

the reads by incorporating a stage where the consensus template is

extended based on the frequencies of inserted complete codons within the

reads. Segments will be realigned to this extended consensus template

reducing the need to lose viable insertions that do not disrupt the reading

frame. Translated reads will be aligned to the translated template sequence

in order to get a more precise positioning of amino acid residues that are not

dependent on the original nucleotide alignment. The amino acid alignment

will then be used in order to obtain residues at individual sites or across

regions for phenotype testing as well as to refine the original nucleotide

alignments. The refined nucleotide alignments can be used in order to

obtain an exact measure of the intra host mutation rates within the HIV

lifecycle. Tree re-construction will occur as before but will use the realigned

reads so that more phylogenetically relevant information will be used to

determine the nature of clustering within the reads.

Figure 7.9 Updated Protocol

An extended version of the protocol presented in figure 7.4. Grey boxes highlight the

parts to be updated as described in the text.

Acknowledgements

We would like to thank Richard Harrigan from the British Columbia Centre of

Excellence in HIV/AIDS, 454 Life Sciences for supplying the 454 sequence

data, the investigators, study-site staff, the Pfizer maraviroc development

team and the patients who participated in the maraviroc studies. In addition

we thank Alex Thielen for help with geno2pheno. This project was funded by

the BBSRC and Pfizer Global R&D.

References

1. Westby, M., et al., Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol, 2006. 80(10): p. 4909-20.

2. Koot, M., et al., HIV-1 biological phenotype in long-term infected individuals evaluated with an MT-2 cocultivation assay. AIDS, 1992. 6(1): p. 49-54.

3. Moore, J.P., et al., The CCR5 and CXCR4 coreceptors--central to understanding the transmission and pathogenesis of human immunodeficiency virus type 1 infection. AIDS Res Hum Retroviruses, 2004. 20(1): p. 111-26.

4. Shankarappa, R., et al., Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol, 1999. 73(12): p. 10489-502.


6. Connor, R.I. and D.D. Ho, Human immunodeficiency virus type 1 variants with increased replicative capacity develop during the asymptomatic stage before disease progression. J Virol, 1994. 68(7): p. 4400-8.

7. Bjorndal, A., et al., Coreceptor usage of primary human immunodeficiency virus type 1 isolates varies according to biological phenotype. J Virol, 1997. 71(10): p. 7478-87.

8. Koot, M., et al., Conversion rate towards a syncytium-inducing (SI) phenotype during different stages of human immunodeficiency virus type 1 infection and prognostic value of SI phenotype for survival after AIDS diagnosis. J Infect Dis, 1999. 179(1): p. 254-8.






14. Whitcomb, J.M., et al., Development and characterization of a novel single-cycle recombinant-virus assay to determine human immunodeficiency virus type 1 coreceptor tropism. Antimicrob Agents Chemother, 2007. 51(2): p. 566-75.

15. Hwang, S.S., et al., Identification of the envelope V3 loop as the primary determinant of cell tropism in HIV-1. Science, 1991. 253(5015): p. 71-4.

16. Shankarappa, R., et al., Evolution of human immunodeficiency virus type 1 envelope sequences in infected individuals with differing disease progression profiles. Virology, 1998. 241(2): p. 251-9.

17. Shioda, T., J.A. Levy, and C. Cheng-Mayer, Small amino acid changes in the V3 hypervariable region of gp120 can affect the T-cell-line and macrophage tropism of human immunodeficiency virus type 1. Proc Natl Acad Sci U S A, 1992. 89(20): p. 9434-8.

18. Pollakis, G., et al., Phenotypic and genotypic comparisons of CCR5- and CXCR4-tropic human immunodeficiency virus type 1 biological clones isolated from subtype C-infected individuals. J Virol, 2004. 78(6): p. 2841-52.



20. Rosen, O., et al., Molecular switch for alternative conformations of the HIV-1 V3 region: implications for phenotype conversion. Proc Natl Acad Sci U S A, 2006. 103(38): p. 13950-5.

21. Jensen, M.A., et al., Improved coreceptor usage prediction and genotypic monitoring of R5-to-X4 transition by motif analysis of human immunodeficiency virus type 1 env V3 loop sequences. J Virol, 2003. 77(24): p. 13376-88.

22. Sing, T.B., N. Kaiser, R. Hoffmann, D. Däumer, M. Lengauer, T., Geno2pheno[coreceptor]: A tool for predicting coreceptor usage from genotype and for monitoring coreceptor-associated sequence alterations. 2005.

23. Lengauer, T., et al., Bioinformatics prediction of HIV coreceptor usage. Nat Biotechnol, 2007. 25(12): p. 1407-10.



26. O'Brien, T.R., et al., HIV-1 infection in a man homozygous for CCR5 delta 32. Lancet, 1997. 349(9060): p. 1219.





30. Meyer, L., et al., Early protective effect of CCR-5 delta 32 heterozygosity on HIV-1 disease progression: relationship with viral load. The SEROCO Study Group. AIDS, 1997. 11(11): p. F73-8.

31. Kapoor, A., et al., Sequencing-based detection of low-frequency human immunodeficiency virus type 1 drug-resistant mutants by an RNA/DNA heteroduplex generator-tracking assay. J Virol, 2004. 78(13): p. 7112-23.

32. Palmer, S., et al., Persistence of nevirapine-resistant HIV-1 in women after single-dose nevirapine therapy for prevention of maternal-to-fetal HIV-1 transmission. Proc Natl Acad Sci U S A, 2006. 103(18): p. 7094-9.

33. Palmer, S., et al., Multiple, linked human immunodeficiency virus type 1 drug resistance mutations in treatment-experienced patients are missed by standard genotype analysis. J Clin Microbiol, 2005. 43(1): p. 406-13.




37. Pop, M. and S.L. Salzberg, Bioinformatics challenges of new sequencing technology. Trends Genet, 2008. 24(3): p. 142-9.

38. Brockman, W., et al., Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res, 2008.

39. Quinlan, A.R., et al., Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods, 2008. 5(2): p. 179-81.

40. Hoffmann, C., et al., DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.


41. O'Meara, D., et al., Monitoring resistance to human immunodeficiency virus type 1 protease inhibitors by pyrosequencing. J Clin Microbiol, 2001. 39(2): p. 464-73.



44. Smith, T.F. and M.S. Waterman, Identification of common molecular subsequences. J Mol Biol, 1981. 147(1): p. 195-7.

45. Guindon, S. and O. Gascuel, A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol, 2003. 52(5): p. 696-704.

46. Lewis, M., et al., Evaluation of an Ultra-Deep Sequencing Method to Identify Minority Sequence Variants in the HIV-1 env Gene from Clinical Samples, in 14th Conference on Retroviruses and Opportunistic Infections. 2007: Los Angeles, USA.

47. Grenfell, B.T., et al., Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 2004. 303(5656): p. 327-32.


Chapter 8: Final Discussion

The extensive diversity generated within HIV-1’s phylogeny is daunting in

relation to the production of a successful therapeutic or preventative vaccine

[1-3]. Recently, after more than 33 million deaths since the introduction of

the virus into the human population, the scale of this task has been

highlighted by the failure of the once promising Merck STEP vaccine clinical

trials [4, 5]. This, along with other failures [3, 6], suggests that some

significant decisions about the direction of research in relation to HIV-1 must

be made. For example, in the face of such diversity, is it actually possible to

design a successful vaccine, or are researchers misplacing their efforts? If

the latter is the case, should there be more focus on non-vaccine-based

strategies of reducing viral load within individual hosts such as co-receptor

blockage? From the chapters of this thesis it could be suggested that it is

not yet time to abandon hope for the development of a therapeutic or

preventative vaccine – although novel algorithmic approaches in relation to

its development are required as the traditional approaches to vaccine design

will not be effective in dealing with the diversity being generated by HIV-1.

Let us begin this concluding chapter by returning to the clear quantifiable

differences that exist between the global group M subtypes and the poorly

defined group O clusters [7]. In chapter 4, in agreement with Roques et al.

[8], it was proposed that it is inappropriate to draw direct parallels between

these two groups in relation to diversity present within clusters observed on

each of their phylogenetic tree topologies. The weak clustering found within

the group O phylogenetic topology has emerged on a localized scale and is

not significant in relation to randomly generated trees (Fig. 4.3). As a result

group O clusters contain little useful sequence information in relation to

designing cluster specific vaccines. Usage of individual strains, consensus

sequences [9-11] or even novel algorithmic approaches (chapter 6) in

attempts to develop a cluster specific vaccine for the group would most likely

fail and would be a wasteful diversion of resources. In contrast the well

defined global group M subtypes have emerged as a result of a more

complicated epidemiological history (Fig. 4.5) involving the passaging of the

virus in association with different localized risk groups following a severe

founder effect [12, 13]. However, even in the case of these well defined

subtypes the use of subtype specific vaccines, although initially promising

[11], fails as a result of the extensive diversity present. Traditional

approaches for vaccine design [9-11] are simply not going to work.

Novel algorithmic approaches, moving away from a dependency on tree

topology, are required in order to generate artificial sequence constructs that

maximize the coverage of the epitope diversity present within subsets of the

sequence variation constituting the global pandemic [14, 15]. Such

constructs could then potentially be used as a starting point for the design of

a polyvalent vaccine. To date, early attempts at algorithmic approaches,

although promising, have largely ignored the reality that not all variation

within sampled strains is of significance (Fig. 6.2, Panel A). With the

removal of “unimportant” variation, improved algorithms can be developed

(Fig. 6.5) in order to generate artificial sequence constructs (Fig. 8.1) that

maximize the epitope coverage across the meaningful diversity within

individual subtypes in relation to specific geographic locations or within

specific risk groups (Fig. 6.4).
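
The idea of scoring a set of constructs by its nine-mer coverage can be illustrated with a minimal Python sketch. It is not the algorithm of chapter 6 (Fig. 6.5); the function names `kmers` and `coverage` and the toy sequences are assumptions made purely for illustration, and the input is assumed to consist of gap-free protein sequences.

```python
from collections import Counter

K = 9  # window length; nine-mers are used as epitope-sized units


def kmers(seq, k=K):
    """Return every overlapping k-mer in a gap-free protein sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(constructs, sample, k=K):
    """Fraction of k-mer occurrences in `sample` present in any construct."""
    covered = {km for c in constructs for km in kmers(c, k)}
    counts = Counter(km for s in sample for km in kmers(s, k))
    total = sum(counts.values())
    hit = sum(n for km, n in counts.items() if km in covered)
    return hit / total if total else 0.0


# Toy sequences invented for illustration only (not thesis data).
sample = ["MGARASVLSGGE", "MGARASVLSGGK", "MGARASILSGGE"]
one = coverage(["MGARASVLSGGE"], sample)
two = coverage(["MGARASVLSGGE", "MGARASILSGGK"], sample)
```

In this toy case the second construct raises the covered fraction, which is the intuition behind a polyvalent cocktail: each added construct is worthwhile only if it contributes nine-mers that the existing constructs miss.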

Such approaches could be improved by accounting for additional features of

protein structure and function, such as functional sites [16], the requirement

to maintain binding interfaces [17], and intra- and inter-molecular interactions

[18], as well as by the inclusion of known antigenic epitopes [19]. The latter

is especially important to consider as by maximizing nine-mer coverage

across the alignment, without consideration of “good” epitope inclusion, the

algorithm is implicitly selecting for the inclusion of more conserved nine-mers

within the construct sequences (Fig. 8.1). Such nine-mers themselves may

not be the ideal antigenic targets as intuitively their conservation implies that

they may be under less immune pressure than the less conserved nine-mers.

It is therefore important to ensure the inclusion of known epitopes

along with the inclusion of high frequency nine-mers.
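
One simple way to combine the two requirements is to reserve places for known epitopes before filling the remainder by frequency. The Python sketch below is a hypothetical selection scheme, not the method of chapter 6; the function name `select_nine_mers`, its parameters and the toy sequences are assumptions introduced for illustration.

```python
from collections import Counter


def select_nine_mers(sample, known_epitopes, n, k=9):
    """Pick n k-mers: known epitopes seen in the sample first, then by frequency."""
    counts = Counter(s[i:i + k] for s in sample for i in range(len(s) - k + 1))
    # Seed the selection with known epitopes that actually occur in the sample.
    seeded = sorted((e for e in known_epitopes if e in counts),
                    key=lambda e: (-counts[e], e))[:n]
    # Fill the remaining places with the most frequent k-mers not yet chosen.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rest = [km for km, _ in ranked if km not in seeded]
    return seeded + rest[:n - len(seeded)]


# Toy data invented for illustration only (not thesis data).
sample = ["AAAAAAAAAC"] * 3 + ["DEFGHIKLMN"]
by_frequency = select_nine_mers(sample, set(), 2)
with_epitope = select_nine_mers(sample, {"EFGHIKLMN"}, 2)
```

With an empty epitope set the selection contains only the most frequent (most conserved) nine-mers; supplying a known epitope guarantees it a place even though it is rare in the sample.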

Figure 8.1 Novel Approach to Generating Optimized Constructs

Examples of sequence constructs generated in chapter 6 across the p17 region for HIV-1

group M sequences (orange box). The plots were taken from figure 6.4 panels A and C.

The numbers below the sequences in the orange box represent the local covering score of

selected nine-mers as described in chapter 6.

These improvements can be incorporated, with some minor modifications to

our algorithm (Fig. 6.5), into the approach that we have taken within chapter

6 during the sequence construction process after unnecessary diversity has

been removed from the input datasets. This is the topic of future work. The

sooner the effects of these improvements are investigated the sooner we will

know if efforts to produce a therapeutic vaccine are viable. If such novel

approaches fail, like previous more traditional attempts [3-6], the future for

vaccine design will start to look bleak.

If polyvalent cocktails of artificial sequence constructs, providing adequate

epitope coverage for a given subtype within specific geographic regions or

specific risk groups, could eventually be created using an algorithmic

approach – the problems caused by the extensive diversity present within

HIV-1’s phylogeny would still be far from over. This is because the subtype

classification system has been devised through the sampling of strains

globally [12, 13]. Presently strains from the DRC region are classified

according to the global subtype that they fall closest to (Chapter 4). The

distinctness of these “globally”-defined “pre-DRC data” subtypes is due to

individual strains being exported from the DRC region followed by

subsequent diversification within localized host populations (Fig. 4.5). They

do not accurately represent the diversity that is present within the centre of

the pandemic (Fig. 4.4). Classifying strains from the DRC region according

to the current global classification system hugely under represents this

extensive diversity within the region [12, 13, 20].

As a result a globally defined subtype specific cocktail of artificially

constructed sequences could not guarantee protection within the centre of

the pandemic for the strains classified according to the subtype that the cocktail

was designed for. Furthermore, within specific geographical regions or

global risk groups a single subtype specific cocktail could not guarantee

adequate protection against newly emerging divergent strains from the

epicenter that have been loosely classified according to the current global

subtype classification system. Recombination between the current

subtypes, where geographic regions or risk groups come into contact with

each other - thus allowing for the occurrence of dual infection, would also

lead to problems for such subtype specific polyvalent cocktails due to the

generation of new divergent strains. A dynamic approach to vaccine

development is thus required that would involve the constant monitoring of

the diversity present within subtypes, risk groups and geographic regions

followed by the continuous updating of the sequences included within the

polyvalent cocktail covering these specific risk groups and regions.

Despite these problems the potential ability to cover large portions of the

global population away from the epicenter with a preventative or therapeutic

vaccine would be a breakthrough in relation to tackling the global pandemic.

But with the epicenter left without a preventative or therapeutic vaccine, even

within a globally vaccinated population there will always be the possibility of

re-emerging strains forming new resistant clusters (Fig. 4.5). Unfortunately,

the extent of diversity present within the epicenter (Fig. 4.4) makes the

possibility for the development of a polyvalent vaccine for the region highly

remote. Thus, although it may be eventually possible to control the virus

globally and reduce the number of infections, the threat of re-emerging strains

will remain a long term problem for the foreseeable future.

As briefly mentioned above, a major contributory factor to this problem is

recombination, which contributes significantly to the diversity present

within HIV-1’s phylogeny at both the intra- and inter-host levels (Chapter 5).

However not all recombinant breakpoints are equivalent in relation to the

persistence of the virus within the global pandemic. Across much of the viral

genome (Fig. 5.4) breakpoint positioning can be explained by a simple

mechanistic process that is largely described by the model presented in

Chapter 5. Regions that deviate from the underlying mechanistic breakpoint

distribution are clearly of greater significance in terms of the global

pandemic. Most notably, the deviations at either end of the envelope gene identify it as a region

that is selectively transferred from one genetic background into another as

an integral cassette.

There is a need to focus on identifying the functional significance of

recombination in vivo rather than the blind mapping of breakpoints. Our

results in Chapter 5 emphasize that detailed mapping of individual HIV-1

recombinant structures should be considered in the context of a probabilistic

expectation generated by the process of template switching during reverse

transcription. Individual recombination breakpoints, analogous to point

mutations, will have varying consequences for viral persistence in infected

individuals and populations. Our findings provide the first clear indication of

how recombinant forms predominantly influence the viral population in the

ongoing AIDS pandemic.

So far within this concluding chapter the focus has largely been on HIV-1’s

diversity at the inter-host level. At the intra-patient level the

generation of diversity is also immense [21, 22]. However during the early

stages of infection there is a large dominance of the CCR5-using phenotype

[23]. Reasons for this are unknown but may include selection at the point of

transmission [24-26], a higher cytopathicity of CXCR4-using variants

resulting in host cells with a shorter life span [27] as well as differing fitness

levels of the two variants at different stages of disease progression [28]. The

consequence of the CCR5-using phenotype dominance is that it provides a

novel drug target early on within the host.

The recently developed small-molecule drug, Maraviroc, attempts to

control viremia by blocking the CCR5 co-receptor thus making it unavailable

for HIV-1 cell entry [29, 30]. This effectively simulates the CCR5Δ32

phenotype [31-34] observed within some individuals that results in either a

delayed progression to AIDS (heterozygous) or a strong resistance to HIV

(homozygous). However with the observation that the presence of the

CXCR4-using phenotype pre-treatment predicts treatment failure [35], it has

become vitally important to the producers of Maraviroc, Pfizer, to develop

a cost-effective genotypic system for the detection of such variants that exist

within the population in order to determine the effectiveness of the drug

within an individual. Pyrosequencing technology makes the cheap and

accurate detection of these low frequency variants possible [36-39] for

individual patients. Until recently, the analysis of large amounts of short-read

pyrosequencing data had proven a monumental task [40].

In chapter 7 we develop and present a protocol (Fig. 7.4) that simplifies the

handling of pyrosequenced data. We used the protocol in conjunction with

three known HIV-1 phenotype tests in order to detect the pre-treatment

presence of low frequency CXCR4-using variants within an individual drug

failure patient. Phylogenetic analysis (Fig. 7.8) and phenotype test results

(Table 7.2) strongly confirmed, in agreement with Westby et al. [35], that

these variants initially emerged pre-treatment and that the drug was most

probably not responsible. Later on during treatment the low frequency

variants were selected for and so there was a failure in the reduction of

viremia due to the positive selection of the CXCR4-using phenotype. In a

well managed environment, with sensitive low frequency variant detection

tools – such as that presented in figure 7.4, maraviroc remains a good

candidate for the control of viremia within some infected individuals. With

such novel compounds and new approaches to vaccine design it is possible

to remain optimistic about the future development of preventative and

treatment strategies for HIV.

In relation to the protocol presented in figure 7.4 future work will involve

implementing this protocol as a software package that can be used routinely

to process such data. Such a package will make the advantages of

pyrosequencing readily available to large portions of the HIV-1 research

community.
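
As a rough illustration of how low-frequency CXCR4-using variants might be flagged in a set of translated V3 reads, the following Python sketch counts identical reads and applies the widely used 11/25 charge rule (a basic residue, R or K, at V3 position 11 or 25). The rule is a stand-in chosen for brevity; it is not claimed to be one of the three tests used in chapter 7, nor the protocol of figure 7.4. The function names and the two toy reads are hypothetical, and reads are assumed to be gap-free, 35-residue V3 loops.

```python
from collections import Counter

BASIC = {"R", "K"}  # positively charged residues considered by the 11/25 rule


def is_x4_by_11_25(v3):
    """Apply the 11/25 rule to a gap-free, 35-residue V3 amino acid sequence."""
    return len(v3) == 35 and (v3[10] in BASIC or v3[24] in BASIC)


def x4_variant_frequencies(reads):
    """Map each distinct read flagged as CXCR4-using to its frequency."""
    counts = Counter(reads)
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items() if is_x4_by_11_25(seq)}


# Toy reads for illustration only: the second differs by S11R.
R5_LIKE = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
X4_LIKE = "CTRPNNNTRKRIHIGPGRAFYTTGEIIGDIRQAHC"
reads = [R5_LIKE] * 98 + [X4_LIKE] * 2
flagged = x4_variant_frequencies(reads)
```

A real package would also need to handle alignment, indels and sequencing error before any such rule is applied; those steps are deliberately left out of this sketch.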


Conclusion

With our current understanding of the diversity present within HIV-1 there

appears to be potential for progress in relation to the development of

a therapeutic or preventative vaccine in the not too distant future. If this does

fail, however, there is still the possibility of developing successful treatment

regimes with novel compounds such as maraviroc. The former would be a

more affordable option in relation to the many countries where expensive

and tiresome drug regimes would not be practical. The latter would require

the development of more sophisticated tools for the careful monitoring of low

frequency viral variants within individual hosts. Despite this optimism, even if a

successful vaccine or treatment regime were to be produced, HIV-1 is going

to remain a long term problem within the human population. This is primarily

as a result of the diversity generated at the epicenter of the pandemic. The

best we can hope for in the near future is the better management of the

pandemic in relation to the geographic areas outside of this region.


References

1. Garber, D.A., G. Silvestri, and M.B. Feinberg, Prospects for an AIDS vaccine: three big questions, no easy answers. Lancet Infect Dis, 2004. 4(7): p. 397-413.

2. Sheppard, N. and Q. Sattentau, The prospects for vaccines against HIV-1: more than a field of long-term nonprogression? Expert Rev Mol Med, 2005. 7(2): p. 1-21.

3. Cohen, J., Planned tests in Thailand spark debate. Science, 1997. 276(5316): p. 1197.

4. HIV vaccine failure prompts Merck to halt trial. Nature, 2007. 449(7161): p. 390.

5. Cohen, J., AIDS research. Promising AIDS vaccine's failure leaves field reeling. Science, 2007. 318(5847): p. 28-9.

6. McCarthy, M., HIV vaccine fails in phase 3 trial. Lancet, 2003. 361(9359): p. 755-6.





11. Weaver, E.A., et al., Cross-subtype T-cell immune responses induced by a human immunodeficiency virus type 1 group m consensus env immunogen. J Virol, 2006. 80(14): p. 6745-56.



13. Worobey, M., The Origins and Diversification of HIV. Global HIV/AIDS Medicine, 2007: p. 13–21.



16. Fiorentini, S., et al., Functions of the HIV-1 matrix protein p17. New Microbiol, 2006. 29(1): p. 1-10.

17. Chelliah, V., et al., Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol, 2004. 342(5): p. 1487-504.

18. Overington, J., et al., Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc R Soc Lond B Biol Sci, 1990. 241(1301): p. 132-45.

19. Frahm N, L.C., Brander C, Identification of HIV-Derived, HLA Class I Restricted CTL Epitopes: Insights into TCR Repertoire, CTL Escape and Viral Fitness. HIV Sequence Compendium, ed. F.B. Thomas Leitner, Hahn B, Marx P, McCutchan F, Mellors J, Wolinsky S, Korber B. 2006/2007: Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM.

















34. O'Brien, T.R., et al., HIV-1 infection in a man homozygous for CCR5 delta 32. Lancet, 1997. 349(9060): p. 1219.

35. Westby, M., et al., Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol, 2006. 80(10): p. 4909-20.

36. Huse, S.M., et al., Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol, 2007. 8(7): p. R143.






Appendix I: HIV-1 group M Gag and Pol Trees

Supplementary Figure 1 Phylogenetic History of HIV-1 Group M

Panel A displays a tree that was re-constructed using global-group M gag sequences.

Panel B displays a tree that was re-constructed using global-group M pol sequences. The

‘*’ indicates bootstrap support greater than 90%. The scale bar corresponds to nucleotide

substitutions per site. In panels A and B the bold lettering corresponds to the subtype

designations from the LANL HIV Sequence Database.


Appendix II: HIV-1 group O Gag and Pol Trees

Supplementary Figure 2 Phylogenetic History of HIV-1 Group O

Panel A displays a tree that was re-constructed using group O gag sequences. Panel B

displays a tree that was re-constructed using group O pol sequences. The ‘*’ indicates

bootstrap support greater than 90%. The scale bar corresponds to nucleotide substitutions

per site. The bold numbers represent the clusters that correspond to previously proposed

clusters as described in chapter 4.


Appendix III: Calculating the Reduction in Breakpoint Occurrence

Supplementary Figure 3 Reduction in Breakpoint Occurrence

The triangles represent the frequency of breakpoints occurring within breakpoint zones of

size 5 or less. The rectangles represent the frequency of breakpoints occurring within zones

of greater than size 5. The slope of the line of best fit through the data within the smaller

breakpoint zones is 0.0153 while for the larger zones the slope of the line of best fit is

0.0415. The α parameter of equations 5 and 6 from chapter 5 is the ratio between these

(0.37).
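
The arithmetic behind the reported value can be checked with a short Python snippet. The variable names are illustrative; only the two slopes are taken from the caption above.

```python
# Slopes of the lines of best fit reported in Supplementary Figure 3.
slope_small_zones = 0.0153  # breakpoint zones of size 5 or less
slope_large_zones = 0.0415  # breakpoint zones of size greater than 5

# Alpha is the ratio of the smaller-zone slope to the larger-zone slope.
alpha = slope_small_zones / slope_large_zones
alpha_rounded = round(alpha, 2)
```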


Appendix IV: Amino Acid Occurrence within the p17

Supplementary Table 1 Probability of Amino Acid Occurrence within the p17 gene


Appendix V: TP and FP rates for the Reduction Model

Supplementary Table 2 TP and FP rates for the Model Described in Figure 6.1
