The Diversity of HIV-1
A thesis submitted to The University of Manchester for the Degree of
PhD
in the Faculty of Life Sciences
2008
John Patrick Archer
List of Contents
List of Figures
List of Tables
List of Abbreviations
Abstract
Declaration
Acknowledgements
Chapter 1: Biological Aspects of HIV Evolution
  What is HIV
  Phylogeny of HIV-1
  The Origins of HIV
  Recombination
  Dealing with Global HIV-1 Diversity
  Co-receptor Usage
  Remaining Chapters
  References
Chapter 2: Sequence Data and Phylogenetic Trees
  Molecular Phylogeny
  Global Alignments
  Local Alignments
  Measuring Genetic Change
  Phylogenetic Trees
  Neighbour Joining
  Maximum Likelihood Trees
  References
Chapter 3: CTree - comparison of clusters between phylogenetic trees made easy
  Abstract
  Introduction
  Novel Features
    (i) Heuristically Defining Clusters
    (ii) Manually Defining Clusters
    (iii) SDR And SDV
    (iv) Working with Random Trees
    (iv) Finding the Center of the Tree (COT)
  The Heuristic clustering Algorithm
    (i) The Explore Phase
    (ii) The Selection Phase
  Standard Features
  Typical Usage
  Acknowledgements
  References
Chapter 4: Understanding the Diversification of HIV-1 groups M and O
  Abstract
  Introduction
  Methods
    Tree Metrics
    Datasets
    Phylogenetic Analysis
    Random Trees
  Results
  Discussion
  Acknowledgements
  References
Chapter 5: Prediction of HIV-1 Recombination Breakpoint Location
  Abstract
  Introduction
  Results
    Model Parameters
    In Vitro Breakpoint Predictions
    Global (in vivo) Breakpoints
  Discussion
  Methods
    In Vitro Recombinant Breakpoints
    In Vitro Breakpoint Distributions
    Predicted Recombinant Breakpoints
    Random Breakpoint Distributions
    Breakpoint Distributions incorporating Mismatch Influence
    Model Breakpoint distributions
    Global CRFs
  References
Chapter 6: A Strategy for Identifying Significant HIV Sequence Diversity Using Structural Constraints
  Abstract
  Introduction
  Results
    Predicting viral evolution
    Using predictions to inform vaccine design
  Discussion
  Methods
    The model
    Data sets
    The coverage algorithm
  Acknowledgements
  References
Chapter 7: Detection of Low Frequency CXCR4-Using HIV-1 with Ultra-deep Pyrosequencing
  Abstract
  Introduction
  Methods
    Datasets
    Day 1 and Day 11 datasets
    Database entropy and amino acid usage
  Results
  Discussion
    Future Developments
  Acknowledgements
  References
Chapter 8: Final Discussion
  Conclusion
  References
Appendix I: HIV-1 group M Gag and Pol Trees
Appendix II: HIV-1 group O Gag and Pol Trees
Appendix III: Calculating the Reduction in Breakpoint Occurrence
Appendix IV: Amino Acid Occurrence within the p17
Appendix V: TP and FP rates for the Reduction Model
(Word Count: 48,258)
List of Figures
Figure 1.1 Impact of HIV-1
Figure 1.2 Biology of HIV-1
Figure 1.3 Phylogenetic Topology of HIV-1 Group M
Figure 1.4 Epicentre of the HIV-1 Group M Pandemic
Figure 1.5 Circulating Recombinant Forms
Figure 1.6 HIV-1 Group O
Figure 1.7 Cross Species Transmission
Figure 1.8 Recombinant origin of SIVcpz
Figure 1.9 Non Random Recombination
Figure 1.10 Four Phylogenetic shapes
Figure 1.11 How Pyrosequencing Works
Figure 2.1 Scoring Matrix for Aligning Two Sequences
Figure 2.2 Recursion and Trace-back
Figure 2.3 Example of a Phylogenetic Tree
Figure 2.4 Bootstrap Analysis
Figure 3.1 Interface for CTree
Figure 3.2 Clustering Sequences on a Tree
Figure 4.1 Subtype Diversity Ratio and Subtype Diversity Variance
Figure 4.2 Phylogenetic History of HIV-1 Group M and Group O
Figure 4.3 Subtype Diversity Ratio Distributions
Figure 4.4 The Center of the Group M Pandemic
Figure 4.5 Model of HIV-1 Group M Subtype Emergence
Figure 5.1 Prediction of Recombinant Breakpoints
Figure 5.2 Effects of Sequence Identity on Breakpoint Location
Figure 5.3 Predicted Breakpoint Distributions for gp120
Figure 5.4 Predicted Breakpoint Distributions for the Entire Genome
Figure 6.1 Predicting HIV Evolution at Individual Sites
Figure 6.2 True Positives, False Positives and True Negatives
Figure 6.3 Amino Acid Frequencies within P17
Figure 6.4 Distribution of Nine-mers in Relation to Coverage
Figure 6.5 Generation of Optimised Sequence Constructs
Figure 6.6 Sequence Coverage
Figure 7.1 Sequence logos of the CCR5 and CXCR4-using viruses
Figure 7.2 Cell Entry Inhibition
Figure 7.3 Generation of a Flowgram
Figure 7.4 Protocol for Handling Pyrosequenced Data
Figure 7.5 Segment Length, Alignment Score and Frequency
Figure 7.6 Nucleotide Coverage Across gp160
Figure 7.7 Shannon Entropy Across gp160
Figure 7.8 Phylogenetic Tree of Day 1 and Day 11 V3 Segment Data
Figure 7.9 Updated Protocol
Figure 8.1 Novel Approach to Generating Optimized Constructs
List of Tables
Table 7.1 Rates of Insertion and Deletion
Table 7.2 HIV-1 Phenotypes Counts
Table 7.3 Alternate Phenotype Counts
List of Abbreviations
BLOSUM Blocks Amino Acid Substitution Matrices
COT Center Of Tree
CRF Circulating Recombinant Form
CTL Cytotoxic T Lymphocyte
dn Non Synonymous
DRC Democratic Republic Of Congo
ds Synonymous
FN False Negative
FP False Positive
HAART Highly Active Antiretroviral Therapy
HIV Human Immunodeficiency Virus
HPS Homopolymeric Stretch
MHC Major Histocompatibility Complex
NSI Non-Syncytium Inducing
PAL Phylogenetic Analysis Library
PAM Percent Accepted Mutation
PSSM Position-Specific Scoring Matrices
RT Reverse Transcriptase
SDR Subtype Diversity Ratio
SDV Subtype Diversity Variance
SI Syncytium Inducing
SIV Simian Immunodeficiency Virus
TN True Negative
TP True Positive
URF Unique Recombinant Form
Abstract
The phylogeny of the Human Immunodeficiency Virus type 1 (HIV-1) is
characterized by extensive diversity. However, when it comes to the
persistence of the virus, not all of this diversity is relevant. In this thesis I
show that the significance of diversity, both in relation to epidemiology and
vaccine design, varies within and between the HIV-1 groups. The
substructure present within group O and at the center of the group M
pandemic was observed to be similar to that found on random tree
topologies, whereas the substructure present within globally sampled group
M strains was significantly different. Next, an algorithm for generating
artificial sequence constructs from this diversity, while maximizing the
inclusion of “meaningful” potential epitopes, was developed and analyzed.
The utilization of “meaningful” diversity, along with the characterization of the
significance of diversity, will have profound effects on the future of vaccine
development and control strategies. Recombination contributes significantly
to the diversity present within viruses classified as HIV-1. As with the
extensive diversity caused by point mutations and indels, I show that not all
recombinant breakpoints are important in relation to the persistence of the
virus. Many recombination events are a result of high sequence identity
between RNA templates, which encourages template switching during
reverse transcription. When documenting the locations of recombinant
breakpoints, their importance in relation to viral persistence should be taken
into account. Within the final research chapter of this thesis I developed a
novel protocol for detecting low frequency variants within viral populations
using pyrosequenced data. To summarize, the aim of this thesis was to
characterize the significance of HIV-1 diversity, as well as to develop novel
approaches to make this diversity both more understandable and tractable in
terms of intervention strategies.
Declaration
No part of this thesis has been submitted in support of an application for any
degree or qualification of The University of Manchester or any other
University or Institute of learning.
Submission in Alternative Format
This thesis has been submitted in alternative format with permission from the
Faculty of Life Sciences Graduate Office. A summary of the chapters can be
found at the end of chapter 1.
Copyright Statement
The author of this thesis (including any appendices and/or schedules to this
thesis) owns any copyright in it (the “Copyright”) and s/he has given The
University of Manchester the right to use such Copyright for any
administrative, promotional, educational and/or teaching purposes.
Copies of this thesis, either in full or in extracts, may be made only in
accordance with the regulations of the John Rylands University Library of
Manchester. Details of these regulations may be obtained from the
Librarian. This page must form part of any such copies made.
The ownership of any patents, designs, trade marks and any and all other
intellectual property rights except for the Copyright (the “Intellectual Property
Rights”) and any reproductions of copyright works, for example graphs and
tables (“Reproductions”), which may be described in this thesis, may not be
owned by the author and may be owned by third parties. Such Intellectual
Property Rights and Reproductions cannot and must not be made available
for use without the prior written permission of the owner(s) of the relevant
Intellectual Property Rights and/or Reproductions.
Further information on the conditions under which disclosure, publication and
exploitation of this thesis, the Copyright and any Intellectual Property Rights
and/or Reproductions described in it may take place is available from the
Head of School of (insert name of school) (or the Vice-President) and the
Dean of the Faculty of Life Sciences, for Faculty of Life Sciences’
candidates.
Acknowledgements
This thesis is dedicated to my supervisor David L. Robertson who provided
continuous ideas, advice, insight, discussion and support in a patient and
devoted manner.
I would like to thank the following people for academic support and
discussion: Kathryn Else, Simon Lovell, Marilyn Lewis, John Pinney, Vicki
Kelly, Nick Gresham, Adam Huffmann, Simon Williams, Etienne Simon-
Loriere, James Eales, Jonathan Dickerson, Jamie MacPherson, Jun Fan,
Julie Huxley–Jones, Raquel Linheiro and all staff and students of the
bioinformatics corridor.
I would also like to thank Sara-Jane, Joel, Molly, Sorcha and my parents,
Chapter 1: Biological Aspects of HIV Evolution
What is HIV?
The human immunodeficiency virus (HIV) is the causative agent of Acquired
Immune Deficiency Syndrome (AIDS). There are two phylogenetically
distinct types of HIV, referred to as HIV-1 and HIV-2. A characteristic of HIV
infection is the relatively long infection period within the host, which in the
absence of treatment is followed by eventual progression to AIDS. The long
infection period, in conjunction with population demographics, has led to
HIV-1's unique epidemiology [1], which affects over 33.2 million individuals [2]
worldwide. In 2007 over 2.5 million people were newly infected and
another 2.1 million succumbed to the disease, so it is clear that the virus
continues to cause much hardship and misery, as it has since its entry into
the human population during the first half of the twentieth century [3]. Over
two thirds of infections occur in sub-Saharan Africa (Fig. 1.1), where access
to treatment is often non-existent. Without access to long-term drug
treatment the mortality rate is near 100%.
The virus itself belongs to the group of retroviruses known as lentiviruses [4].
These are spherical in shape, roughly 80 to 100 nm in diameter, and possess
two usually identical copies of a single-stranded RNA genome approximately
10 kb in length [5]. Following reverse transcription the genome must
integrate into the host cell's DNA in order to replicate (Fig. 1.2, panel A). The
lifecycle within an individual host is characterized by exceptionally high
replication [6, 7], mutation [8] and recombination [9, 10] rates which,
combined with the positive selection promoted by the host's immune
response [7, 11, 12], result in the huge amount of diversity observed at both
the intra-host [13] and inter-host [14] levels. This enormous amount
of diversity is responsible for the long-term persistent infection of individual
hosts as well as for the current lack of a preventative or therapeutic vaccine
[15].
As with all retroviruses, the genome of HIV (Fig. 1.2, panel B) possesses
three major coding regions: gag, whose products are the internal viral
proteins such as those needed for the matrix (p17), the capsid (p24) and
nucleocapsid (p7) structures; pol, whose products are the enzymes
responsible for reverse transcription (reverse transcriptase) and integration
(integrase); and env, whose products consist of two types of envelope
protein, gp120 and gp41 [5]. gp120 mediates virus attachment and entry
into host cells via interaction with host cell receptors. gp41 mediates
membrane fusion. Other smaller accessory genes are: vif, which
suppresses host factors (APOBEC3G and APOBEC3F) that inhibit infection;
vpr, which enhances post-cell-entry infectivity; vpu, which is only found in
HIV-1; vpx, which is only found in HIV-2; tat, which is involved in the activation
of viral transcription; nef, which down-regulates CD4/MHC expression; and rev,
which induces nuclear export of viral RNA [5]. The 5’ and 3’ LTRs mark the
ends of the genome.
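The gene inventory above can be summarised as a small lookup table. The following is only an illustrative sketch: the dictionary layout and the names `HIV_GENES`, `TYPE_SPECIFIC` and `genes_for` are hypothetical and are not part of any software described in this thesis.

```python
# Illustrative lookup table built from the gene descriptions in the text.
# The names HIV_GENES, TYPE_SPECIFIC and genes_for are hypothetical.
HIV_GENES = {
    "gag": "internal viral proteins: matrix (p17), capsid (p24), nucleocapsid (p7)",
    "pol": "enzymes: reverse transcriptase and integrase",
    "env": "envelope proteins gp120 and gp41",
    "vif": "suppresses host factors that inhibit infection",
    "vpr": "enhances infectivity after cell entry",
    "vpu": "accessory gene found only in HIV-1",
    "vpx": "accessory gene found only in HIV-2",
    "tat": "involved in activation of viral transcription",
    "nef": "down-regulates CD4/MHC expression",
    "rev": "induces nuclear export of viral RNA",
}

# Genes restricted to one HIV type; every other gene is shared by both types.
TYPE_SPECIFIC = {"vpu": "HIV-1", "vpx": "HIV-2"}


def genes_for(virus_type):
    """Return the sorted gene names expected for 'HIV-1' or 'HIV-2'."""
    if virus_type not in ("HIV-1", "HIV-2"):
        raise ValueError("virus_type must be 'HIV-1' or 'HIV-2'")
    return sorted(
        gene for gene in HIV_GENES
        if TYPE_SPECIFIC.get(gene, virus_type) == virus_type
    )


hiv1_genes = genes_for("HIV-1")
hiv2_genes = genes_for("HIV-2")
```

Under this layout the two types differ only in vpu and vpx, which mirrors the description above.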
Figure 1.1 Impact of HIV-1
Distribution of the impact of HIV on the global population. Figures taken from [2].
Figure 1.2 Biology of HIV-1
(A) The life-cycle of HIV within a host cell is represented by the black arrows. The red lines
represent viral RNA while the blue lines represent the reverse transcribed viral DNA. On
initial contact with a host cell membrane gp120 spikes on the surface of the virus must bind
to specific host receptors and co-receptors. The primary host cell receptor that gp120 binds
to is CD4. T-lymphocytes and cells of the monocyte/macrophage lineage both express the CD4
receptor. Secondary co-receptors, such as CXCR4 and CCR5, will be discussed in detail
under the section Co-receptor Usage and Cellular Tropism. After membrane fusion viral
RNA is reverse transcribed into DNA by viral reverse transcriptase; during this process
copy-choice recombination may occur. Viral DNA can then enter the host cell nucleus
where it is incorporated into the host's DNA by virally encoded integrase. Transcription of viral
RNA then takes place. Some of the RNA transcribed will be complete genomic RNA that
will be used within new virions while more will be used in the translation of viral proteins
required to package the genomic RNA. After packaging of genomic viral RNA the new virus
particle leaves the host cell. (B) Genome structure of HIV-1 (http://www.hiv.lanl.gov/). The
gene highlighted red is the HIV-1 specific vpu gene.
Phylogeny of HIV-1
Strains classified as HIV-1 fall into three distinct groups. These are labelled
M (Major), O (Outlier) and N (New or non M/non O) [16]. Group M is almost
entirely responsible for the global pandemic. The diversity present within the
group is extensive when compared to other rapidly evolving viral genomes
such as influenza [17]. When represented on a phylogenetic tree, strains
within group M form well-defined clusters. Nine of these clusters are
currently termed subtypes. These are labelled A to D, F to H, J and K [16]
and are consistent in their phylogenetic topology in relation to each other
regardless of the section of their genome being compared [14]. These
subtypes, supported by a low Subtype Diversity Ratio (SDR) [18] (Fig. 1.3,
panel B) as well as high bootstrap values [14], are roughly equidistant from
each other when represented on a phylogenetic tree (Fig. 1.3, panel A).
They also have long characteristic evolutionary branch lengths stretching
into them. As a result of these combined characteristics, HIV-1 group M’s
phylogeny is often described as being “double starburst”-like in nature.
Other significant clusters are formed by Circulating Recombinant Forms
(CRFs). These have arisen as a result of recombination events between
divergent HIV strains within individual hosts (http://www.hiv.lanl.gov). To
qualify as a CRF, three epidemiologically unlinked viruses with the same
mosaic genome and consistent phylogenetic clustering must be
characterized [16]. The nine subtypes are often referred to as being pure,
recombinant-free lineages. However, this is misleading as recombination is
an important part of HIV’s life cycle [9, 10, 19]. The subtypes have emerged
through a unique epidemiological history [14] and their classification has
been largely based on global sampling bias. Recently, Abecasis et al.
observed that subtype G is actually a recombinant lineage involving
subtypes A and J as well as the designated recombinant lineage CRF02_AG
[20]. Thus, despite the rigid classification system, the phylogeny of HIV
should be viewed as being highly dynamic in nature.
Figure 1.3 Phylogenetic Topology of HIV-1 Group M
(A) Neighbour joining phylogenetic tree re-constructed using global group M envelope
gp160 sequences sampled from the LANL HIV Sequence Database. The bold lettering
corresponds to the subtype designations from the Database. Asterisks (*) represent bootstrap values
greater than 90%. (B) The relationship between the SDR and the quality of clustering on a
phylogenetic tree as defined in [18].
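The SDR is defined in [18] and is not restated at this point. As a rough sketch only, assuming it is the ratio of the mean within-subtype pairwise distance to the mean between-subtype pairwise distance, it could be computed from a matrix of pairwise distances as follows. The function name `subtype_diversity_ratio` and the toy matrix are hypothetical and serve for illustration only.

```python
from itertools import combinations


def subtype_diversity_ratio(distances, subtypes):
    """Mean within-subtype distance divided by mean between-subtype distance.

    distances: square symmetric matrix (list of lists) of pairwise distances.
    subtypes:  one subtype label per sequence, in the same order as the matrix.
    Assumed definition; see [18] for the authoritative one.
    """
    within, between = [], []
    for i, j in combinations(range(len(subtypes)), 2):
        target = within if subtypes[i] == subtypes[j] else between
        target.append(distances[i][j])
    if not within or not between:
        raise ValueError("need at least one within- and one between-subtype pair")
    return (sum(within) / len(within)) / (sum(between) / len(between))


# Toy example: two tight clusters (A, A) and (B, B) that lie far apart.
toy = [
    [0.0, 0.1, 0.5, 0.5],
    [0.1, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.1],
    [0.5, 0.5, 0.1, 0.0],
]
sdr = subtype_diversity_ratio(toy, ["A", "A", "B", "B"])
```

Under this assumed definition, tight and well-separated clusters give a value well below 1, which is consistent with a low SDR indicating good clustering.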
As for any genome, different regions of the HIV genome have different rates
of evolution. Between the current nine subtypes, env amino acid sequences
are separated by approximately 25 to 35% [21], while for gag coding
sequences the separation is around 14% [22]. It has been suggested that,
given an estimated mutation rate of 1% per year within the env gene, any two
HIV group M subtypes have been separated for at least 30 years [22]. Although
mutations generate novel strains, which can potentially become the roots of new
lineages, their fixation into longer-term HIV-1 subtypes depends on more
complicated epidemiological factors. Grenfell et al. suggested that the long
infection period of HIV-1 is responsible for the slowly changing inter-host
phylogenetic topography [1]. Such a long infection period allows for
individual risk groups to be replenished over time by new potential hosts.
Thus the inter-host phylogeny of HIV reflects the behavioural, demographic
and epidemiological history of transmission rather than immune selection
alone, which is more responsible for the rapidly changing intra-host viral
phylogeny. Natural selection of individual strains becomes
obsolete during transmission of the virus between individuals within such risk
groups, because the spread of the virus is largely dependent on the
behavioural aspects of individuals, the size of the risk population and time,
and not on the fitness of the individual viral strain. In the absence of natural
selection, genetic drift has a large influence on the slow evolution of the current
lineages.
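The separation estimate quoted earlier in this section is attributed to [22], and the calculation behind it is not given here. As a back-of-envelope sketch only, assuming the 1% per year rate is applied directly to the observed divergence, the arithmetic would look as follows. The function `separation_years` and its `lineages` parameter are hypothetical.

```python
def separation_years(divergence_percent, rate_percent_per_year, lineages=1):
    """Years needed to accumulate the given pairwise divergence.

    lineages=1 applies the rate directly to the pairwise divergence.
    lineages=2 assumes both lineages accumulate change independently at
    that rate, which halves the estimate.
    """
    if rate_percent_per_year <= 0:
        raise ValueError("rate must be positive")
    return divergence_percent / (rate_percent_per_year * lineages)


low = separation_years(25, 1)    # lower end of the env divergence range
high = separation_years(35, 1)   # upper end of the env divergence range
per_lineage = separation_years(30, 1, lineages=2)
```

On the first assumption the 25 to 35% range corresponds to 25 to 35 years, which brackets the figure of roughly 30 years. If the rate were instead meant per lineage, the same divergence would imply about half that time, so the interpretation of the rate matters.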
The Democratic Republic of Congo (DRC) has been proposed to be the
epicentre of the HIV-1 group M pandemic [23]. Uniquely to this region, it was
possible to identify strains from each of the current global subtypes, with
the exception of subtype B [24]. A high degree of intra- and inter-subtype
diversity was also observed along with many unidentified strains [18, 23, 25].
Using the SDR it was shown by Rambaut et al. that group M strains from
within the DRC region had very little organized substructure when compared
to their global counterparts [18] (Fig. 1.4, panel A). In a tree consisting
solely of strains isolated from the DRC region, the double starburst shape
that is characteristic of the global group M phylogeny is no longer present
(Fig. 1.4, panel B). On the global tree (Fig. 1.3, panel A) the well-defined
clusters are the result of chance exportations of DRC strains, with a
resulting bottleneck effect, into new susceptible global risk groups followed
by subsequent diversification [18]. These founder effects are what gave rise
to the long evolutionary branches that extend into each of the subtypes [26].
As a result, the global subtype-based classification system does not reflect
the true extent of the diversity present within the epicentre of the pandemic.
This will be the subject of chapter 4. Within other rapidly evolving viruses
such as influenza [17], a less than 2% change in the amino acid sequence
can cause failure in the cross-reactivity of the polyclonal response to the
influenza vaccine. Despite this, HIV-1 strains within the DRC are classified
into subtypes according to their proximity on a phylogenetic tree to their
global counterparts. Vaccine design strategies are often based on these
subtypes [27, 28]. This would have far-reaching and potentially disastrous
consequences within the region, as such vaccines could not protect the
population against the diversity of strains present.
Figure 1.4 Epicentre of the HIV-1 Group M Pandemic
(A) Maximum-likelihood phylogenies were estimated using HIV-1 sequences obtained from
within the Democratic Republic of Congo and HIV-1 global isolates. Given a phylogeny with
tips labelled according to subtype, the SDR was calculated (red arrows). A null distribution
of SDR values (blue) was obtained by simulating random phylogenies under a model of
exponential growth. (Figure and legend modified from [18]). (B) Neighbour joining
phylogenetic tree re-constructed using sequences sampled solely from within the DRC.
Sequences were obtained from [23, 25, 29]. The bold lettering corresponds to the subtype
designations from the Los Alamos HIV-1 sequence database.
Thirty four CRFs are currently known to exist within group M with
CRF01_AE (South East Asia), CRF02_AG (West and central Africa),
CRF06_cpx (Eastern Europe) and CRF11_cpx (South America) being the
most prevalent, accounting for 8.4% of the represented strains
(http://www.hiv.lanl.gov). This number is steadily increasing with newly
emerging CRFs frequently being discovered. In contrast to the nine
subtypes, with the exception of subtype G [20], their phylogenetic
relationship with each other is dependent on the region of the genome being
examined [19] as strains falling into these clusters are comprised of mosaic
genomes that originated from different ‘parent’ clusters (Fig. 1.5, panels A
and B).
Li et al., observed the first plausible case of recombination when an isolate
from the DRC designated MAL was found to have phylogenetic similarities
with other strains that could only be explained by recombination [30]. The
first detected case of recombination occurring between the subtypes B and F
was documented by Sabino et al., [31]. Robertson et al., produced evidence
that frequent recombination in fact had taken place between each of the
subtypes that were known at that time [19]. More recently it has been
suggested that in the early days of the pandemic at least 37% of the viruses
circulating in Central Africa were recombinant forms [25]. Osmanov et al.,
showed that in the year 2000, 18% of HIV-1 infections globally were caused
by CRFs [32]. Further evidence that CRFs contribute a major part to the
diversity of the world’s HIV-1 pandemic can be found in [33-37].
Many unique recombinant forms (URFs) also exist (http://www.hiv.lanl.gov).
These arise in a similar way to the CRFs when two or more strains undergo
recombination within an individual host to form a mosaic genome. However
in contrast to CRFs they have not spread beyond their initial host [38].
Recent work has shown that globally URFs exist at very high frequencies
[39], for example out of nine samples taken from young adults in Southwest
Tanzania five were URFs between subtypes A and C [40].
It has also been estimated that up to 15% of viral genomes within an
individual who has not been dually infected have been altered by
recombination events that occurred after the initial exposure to the virus [13].
From the number of recombinant forms contributing to the global pandemic
at both an intra- and inter-host level it is clear that recombination contributes
a great deal to the diversity present within the phylogeny of HIV-1 group M
and thus is a major factor influencing diversity that must be considered when
developing any future vaccine.
Figure 1.5 Circulating Recombinant Forms
(A) The genome organisation of CRF03_AB, which was isolated in Kaliningrad and is circulating in
Russian and Ukrainian cities. It is thought that the CRF03_AB epidemic in Kaliningrad
started from a single source arising from a recombination event occurring between the
subtype A strain virus prevalent among IDU in some southern Commonwealth of
Independent States countries, and a subtype B strain of unknown origin [41]. (B) The
genome organisation of CRF06_cpx which has been found to be circulating in Senegal,
Mali, Burkina, Ivory Coast, Nigeria, France and Australia [42]. It illustrates how CRFs can
also be formed by recombination events involving other CRFs.
Group O was first classified as a new HIV-1 group by Charneau et al., after
the discovery that an isolated strain from a French woman, designated
HIV-1VAU, was highly divergent from other strains constituting group M [43].
This divergent strain was found to be more closely related to two other
recently discovered divergent strains, HIV-1ant70 [44] and HIV-1MVP5180
[45]. The group has a 30 - 50% sequence divergence from group M
depending on the genes being compared [46]. For unknown reasons the
group has largely remained endemic to Cameroon where it is responsible for
1% of HIV-1 infections in the far north to 6.3% in the capital [47]. Yamaguchi
et al., showed that in the Northwest of the country the prevalence of the group
may be even lower where it was observed to be responsible for only 0.4% of
HIV-1 infections [48]. Despite the low prevalence, cases have been isolated
in Gabon, Equatorial Guinea, Nigeria, Benin, Ivory Coast, Togo, Senegal,
Niger, Chad, Kenya and Zambia as well as isolated cases in France,
Germany, Belgium, Spain, Norway and the United States where it is thought
that immigration is responsible for its introduction [49, 50]. There is no
evidence that group O has spread epidemiologically beyond single
individuals in these latter countries.
When group O strains are represented on a phylogenetic tree (Fig. 1.6,
Panel A) they lack the long arm branches reaching into distinctive clusters as
is characteristic of the group M phylogeny. Since the group has been
around for a similar length of time to that of the much more prevalent group
M [50-52] could the group’s phylogeny reflect the extent and pattern of
divergence found within the latter as initial studies would suggest [53, 54]?
After a phylogenetic study involving forty nine group O strains, Roques
et al., proposed that there were no distinct clades that were equivalent to the
global group M subtypes [50]. Three broad phylogenetic clusters were
observed based on an analysis of gag and env sequences but, unlike the
subtypes of group M, the apparent clustering appeared to be very weak. It
was suggested that the clustering represented the local transmission
network of infected individuals rather than distinctive subtypes.
Yamaguchi et al., however, using a simplified tree representing the gp160
region (Fig. 1.6, Panel B), suggested that there was definite evidence to
support the identification of five possible clades (A, B, C, D and E) and a
further three possible sub clades within clade A (A1, A2 and A3) [55]. They
proposed that group O did have equivalent clades to the group M subtypes.
Further support of four of the five proposed clades (clades A, B, C and E)
was provided in a study analysing twenty three full length HIV-1 group O
isolates [46]. Clade A was seen to have the highest prevalence with 60% of
group O isolates to date falling within it. It was also shown that
13% of the full length genomes examined had evidence of
recombination. This high proportion of recombinant genomes is unexpected
due to the low prevalence of group O. For a recombinant form to arise a
dual infection must occur [56] and with the very low numbers of group O
infections [57] this seems like it should not be a very frequent event – with
the exception of strains involved in local transmission networks. Each of the
recombinant forms found had segments that were from clade A. The
previous clade D was found to be made up of the recombinant forms from
clade A and unclassified group O strains. Formal classification was not
possible as additional full length genomes are required to complete the clade
definition – although according to the authors this is just a matter of time.
To date HIV-1 group N is the rarest form of the virus and has only been seen
in Cameroon [58]. With the addition of strain 02CM-DJO0131 from [59]
there are presently only three near full length sequences available for this
group.
Figure 1.6 HIV-1 Group O
(A) Neighbour joining tree of all the HIV-1 group O gp160 sequences found within the Los
Alamos HIV database. The bold letters correspond to the cluster designations from [55]. *
represents bootstrap values greater than 90%. (B) “Simplified” phylogenetic trees of group O
and group M env gp160 as described in [55]. Bootstrap values are shown for only the major
branches. The inner circle (diameter, 0.045 distance unit) represents the origin from which
the clusters radiate, whereas the outer circle represents the extent of genetic divergence
(diameter, 0.21 distance unit). The circles were constructed for the group M tree and then
superimposed on the group O tree. For group O roman numerals I to V correspond to
clades A to E. (Panel B taken from [55]).
The Origins of HIV
There are at least 36 distinct lentiviruses that infect African primates [60].
With only one exception, isolated from captive rhesus macaques [61], all
have been found within African apes and monkeys [62]. The exception is
thought to have acquired SIV in captivity by cross-species transmission from
a SIV-infected African primate [61]. Five equidistant phylogenetic lineages
based on phylogenetic analysis of full-length pol protein sequences are
described in Hahn et al., [63]. These lentiviruses are referred to as simian
immunodeficiency virus (SIV) and in their natural primate hosts appear to
cause no adverse effects [63]. Viruses from individual primate species have
been observed to be more closely related to each other than to viruses from
a different species. For example, SIVs from the four sub species of African green
monkeys (Chlorocebus aethiops, C. pygerythrus, C. sabaeus, and C.
tantalus) form monophyletic clusters that each contain strains that are more
closely related to each other than to the SIVs from the other clusters [63-65].
This along with the lack of virulence in the primate host could convincingly
suggest host dependent evolution [63].
However genetic diversity studies indicate that the African Green Monkey
(AGM) clade is millions of years old [66] while the most recent common
ancestor of SIVagm is far younger [67]. A resemblance between host and
pathogen phylogenies could have arisen as a result of preferential host
switching followed by subsequent diversification [68]. In fact Wertheim and
Worobey observed that AGM mitochondrial DNA and SIVagm sequences did
not share a phylogenetic topology that would coincide with long term co-
evolution [69]. Similar patterns of clustering can be observed in the two
subspecies of chimpanzee that harbour these lentiviruses. Viruses isolated from
the chimpanzee Pan troglodytes troglodytes cluster together in the presence
of SIVs isolated from other primate species whereas the SIV isolated from
Pan troglodytes schweinfurthii falls outside of the P. t. troglodytes cluster [63,
70, 71]. Once again this would suggest that this pattern has arisen due to
preferential host switching [68] and not as a result of long term co-evolution.
The first full length SIV observed to have the same genetic structure as HIV-
1 was presented in [72]. This strain was referred to as SIVcpz-gab. The
organisation of the genome was found to be 5’ gag-pol-vif-vpr-tat-rev-vpu-env-nef 3’.
Specifically, a HIV-1 specific vpu gene was present and the vpx gene,
common to HIV-2 and most other SIVs, was absent. In
[73] a second complete genome of a chimpanzee lentivirus, referred to as
SIVcpz-ant, was obtained and found to have a similar genetic organization to
SIVcpz-gab. From protein sequence comparisons SIVcpz-ant was found to
be a closer relation to SIVcpz-gab and to HIV-1 isolates than to members of
the other four major phylogenetic lineages of these primate lentiviruses - but
as an out-group. Sequence identity with SIVcpz-gab was fairly low,
at 72% (pol), 48% (env) and 25% (vpu). Phylogenetic analysis
revealed the possibility that groups O and M emerged as a result of separate
cross species transmission events [71]. The ancestral host primate species
that gave rise to HIV-1 was still unclear however as there appeared to be
very few chimpanzees naturally infected with SIVcpz [74] and thus there was
the possibility that both chimpanzees and humans could have acquired the
virus from a third reservoir species – for example Gorilla gorilla gorilla [75].
After isolating a third full length SIV strain, SIVcpzUS, and determining, by
mitochondrial DNA analysis, the subspecies identity of all known infected
chimpanzees, Gao et al., observed that one subspecies of chimpanzee, P. t.
troglodytes, was the most probable natural ancestral host and reservoir for
HIV-1 [71]. More recently it has been confirmed [76-78] that P. t. troglodytes is
the most probable ancestral host for HIV-1 and that groups M, O and N have
resulted from three separate cross species transmission events (Fig. 1.7).
Keele et al., reported, using newly developed sampling techniques, that up
to 35% of individuals in some communities of wild living P. t. troglodytes
were infected by the virus [77] – a number that is higher than was previously
thought.
In relation to the rare group N, the origins were more obscure as the gag,
pol and the 5’ end of the vif gene were similar to HIV-1 group M while the env,
nef and 3’ end of the vif gene were closely related to SIVcpzUS (Fig. 1.7). It has
been suggested by Gao et al., that the predecessor to this group, first
identified from a strain called YBF30, was created after a recombination event
between two divergent strains of SIVcpz within the primate host before its
jump into the human population [71]. Strong evidence in support of this has
subsequently been provided [70, 79-81]. Many SIV strains are in fact
recombinant strains from earlier ancestors [82-84].
Figure 1.7 Cross Species Transmission
Maximum likelihood tree displaying the relationship between the three groups of HIV-1 and
their SIV relatives across two different regions of the genome. Sequences were obtained
from the Los Alamos HIV sequence database. The * represent separate cross-species
transmission events.
The predecessor to group M (SIVcpzptt) appears to have also been a
recombinant strain between two monkey SIVs, those of red-capped mangabeys
(SIVrcm) and greater spot-nosed monkeys (SIVgsn) [60, 85]. Each of these
SIVs displays similarities to SIVcpzptt but at different locations across the
genome (Fig. 1.8, Panels A and B). SIVgsn was the first monkey virus found
to possess the vpu gene and its 3’ half of the genome was observed to be
closely related to SIVcpzptt [86] while the 5’ end of the SIVcpzptt genome
was found to be closely related to SIVrcm [70, 87]. Both SIV from red-
capped mangabeys and greater spot-nosed monkeys could have ended up
within a single chimpanzee host as it is known that chimpanzees hunt
smaller monkeys for food in overlapping geographical locations within
central Africa [88, 89]. Leitner et al., suggested that, after the divergence of
P. t. schweinfurthii and P. t. troglodytes, there is evidence for
subsequent SIV superinfection followed by further recombination events
within both lineages – resulting in the potential for further differences in their
evolutionary histories than was previously thought [90]. Group O strains also
show evidence of ancient recombination events between SIV strains within
their ancestral primate hosts [91].
When a virus enters a new host species successfully for the first time there
can be a wide range of different outcomes ranging from incidental infection
to epidemic spread [63]. Despite group M’s common ancestor existing within
the human population at a similar time to the common ancestor of group O
[3, 50, 52], the prevalence [47, 48] and geographical location [49] of the
latter remains highly limited. Group O’s prevalence has actually been
gradually decreasing [57]. Reasons for group M’s success in relation to
group O are poorly understood but initial studies suggest that group O may
have a reduced replicative and transmission fitness [92]. After a cross
species transmission the viral population must adapt to a new genetic and
immunologic environment [93]. The mechanism of adaptation is in the form
of alterations in its amino acid sequence which result in: (i) altering the
efficiency of cell entry, (ii) blocking interactions with detrimental host proteins
and (iii) promoting escape from the immune system [94].
Figure 1.8 Recombinant origin of SIVcpz
(A) Maximum likelihood phylogenies of primate lentiviruses pol and env sequences. The
close genetic relationship of SIVcpz to SIVrcm from red-capped mangabeys in pol, and to
SIVgsn, SIVmus, and SIVmon from greater spot-nosed, mustached, and mona monkeys in
Env, are highlighted in green and magenta, respectively. (B) Schematic diagram of the
genomic organization of SIVcpz. Genomic regions are colored according to their genetic
relationship to SIVrcm (green) or the SIVgsn / SIVmus / SIVmon lineage (magenta). Grey
areas in SIVcpz are of unknown origin. Vpu and vpx genes are highlighted. (Diagram and
legend taken from [60]).
Recent evidence of SIVs adapting to new hosts is presented in [94] where
selection within a rhesus macaque host was found to be acting strongly on
the V2 loop after the primate was inoculated with a SIV strain found within
sooty mangabeys. In HIV-1 a host specific adaptation has been observed to
occur at a site within the p17 region [95]. In SIVcpzPtt a methionine (Met) is
present at the site while in each HIV-1 group arginine (Arg) is present. On
passage of the human virus through chimpanzees the Arg reverted to Met.
SIV and HIV sequences have substantial differences in their major proteins
but the relevance of these differences to the biology of the viruses is poorly
understood [95]. As well as host specific adaptation other reasons why
group M is far more prevalent than group O may include the behavioural,
demographic and epidemiological history of transmission [1].
Hahn et al., proposed that the primate ancestral host of HIV-2 was the sooty
mangabey [63]. The different lineages of HIV-2, termed subtypes A–F,
are due to multiple cross-species transfer events from sooty mangabeys
to humans [96]. Geographic evidence also supports the theory that HIV-2
originated from the sooty mangabey as the virus is only common in West
Africa, which is where the habitat for these monkeys is found [96]. The
mode of transmission from sooty mangabeys to humans is thought
to have been hunters becoming infected by the animals that they hunt for food
[97].
Recombination
As has already been discussed, recombination occurs frequently within and
between the different HIV-1 groups, subtypes and recombinant forms.
Recombination provides potential benefits that include: the spread of
beneficial mutations throughout the population, increasing the variability
present within the population and aiding in the exploration of fitness or
“adaptive” peaks within regions of unexplored sequence space that would
otherwise be unreachable by a step-by-step series of mutations [98, 99].
Two categories of recombination based on their epidemiological outcome
occur in relation to the HIV genome. These are single infection
recombination and dual infection recombination.
Single infection recombination occurs when recombination takes place
between two viral genomes within a host that has not been super-infected.
For a rapidly evolving genome this type of recombination could play a role in
disease progression within the host [100] as well as evading the host’s
immune response via the generation of new variant strains. Taylor and
Korber estimated that up to 15% of viral genomes within an individual who
has not been superinfected could be a result of intra-host recombination
events [13]. Dual infection recombination occurs when two divergent viral
genomes undergo recombination within a single host cell. Dual infection can
be a result of co-infection or of superinfection. Co-infection is where a
second virus enters the host at about the same time as the primary infection.
Superinfection on the other hand is where re-infection of a host occurs at
some distant point after the primary infection.
The number of CRFs and URFs that are currently in circulation, and indeed
the existence of the HIV-1 virus from groups M, N and O, could be viewed as
substantial evidence that dual infection does occur within natural
populations. In addition there are numerous publications specifically
documenting the occurrence of superinfection involving viral strains from
different subtypes within individuals [101-105] – and even within different
species of lentivirus such as FIVs [106]. The frequency at which
superinfection occurs globally has not been resolved and most probably
depends on a number of factors including the frequency of prevailing strains
within a region, the evolutionary distance between strains, geography and
the demographics of the host population.
Takehisa et al., observed in Cameroon that 3.01% of individuals were
superinfected [107]. One individual was found to be triply infected with HIV-
1 subtypes A and D as well as a group O strain. In Niger, of the 0.87% of
the population found to be HIV positive, 1.5% were infected by more than
one strain of the virus [101]. Hu et al., observed that in Bangkok,
superinfection with CRF01_AE followed 3.9% of subtype B primary
infections while superinfection with subtype B followed 1.5% of CRF01_AE
primary infections. It has been suggested that superinfection between two
closely related viruses is less likely than that of viruses that are more
distantly related [108] – despite the fact that an individual host is probably
more likely to come into contact with a closely related viral strain due to the
epidemiological patterns of the disease. If this is the case then we could
expect to see groupings of closely related viral strains on a phylogenetic tree
along with recombinant strains falling between those groups as the role that
recombination plays in shuffling viral genomes would become more
important with increasing evolutionary distance.
Within an individual host cell the mechanisms of retroviral recombination
have been well studied. Copy-choice recombination is the most supported
model of how recombinant breakpoints arise [56, 109, 110]. In this model
(Fig. 1.9, Panel A) if reverse transcriptase becomes stalled on the donor
RNA (red) during reverse transcription, nascent DNA can hybridize to the
acceptor RNA (blue). The reverse transcriptase complex can then
dissociate from the donor RNA and rebind to the acceptor RNA. Reverse
transcription will then continue using the new template. Some plausible
alternatives to this model are presented in [111] but all involve strand
switching following a stalling of the reverse transcription complex.
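The template switch at the heart of the copy-choice model can be illustrated with a toy sketch. It is not taken from the cited studies: the function name copy_choice, the use of two aligned templates of equal length and the explicit list of stall positions are all assumptions made for illustration, and the direction of synthesis is ignored. In reality the position of a switch depends on sequence identity and RNA structure rather than on a supplied list.

```python
def copy_choice(donor, acceptor, stall_sites):
    """Build a mosaic copy of two aligned templates.

    Copying starts on the donor. At every position listed in stall_sites
    the (hypothetical) polymerase switches to the other template.
    Returns the mosaic sequence and the template used at each position
    ("D" for donor, "A" for acceptor).
    """
    if len(donor) != len(acceptor):
        raise ValueError("templates must be aligned and of equal length")
    templates = {"donor": donor, "acceptor": acceptor}
    current = "donor"
    stalls = set(stall_sites)
    product, origin = [], []
    for pos in range(len(donor)):
        if pos in stalls:
            # A stall lets the nascent strand transfer to the other template.
            current = "acceptor" if current == "donor" else "donor"
        product.append(templates[current][pos])
        origin.append(current[0].upper())
    return "".join(product), "".join(origin)


# Toy templates, chosen so that the parent of every site is obvious.
mosaic, origin = copy_choice("GGGGGGGGGG", "TTTTTTTTTT", stall_sites=[4])
print(mosaic)  # GGGGTTTTTT
print(origin)  # DDDDAAAAAA
```

A single stall gives one breakpoint; supplying two positions returns copying to the donor and gives a mosaic with an internal acceptor segment.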
It has been observed that breakpoint positions are distributed across the
gp120 region non-randomly, in the absence of natural selection (Fig. 1.9,
Panel B) [112]. This non-random distribution of breakpoints has been
observed in vivo across the entire HIV-1 genome [113, 114]. Positioning of
breakpoints is believed to be influenced by a number of mechanistic factors
including high sequence identity [112, 115], secondary RNA structure [109,
116] and the location of runs of identical nucleotides referred to as
homopolymeric stretches (HPSs) [112, 117]. The latter two increase the
probability of a breakpoint occurring by stalling the reverse transcriptase
complex during DNA synthesis [118-120], which in turn promotes the
induction of strand switching within regions of high sequence identity.
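One way to ask whether breakpoints are distributed at random across a set of regions is a chi-square goodness-of-fit test. The sketch below is a minimal illustration and not the procedure of the cited studies. It assumes that, under a random distribution, the expected number of breakpoints in a region is proportional to the length of that region. The function name chi_square_uniform and the counts used are hypothetical.

```python
def chi_square_uniform(counts, lengths):
    """Chi-square goodness-of-fit statistic for breakpoints per region.

    Assumes that under the null hypothesis of a random distribution the
    expected number of breakpoints in a region is proportional to its
    length.
    """
    if len(counts) != len(lengths):
        raise ValueError("one count is needed per region")
    total_breaks = sum(counts)
    total_length = sum(lengths)
    statistic = 0.0
    for observed, length in zip(counts, lengths):
        expected = total_breaks * length / total_length
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Illustrative numbers only (not data from the cited studies):
# four regions of equal length carrying 40 breakpoints in total.
counts = [25, 5, 5, 5]
lengths = [100, 100, 100, 100]
stat = chi_square_uniform(counts, lengths)

# 7.815 is the standard 5% critical value for 3 degrees of freedom.
CRITICAL_5PCT_DF3 = 7.815
print(stat)                      # 30.0
print(stat > CRITICAL_5PCT_DF3)  # True, so randomness is rejected here
```

With four regions there are three degrees of freedom, so a statistic above 7.815 would reject a random distribution at the 5% level for these made-up counts.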
Figure 1.9 Non Random Recombination
(A) Model of copy-choice recombination. The dotted red line represents the growing DNA
strand. The full red line represents the donor RNA. Blue represents the acceptor RNA
sequence. The green dot is some mechanistic feature, such as a homopolymeric run, on
the donor strand that can potentially stall the reverse transcriptase complex (yellow). (B)
Distribution of inter-subtype and (C) intra-subtype breakpoints along the gp120. The
constant regions are shaded dark grey. The different parental pairs used are shown on the
left. Black numbers give the number of breakpoints identified within each region, and grey
triangles their approximate position. The total number of recombinants analysed for each
pair is given on the right, together with the P-values (in grey) for Chi-square tests for a
random distribution. (Panels B and C taken from [112]).
Unsurprisingly, not all recombinant strains generated within the host cell are
viable. Preliminary findings [121] suggest that in an in vitro multiple cycle
system over 75% of HIV-1 recombinants generated in a dual infection do not
have the ability to replicate. Additionally, if copy choice in vivo does result in
a recombinant genome with the ability to replicate, that mosaic genome will
have to survive selective pressure from the host’s immune response.
Differentiating breakpoints that are under the influence of this
selective pressure from those that are generated based on the
characteristics of the parental sequences is vital to the understanding of
HIV’s evolution in relation to drug resistance [122, 123], generation of
escape mutants [124, 125], disease progression [126] and diversity [19, 25,
32].
From an evolutionary perspective breakpoints that are generated solely
based on the characteristics of the parental sequences and that survive
simply because they are in regions with very weak selective pressure are
fairly unimportant in relation to the virus’s survival within the global
pandemic. Comparing the locations of breakpoints present within viable
recombinant sequences derived from established strains contributing to the
global pandemic [113] to a mechanistic probability of the distribution of
breakpoints in the absence of natural selection would provide an insight into
the selection pressures placed on the HIV genome. This would inevitably
help to identify regions of the genome that have a significant role in the
spread of the virus. This will be the subject of chapter 5.
Dealing with Global HIV-1 Diversity
The rapid ability of HIV-1 to generate extensive diversity leads to persistent
infection within the host despite the actions of the immune response [127-
129]. The amount of diversity complicates the choice of candidates for a
potential vaccine, whether it targets an individual subtype or is cross reactive
between multiple subtypes [130]. In other rapidly evolving RNA virus
genomes, such as influenza [17], a less than 2% change in the amino acid
sequence can cause failure in the cross-reactivity of the polyclonal response
to a vaccine.
Within an infected individual, both HIV neutralizing antibodies and cytotoxic
T cells are generated. Thus there are currently two main approaches to
producing a vaccine: those based around the B-cell response and those that
aim to stimulate the T-cell response. Traditionally, vaccines based around
the B-cell response have been more successful although an effective HIV-1
vaccine has not been produced [131]. More recently there is much
discussion of the possibility of a T-cell based vaccine. It has been observed
that in humans the cytotoxic T lymphocyte (CTL) response plays an
important role in controlling viremia during the course of an HIV-1 infection
[132-135] while in animals the CTL response has an important role in
controlling viremia in relation to SIV infection [27]. It has also been observed
that there is a strong correlation between T cell responses and CD4+ T cell
count [136].
Each T cell has many copies of an identical epitope specific T cell receptor
that recognizes fragments of peptides (epitopes) that are complexed with a
major histocompatibility complex (MHC) protein on the target cell membrane.
Epitopes are formed at random by intra cellular proteolytic processing by
proteasomes. In a non-infected cell these will be self-peptides while in an
infected cell a proportion will be viral peptides. T lymphocytes themselves
fall into two categories: those expressing the CD8 receptor (Tc Cells) and
those expressing the CD4 receptor (Th Cells). CD8+ Tc cells recognize
epitope bound class I MHC proteins while CD4+ Th cells recognize epitope
bound class II MHC proteins. Class I MHC proteins bind with peptides of
between 8 and 10 amino acids in length that are derived from intra cellular
peptides. Class II MHC proteins bind with peptides of between 17 and 22
amino acids in length that are derived from peptides that have been imported
into the host cell. Binding will result in the CTL’s clonal expansion and
differentiation to become a functional T-effector cell. All cells of the body
have MHC I proteins and are thus under constant surveillance by CD8+ T
cells whose main function after activation is to destroy host cells that are
presenting viral peptides. The major function of activated CD4+ T cells is to
regulate the activation of all T and B lymphocytes. The immune response
depends on this help which is why their count can be used to track disease
progression.
The aim of producing a vaccine based on the T cell response is to produce a
polyvalent CD8+ T-cell based vaccine that is broadly immunogenic in
relation to the diversity present. Epitopes used by CD8+ T-cells are on
average nine residues in length and so will be hereafter referred to as
“nine-mers”. A list of known epitopes is maintained at the Los Alamos HIV
sequence database, many of which have been characterized in [137]. The
list however is far from complete and basing vaccine design strategies solely
around it could be largely ineffective - although incorporation of these
epitopes into existing strategies could improve results.
Traditionally the main approaches to improving immunogenic
coverage have focused on the use of consensus sequences, ancestral
sequences, center of the tree (COT) sequences and geographically
clustered sequences [28]. A consensus sequence (Fig. 1.10, yellow) is a
sequence that at each site contains the character that was most common
amongst the aligned sequences that it was generated to represent. This
approach for generating an artificial sequence implicitly minimizes the
genetic distance between itself and circulating strains of the virus. The COT
(Fig. 1.10, circle) sequence represents the point on a phylogenetic tree
where the average evolutionary distance to each tip is minimized [138].
COT thus provides coverage that is at least similar to the consensus or
ancestral sequences and it has the added benefit of being implicitly more
similar than the ancestral sequence to viral lineages that have evolved more
rapidly and so may be more useful in relation to asymmetric trees. Use of
such sequences in relation to vaccine design has generated some promising
results, for example recently in [27] a group M env consensus strain (CON6),
based on sequences available from the Los Alamos HIV database in 1999,
was created and it was found to induce a greater number of T-cell epitope
responses than any single wild-type strain (from subtypes A, B and C).
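The construction of a consensus sequence described above can be sketched in a few lines. This is only an illustration under simple assumptions: the alignment is a list of equal-length strings, gaps are treated like any other character, and ties are broken in favour of the character seen first in the column. The function names consensus and hamming are hypothetical and this is not the procedure used to build CON6.

```python
from collections import Counter


def consensus(alignment):
    """Return the most common character at each column of an alignment.

    Ties are broken in favour of the character seen first in the column,
    an arbitrary choice made for this sketch.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to equal length")
    result = []
    for column in zip(*alignment):
        # most_common(1) returns [(character, count)] for the column.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


def hamming(a, b):
    """Number of mismatched sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


# Toy nucleotide alignment, for illustration only.
alignment = ["ACGT", "ACGA", "TCGA", "ACTA"]
con = consensus(alignment)
print(con)                                       # ACGA
print([hamming(con, seq) for seq in alignment])  # [1, 0, 1, 1]
```

In this toy case the consensus differs from every input sequence at no more than one site, which reflects the way the majority rule keeps the artificial sequence close to the strains it represents.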
A more novel approach involves the use of algorithms to improve nine-mer
coverage across an alignment in the hope that some of these high covering
nine-mers will be viable epitopes that can be targeted by the host’s immune
response [139, 140]. Generally coverage describes the number of times that
an individual nine-mer exists within a sequence alignment independent of its
location. The local coverage score provided by an individual nine-mer within
a sequence is the coverage that is provided across all of the sequences
within an alignment at the site where the nine-mer occurs. The mean local
coverage provided by the sequence is the average of all the local covering
scores provided by each nine-mer across the sequence.
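The three quantities defined above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in [139, 140] or in chapter 6: the alignment is taken to be ungapped, local coverage is expressed as a fraction of sequences rather than a raw count, and all function names are hypothetical.

```python
from collections import Counter

K = 9  # nine-mer length used in the text

def kmers_by_site(seq, k=K):
    """List the k-mers of a sequence in site order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def overall_coverage(alignment, k=K):
    """Count every occurrence of each k-mer, independent of location."""
    counts = Counter()
    for seq in alignment:
        counts.update(kmers_by_site(seq, k))
    return counts

def local_coverage(alignment, site, kmer):
    """Fraction of aligned sequences that carry `kmer` at `site`."""
    k = len(kmer)
    hits = sum(1 for seq in alignment if seq[site:site + k] == kmer)
    return hits / len(alignment)

def mean_local_coverage(candidate, alignment, k=K):
    """Average local coverage over all k-mers of a candidate sequence."""
    windows = kmers_by_site(candidate, k)
    if not windows:
        raise ValueError("candidate is shorter than k")
    scores = [local_coverage(alignment, i, w) for i, w in enumerate(windows)]
    return sum(scores) / len(scores)

# Toy alignment with three-mers for brevity.
aln = ["AAAC", "AAAT", "AAAC"]
print(overall_coverage(aln, k=3)["AAA"])  # 3
print(local_coverage(aln, 1, "AAC"))
print(mean_local_coverage("AAAC", aln, k=3))
```

The parameter `k` is exposed only so that short toy examples can be used; the text itself works with nine-mers throughout.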
The general definition of coverage, although currently in use [139, 140],
could potentially favor nine-mers with poor local coverage scores
over those with high local coverage scores. For example, across a given
alignment there could exist a unique nine-mer permutation that is
present at only a single site. This nine-mer could provide a very high local
coverage score. Across other sites, numerous instances of another unique
nine-mer permutation may occur. However, these may all provide very poor
local coverage scores. When combined, the coverage score of the latter
could outweigh the coverage score provided by the instance of the single
high scoring nine-mer. Optimizing the local coverage scores, instead of
selecting nine-mers based on overall coverage score, avoids this situation as
the favorable nine-mer would always be chosen over multiple occurrences of
an unfavorable one. This is the aim of chapter 6 (Fig. 6.5).
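The situation described above can be reproduced on a small hypothetical alignment. The sketch below uses three-mers and raw counts for brevity, and the sequences are invented for illustration; it is not the algorithm of chapter 6.

```python
from collections import Counter

def site_kmer_counts(alignment, k):
    """Map (site, k-mer) to the number of sequences with that k-mer there."""
    counts = Counter()
    for seq in alignment:
        for site in range(len(seq) - k + 1):
            counts[(site, seq[site:site + k])] += 1
    return counts

def overall_and_best_local(alignment, kmer):
    """Return (overall coverage, highest single-site coverage) for a k-mer."""
    counts = site_kmer_counts(alignment, len(kmer))
    per_site = [n for (site, w), n in counts.items() if w == kmer]
    return sum(per_site), max(per_site, default=0)

# "ACD" occurs at one site in three sequences; "QRS" occurs once at each
# of four different sites.
alignment = [
    "ACDKLM", "ACDLMK", "ACDMKL",
    "QRSTVW", "TQRSVW", "TVQRSW", "TVWQRS",
]
for kmer in ("ACD", "QRS"):
    print(kmer, overall_and_best_local(alignment, kmer))
# ACD (3, 3)
# QRS (4, 1)
```

Ranking by overall coverage would prefer "QRS" (4 against 3), whereas ranking by local coverage prefers "ACD" (3 of 7 sequences at its site against 1 of 7), which is the behavior argued for in the text.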
Figure 1.10 Four Phylogenetic shapes
Four possible phylogenetic shapes and the resulting reconstructed sequences. The
Ancestor (Anc) and the Center of the Tree (COT) can only fall on an evolutionary path (i.e.,
on a branch of the phylogeny), whereas the consensus (Con) may not. (Figure and legend
taken from [138]).
Co-receptor Usage
HIV-1 viruses can be classified into two phenotypes, referred to as
syncytium inducing (SI) and non-syncytium inducing (NSI) [141]. These
phenotypes have different cellular tropisms and appear during different
stages of infection [142]. The macrophage-tropic NSI phenotype is
predominant during the early stages of infection [143], while the T-cell-tropic
SI phenotype is often observed later, around the time of
progression to AIDS [144]. The gp120 gene can be divided into alternating
constant and variable regions referred to as C1-C5 and V1-V5 [145].
The variable regions encode exposed surface loops [146]. Amino acid
sequence variations giving rise to a more positive charge within gp120 at
sites 306 and 320 (sites 11 and 25 of the V3 loop) have been strongly
associated with the more virulent SI phenotype [147-149].
With the discovery of the involvement of the primary co-receptors, CXCR4 and
CCR5, in cellular tropism [150, 151] and the mapping of their usage to the SI
and NSI phenotypes respectively [152, 153], viral phenotypes are
now often referred to as X4 (CXCR4 tropic), R5 (CCR5 tropic) and R5X4
(viruses able to utilize both co-receptors) [151, 154]. The early dominance of
the CCR5-using phenotype may be due to a number of factors including
selection at the point of transmission [155-157], a higher cytopathicity of SI
variants resulting in host cells with a shorter life span [158] as well as
differing fitness levels of the two variants at different stages of disease
progression [159]. The mechanisms that drive the change from the
CCR5-using phenotype to the CXCR4-using phenotype during the progression
of infection within the host are unclear, but a number of theories, including the
transmission-mutation hypothesis, the target-cell-based hypothesis and the
immune-system-based hypothesis, have been proposed [160].
These co-receptors provide a potentially useful and novel target for
antiretroviral therapy [161] and could help to overcome some of the
problems associated with highly active antiretroviral therapy (HAART), such
as patient adherence to treatment [162] and the emergence of
drug-resistant variants [163, 164]. A number of small positively charged
compounds, such as T22 and AMD3100, have been observed to block
CXCR4-tropic HIV-1 cell entry [165-167]. Inhibiting CXCR4 usage in this
way could theoretically prove useful during the later stages of infection to
slow down or potentially stall disease progression. However, CXCR4 is
critical for haematopoiesis, cardiac function and cerebellar development
[168], and so using it as a target could prove difficult due to severe side
effects [169].
Targeting the CCR5 co-receptor is more promising, as a natural
polymorphism (CCR5Δ32) with no apparent deleterious effects exists within
humans. It results in either reduced expression or complete absence of the
co-receptor, depending on whether an individual is heterozygous or homozygous for the
mutation [170]. Individuals who are heterozygous for this mutation were
found to have slower disease progression to AIDS, while individuals who
are homozygous for the mutation show strong resistance to HIV-1 [170-174].
Maraviroc is a potent, orally bioavailable small molecule developed by Pfizer
that binds to the CCR5 receptor, making it unavailable for HIV-1 cell entry
[175, 176] and thus simulating the CCR5Δ32 phenotype. However, in trials the
presence of the CXCR4-using phenotype before treatment with Maraviroc is
predictive of failure, as this phenotype will emerge once treatment begins
[177]. Determining the viral phenotypes present within an individual is thus
of major importance in determining the probability of treatment success.
Three common genomic approaches to phenotype testing are based on
sequence variation within the V3 loop. The first involves looking for the
presence of positively charged amino acid residues, histidine (H), lysine (K)
and arginine (R), at sites 11 or 24 and 25. These sites come into contact with
each other on the surface of the tertiary protein structure to form an exposed
charged patch [178]. When no positive residues are present, it is thought
that the electrostatic interaction with the CCR5 co-receptor is stabilized [178,
179]. This test has traditionally been referred to as the charge rule and it
demonstrates a 94% accuracy in determining viral phenotype on sequences
of known phenotype [178].
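A minimal sketch of the charge rule is given below. It reads "sites 11 or 24 and 25" as "a positive residue at any of the three sites predicts CXCR4 usage", which is one interpretation of the rule. It also assumes an ungapped V3 amino acid sequence numbered from 1. The function name and the 35-residue example sequence are illustrative only.

```python
POSITIVE = set("HKR")  # histidine, lysine, arginine

def charge_rule(v3, sites=(11, 24, 25)):
    """Predict co-receptor usage from positively charged V3 sites.

    Returns the predicted phenotype and the 1-based sites that carry a
    positive residue. The choice of `sites` is an assumption.
    """
    v3 = v3.upper()
    if len(v3) < max(sites):
        raise ValueError("V3 sequence is shorter than the highest site checked")
    charged = [s for s in sites if v3[s - 1] in POSITIVE]
    phenotype = "CXCR4-using" if charged else "CCR5-using"
    return phenotype, charged

# Illustrative V3 sequence with S, G and E at sites 11, 24 and 25.
v3 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
print(charge_rule(v3))  # ('CCR5-using', [])

# Substituting arginine at site 11 changes the prediction.
mutant = v3[:10] + "R" + v3[11:]
print(charge_rule(mutant))  # ('CXCR4-using', [11])
```

Only three sites are inspected, so any substitution elsewhere in V3 leaves the prediction unchanged.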
A second method of determining phenotype involves looking at the net
charge across the entire V3 region using position-specific scoring matrices
(PSSM) [180]. Two reference groups of aligned sequences, each made up
of subtype B V3 sequences with a specific phenotype, are compared to an
input V3 sequence. A matrix of log likelihood ratio scores for each site of the
input sequence is generated which reflects the difference in abundance of a
particular amino acid within the CXCR4-using dataset to that within the
CCR5 dataset. For the input sequence, the log likelihood ratio for the
particular amino acid at each site is then read from the matrix. The PSSM
score for the input sequence is the sum of the score at each site. This
method of predicting phenotype provided a sensitivity of 84% and a
specificity of 96% in detecting the CXCR4-using phenotype when performed
on V3 sequences with known phenotype [180]. A web interface for the
method is available at:
http://ubik.microbiol.washington.edu/computing/pssm/. Recently, reference
datasets for subtype C have been added to allow for subtype C phenotype
predictions [181]. For the latter a specificity of 94% and a sensitivity of 75%
were achieved on sequences of known phenotype.
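The scoring scheme described above can be sketched as follows. This is a simplified illustration and not the published PSSM implementation [180]: the pseudocount, the use of natural logarithms, the three-residue toy reference sets and all names are assumptions, and no calibrated score cutoff is applied.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus a gap character

def site_frequencies(alignment, pseudocount=1.0):
    """Per-site residue frequencies; the pseudocount avoids log(0)."""
    freqs = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        freqs.append({aa: (counts.get(aa, 0) + pseudocount) / total
                      for aa in AMINO_ACIDS})
    return freqs

def build_pssm(x4_alignment, r5_alignment, pseudocount=1.0):
    """Log likelihood ratio of each residue at each site (X4 over R5)."""
    x4 = site_frequencies(x4_alignment, pseudocount)
    r5 = site_frequencies(r5_alignment, pseudocount)
    return [{aa: math.log(fx[aa] / fr[aa]) for aa in AMINO_ACIDS}
            for fx, fr in zip(x4, r5)]

def pssm_score(sequence, pssm):
    """Sum the per-site scores of the residues in the input sequence."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence must be aligned to the reference sets")
    return sum(site[aa] for aa, site in zip(sequence, pssm))

# Hypothetical three-site reference sets.
x4_ref = ["RKR", "RKK", "RRR"]
r5_ref = ["SGE", "SGD", "SGE"]
pssm = build_pssm(x4_ref, r5_ref)
print(pssm_score("RKR", pssm) > 0)  # True: resembles the CXCR4-using set
print(pssm_score("SGE", pssm) < 0)  # True: resembles the CCR5 set
```

In this sketch a positive score simply means the input looks more like the CXCR4-using reference set than the CCR5 set at the sites considered.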
A third common method of genotypically determining phenotype is referred
to as geno2pheno [182]. Here CXCR4 and CCR5 training datasets are
again used in a position-wise comparison, but factors such as the occurrence of
specific amino acids and glycosylation sites are taken into
account in addition to charge. These data are then given to a Support Vector Machine
in order to train the prediction model. A query V3 sequence can then be
aligned and a prediction given using the learned model. An accuracy of
71.8% is achieved when predicting the CXCR4-using phenotype with this
method [183].
Although the charge rule works well, the accuracy and nature of geno2pheno
and the PSSM test suggest that there may be other sites within the V3
region that contribute to co-receptor tropism. A proposed advantage of the
latter two methods over the charge rule is that they may provide an opportunity for early
detection of the CXCR4-using phenotype, as more subtle alterations within
the V3 sequence can be detected. However, a disadvantage is that full-length
V3 sequences are required and, with new sequencing
technologies, such full-length V3 sequences are not always available [184].
Pyrosequencing is an alternative to the more traditional
Sanger sequencing [185] for sequencing DNA [186]. It is a non-gel-based
sequencing technology that is based on the detection of nucleotide
incorporation (Fig. 1.11). Nucleotides are added via primer-directed
polymerase extension. Initially the DNA fragment of interest is incubated
with four enzymes: DNA polymerase, ATP sulfurylase, firefly luciferase, and
a nucleotide-degrading enzyme. The four nucleotide bases are then added
and removed in a cyclic order. If a nucleotide is complementary to the
next base in the template and a base pair is created by DNA polymerase, a
pyrophosphate (PPi) is released. This is then converted to ATP by ATP
sulfurylase, which is in turn used by firefly luciferase to generate detectable
light. The nucleotide-degrading enzyme is used to rapidly clear nucleotides
that have not been incorporated into the growing sequence. A modification
of this technology [184] can be used to rapidly produce unprecedented
quantities (in the order of hundreds of thousands of reads) of accurate [187]
genomic sequence data (Chapter 7).
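The cyclic dispensation described above can be mimicked with a small idealized simulation. The sketch assumes that the template is given in the order in which the polymerase reads it, that the light signal is exactly proportional to the number of nucleotides incorporated in one dispensation, and that there is no noise or incomplete extension. The flow order "TACG" and the function name are illustrative choices.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def simulate_flowgram(template, flow_order="TACG"):
    """Return (flows, read) for an idealized pyrosequencing run.

    `flows` lists (dispensed base, number incorporated) per dispensation;
    the second value stands in for the light signal. `read` is the
    synthesized strand reconstructed from those signals.
    """
    template = template.upper()
    if set(flow_order) != set("ACGT"):
        raise ValueError("flow_order must dispense each of A, C, G and T")
    pos, flows = 0, []
    while pos < len(template):
        for base in flow_order:
            incorporated = 0
            # Extend while the dispensed base pairs with the next template base.
            while pos < len(template) and COMPLEMENT[template[pos]] == base:
                incorporated += 1
                pos += 1
            flows.append((base, incorporated))
            if pos == len(template):
                break
    read = "".join(base * n for base, n in flows)
    return flows, read

flows, read = simulate_flowgram("AACGT")
print(flows[:4])  # [('T', 2), ('A', 0), ('C', 0), ('G', 1)]
print(read)       # TTGCA
```

A run of identical template bases is incorporated in a single dispensation, so its length has to be inferred from signal intensity alone.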
Figure 1.11 How Pyrosequencing Works
See text for details. Modified from [188].
Such “ultra-deep” sequencing in conjunction with genomic phenotype testing
permits the quantification of the range of sequence variants present within
an individual HIV-1 sample [189]. This can be used to track the emergence
of low frequency viral variants [188, 190]. Identification of low-frequency
variants is of particular importance in the context of co-receptor usage
(specifically CCR5 versus CXCR4-using phenotypes) where specific amino
acids are associated with receptor usage. This is particularly relevant in
relation to Pfizer’s new antiviral drug, Maraviroc [177]. However, there are
currently few standard automated protocols for handling and analyzing the
large numbers of short (100-200 bp) segments produced by this technology
[191, 192], and these depend on a large amount of computational power and
storage facilities. The development of such a protocol will be the subject of
chapter 7.
Remaining Chapters
The remaining chapters of this thesis, with the exception of chapters 2 and 8,
will discuss in detail the topics introduced within this introduction chapter. An
overview of each of these chapters can be found within their accompanying
abstracts. Publication details are as follows:
Chapter 3: CTree - comparison of clusters between phylogenetic trees made easy
John Archer and David L. Robertson
Bioinformatics. (2007) 23(21):2952-3
JA wrote the program and prepared the manuscript. DLR
revised the manuscript and guided the program's development.
Both JA and DLR devised the project.
Chapter 4: Understanding the Diversification of HIV-1 groups M and O
John Archer and David L. Robertson
AIDS. 2007 Aug 20;21(13):1693-700.
JA performed the analysis and wrote the initial draft of the
manuscript. DLR revised the manuscript and guided the
analysis. Both JA and DLR devised the project.
Chapter 5: Prediction of HIV-1 Recombination Breakpoint Location
PLOS Computational Biology (accepted pending minor
alterations to the text)
John Archer, John W. Pinney, Etienne Simon-Loriere, Jun
Fan, Eric J. Arts, Matteo Negroni and David L. Robertson
JA performed the analysis and prepared the initial manuscript
under the guidance of DLR. JP revised the manuscript. ESL,
JF, EJA and MN provided data for the analysis. Both JA and
DLR devised the project.
Chapter 6: A Strategy for Identifying Significant HIV Sequence Diversity Using Structural Constraints
PLOS Computational Biology (submitted)
John Archer, Simon G. Williams, Simon C. Lovell, David L.
Robertson
JA developed and implemented the algorithm for generating
sequence constructs from the input alignments. JA also
performed the coverage analysis for the paper. SW developed
the model for reducing sequence diversity within the
alignments. SL and DLR guided the project and revised the
manuscript.
Chapter 7: Detection of Low Frequency CXCR4-Using HIV-1 Strains with Ultra-Deep Pyrosequencing
(In preparation)
John Archer, Marilyn Lewis, David L. Robertson
JA developed and implemented the software for analyzing the
data. JA wrote the manuscript. ML provided the data and
helpful discussion. DLR revised the manuscript and guided the
project. Both JA and DLR devised the protocol for detecting
low frequency variants within viral populations using the
pyrosequenced data.
Chapter 2 is an introduction to the bioinformatics procedures, such as
sequence alignments, models of sequence evolution and phylogenetic trees,
that will be used throughout the thesis. Chapter 8 is the concluding chapter,
where the research chapters are tied together and where some final
comments about the evolution of HIV-1 are presented.
References
1. Grenfell, B.T., et al., Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 2004. 303(5656): p. 327-32.
2. UNAIDS, AIDS Epidemic Update 2007. 2007.
3. Korber, B., et al., Timing the ancestor of the HIV-1 pandemic strains. Science, 2000. 288(5472): p. 1789-96.
4. Foley, B., An Overview of the molecular phylogeny of lentiviruses. 2000, HIV sequence compendium.
5. Coffin, J.M., S.H. Hughes, and H.E. Varmus, Retroviruses. 1997.
6. Wei, X., et al., Viral dynamics in human immunodeficiency virus type 1 infection. Nature, 1995. 373(6510): p. 117-22.
7. Wolinsky, S.M., et al., Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science, 1996. 272(5261): p. 537-42.
8. Malim, M.H. and M. Emerman, HIV-1 sequence variation: drift, shift, and attenuation. Cell, 2001. 104(4): p. 469-72.
9. Jetzt, A.E., et al., High rate of recombination throughout the human immunodeficiency virus type 1 genome. J Virol, 2000. 74(3): p. 1234-40.
10. Robertson, D.L., B.H. Hahn, and P.M. Sharp, Recombination in AIDS viruses. J Mol Evol, 1995. 40(3): p. 249-59.
11. Yang, W., J.P. Bielawski, and Z. Yang, Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. J Mol Evol, 2003. 57(2): p. 212-21.
12. Choisy, M., et al., Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol, 2004. 78(4): p. 1962-70.
13. Taylor, J.E. and B.T. Korber, HIV-1 intra-subtype superinfection rates: estimates using a structured coalescent with recombination. Infect Genet Evol, 2005. 5(1): p. 85-95.
14. Archer, J. and D.L. Robertson, Understanding the diversification of HIV-1 groups M and O. Aids, 2007. 21(13): p. 1693-700.
15. Williams, S.G., Archer, J., Lovell, S.C., Robertson, D.L, A rational strategy for HIV vaccine design based on the prediction of sequence evolution. PLoS Comput Biol, 2008 (accepted pending alterations).
16. Robertson, D.L., et al., HIV-1 nomenclature proposal. Science, 2000. 288(5463): p. 55-6.
17. Korber, B., et al., Evolutionary and immunological implications of contemporary HIV-1 variation. Br Med Bull, 2001. 58: p. 19-42.
18. Rambaut, A., et al., Human immunodeficiency virus. Phylogeny and the origin of HIV-1. Nature, 2001. 410(6832): p. 1047-8.
19. Robertson, D.L., et al., Recombination in HIV-1. Nature, 1995. 374(6518): p. 124-6.
20. Abecasis, A.B., et al., Recombination confounds the early evolutionary history of human immunodeficiency virus type 1: subtype G is a circulating recombinant form. J Virol, 2007. 81(16): p. 8543-51.
21. Thomson, M.M., L. Perez-Alvarez, and R. Najera, Molecular epidemiology of HIV-1 genetic forms and its significance for vaccine development and therapy. Lancet Infect Dis, 2002. 2(8): p. 461-71.
22. Janssens, W., A. Buve, and J.N. Nkengasong, The puzzle of HIV-1 subtypes in Africa. Aids, 1997. 11(6): p. 705-12.
23. Vidal, N., et al., Unprecedented degree of human immunodeficiency virus type 1 (HIV-1) group M genetic diversity in the Democratic Republic of Congo suggests that the HIV-1 pandemic originated in Central Africa. J Virol, 2000. 74(22): p. 10498-507.
24. Gilbert, M.T., et al., The emergence of HIV/AIDS in the Americas and beyond. Proc Natl Acad Sci U S A, 2007. 104(47): p. 18566-70.
25. Kalish, M.L., et al., Recombinant viruses and early global HIV-1 epidemic. Emerg Infect Dis, 2004. 10(7): p. 1227-34.
26. Worobey, M., The Origins and Diversification of HIV. Global HIV/AIDS Medicine, 2007: p. 13–21.
27. Weaver, E.A., et al., Cross-subtype T-cell immune responses induced by a human immunodeficiency virus type 1 group M consensus env immunogen. J Virol, 2006. 80(14): p. 6745-56.
28. Gaschen, B., et al., Diversity considerations in HIV-1 vaccine selection. Science, 2002. 296(5577): p. 2354-60.
29. Vidal, N., et al., Distribution of HIV-1 variants in the Democratic Republic of Congo suggests increase of subtype C in Kinshasa between 1997 and 2002. J Acquir Immune Defic Syndr, 2005. 40(4): p. 456-62.
30. Li, W.H., M. Tanimura, and P.M. Sharp, Rates and dates of divergence between AIDS virus nucleotide sequences. Mol Biol Evol, 1988. 5(4): p. 313-30.
31. Sabino, E.C., et al., Identification of human immunodeficiency virus type 1 envelope genes recombinant between subtypes B and F in two epidemiologically linked individuals from Brazil. J Virol, 1994. 68(10): p. 6340-6.
32. Osmanov, S., et al., Estimated global distribution and regional spread of HIV-1 genetic subtypes in the year 2000. J Acquir Immune Defic Syndr, 2002. 29(2): p. 184-90.
33. Dowling, W.E., et al., Forty-one near full-length HIV-1 sequences from Kenya reveal an epidemic of subtype A and A-containing recombinants. Aids, 2002. 16(13): p. 1809-20.
34. McCutchan, F.E., et al., HIV type 1 circulating recombinant form CRF09_cpx from west Africa combines subtypes A, F, G, and may share ancestors with CRF02_AG and Z321. AIDS Res Hum Retroviruses, 2004. 20(8): p. 819-26.
35. Monno, L., et al., HIV-1 subtypes and circulating recombinant forms (CRFs) from HIV-infected patients residing in two regions of central and southern Italy. J Med Virol, 2005. 75(4): p. 483-90.
36. Ndembi, N., et al., Genetic diversity of HIV type 1 in rural eastern Cameroon. J Acquir Immune Defic Syndr, 2004. 37(5): p. 1641-50.
37. Tee, K.K., et al., Emergence of HIV-1 CRF01_AE/B unique recombinant forms in Kuala Lumpur, Malaysia. Aids, 2005. 19(2): p. 119-26.
38. Takebe, Y., S. Kusagawa, and K. Motomura, Molecular epidemiology of HIV: tracking AIDS pandemic. Pediatr Int, 2004. 46(2): p. 236-44.
39. Najera, R., et al., Genetic recombination and its role in the development of the HIV-1 pandemic. Aids, 2002. 16 Suppl 4: p. S3-16.
40. Hoelscher, M., et al., High proportion of unrelated HIV-1 intersubtype recombinants in the Mbeya region of southwest Tanzania. Aids, 2001. 15(12): p. 1461-70.
41. Liitsola, K., et al., HIV-1 genetic subtype A/B recombinant strain causing an explosive epidemic in injecting drug users in Kaliningrad. Aids, 1998. 12(14): p. 1907-19.
42. Montavon, C., et al., CRF06-cpx: a new circulating recombinant form of HIV-1 in West Africa involving subtypes A, G, K, and J. J Acquir Immune Defic Syndr, 2002. 29(5): p. 522-30.
43. Charneau, P., et al., Isolation and envelope sequence of a highly divergent HIV-1 isolate: definition of a new HIV-1 group. Virology, 1994. 205(1): p. 247-53.
44. Vanden Haesevelde, M., et al., Genomic cloning and complete sequence analysis of a highly divergent African human immunodeficiency virus isolate. J Virol, 1994. 68(3): p. 1586-96.
45. Gurtler, L.G., et al., A new subtype of human immunodeficiency virus type 1 (MVP-5180) from Cameroon. J Virol, 1994. 68(3): p. 1581-5.
46. Yamaguchi, J., et al., Near full-length genomes of 15 HIV type 1 group O isolates. AIDS Res Hum Retroviruses, 2003. 19(11): p. 979-88.
47. Mauclere, P., et al., Serological and virological characterization of HIV-1 group O infection in Cameroon. Aids, 1997. 11(4): p. 445-53.
48. Yamaguchi, J., et al., HIV infections in northwestern Cameroon: identification of HIV type 1 group O and dual HIV type 1 group M and group O infections. AIDS Res Hum Retroviruses, 2004. 20(9): p. 944-57.
49. Quinones-Mateu, M.E., S.C. Ball, and E.J. Arts, Role of Human Immunodeficiency Virus Type 1 Group O in the AIDS Pandemic. AIDS Rev., 2000. 2: p. 190 - 202.
50. Roques, P., et al., Phylogenetic analysis of 49 newly derived HIV-1 group O strains: high viral diversity but no group M-like subtype structure. Virology, 2002. 302(2): p. 259-73.
51. Jonassen, T.O., et al., Sequence analysis of HIV-1 group O from Norwegian patients infected in the 1960s. Virology, 1997. 231(1): p. 43-7.
52. Lemey, P., et al., The molecular population genetics of HIV-1 group O. Genetics, 2004. 167(3): p. 1059-68.
53. Janssens, W., et al., Interpatient genetic variability of HIV-1 group O. Aids, 1999. 13(1): p. 41-8.
54. Loussert-Ajaka, I., et al., Variability of human immunodeficiency virus type 1 group O strains isolated from Cameroonian patients living in France. J Virol, 1995. 69(9): p. 5640-9.
55. Yamaguchi, J., et al., Evaluation of HIV type 1 group O isolates: identification of five phylogenetic clusters. AIDS Res Hum Retroviruses, 2002. 18(4): p. 269-82.
56. Hu, W.S. and H.M. Temin, Retroviral recombination and reverse transcription. Science, 1990. 250(4985): p. 1227-33.
57. Ayouba, A., et al., HIV-1 group O infection in Cameroon, 1986 to 1998. Emerg Infect Dis, 2001. 7(3): p. 466-7.
58. Ayouba, A., et al., HIV-1 group N among HIV-1-seropositive individuals in Cameroon. Aids, 2000. 14(16): p. 2623-5.
59. Bodelle, P., et al., Identification and genomic sequence of an HIV type 1 group N isolate from Cameroon. AIDS Res Hum Retroviruses, 2004. 20(8): p. 902-8.
60. Sharp, P.M., G.M. Shaw, and B.H. Hahn, Simian immunodeficiency virus infection of chimpanzees. J Virol, 2005. 79(7): p. 3891-902.
61. Hirsch, V.M., et al., An African primate lentivirus (SIVsm) closely related to HIV-2. Nature, 1989. 339(6223): p. 389-92.
62. Holmes, E.C., On the origin and evolution of the human immunodeficiency virus (HIV). Biol Rev Camb Philos Soc, 2001. 76(2): p. 239-54.
63. Hahn, B.H., et al., AIDS as a zoonosis: scientific and public health implications. Science, 2000. 287(5453): p. 607-14.
64. Allan, J.S., et al., Species-specific diversity among simian immunodeficiency viruses from African green monkeys. J Virol, 1991. 65(6): p. 2816-28.
65. Allan, J.S., et al., Isolation and characterization of simian immunodeficiency viruses from two subspecies of African green monkeys. AIDS Res Hum Retroviruses, 1990. 6(3): p. 275-85.
66. Shimada, M.K., K. Terao, and T. Shotake, Mitochondrial sequence diversity within a subspecies of savanna monkeys (Cercopithecus aethiops) is similar to that between subspecies. J Hered, 2002. 93(1): p. 9-18.
67. Sharp, P.M., et al., Origins and evolution of AIDS viruses: estimating the time-scale. Biochem Soc Trans, 2000. 28(2): p. 275-82.
68. Charleston, M.A. and D.L. Robertson, Preferential host switching by primate lentiviruses can account for phylogenetic similarity with the primate phylogeny. Syst Biol, 2002. 51(3): p. 528-35.
69. Wertheim, J.O. and M. Worobey, A Challenge to the Ancient Origin of SIVagm Based on African Green Monkey Mitochondrial Genomes. PLoS Pathog, 2007. 3(7): p. e95.
70. Corbet, S., et al., env sequences of simian immunodeficiency viruses from chimpanzees in Cameroon are strongly related to those of human immunodeficiency virus group N from the same geographic area. J Virol, 2000. 74(1): p. 529-34.
71. Gao, F., et al., Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature, 1999. 397(6718): p. 436-41.
72. Huet, T., et al., Genetic organization of a chimpanzee lentivirus related to HIV-1. Nature, 1990. 345(6273): p. 356-9.
73. Vanden Haesevelde, M.M., et al., Sequence analysis of a highly divergent HIV-1-related lentivirus isolated from a wild captured chimpanzee. Virology, 1996. 221(2): p. 346-50.
74. Peeters, M., et al., Isolation and characterization of a new chimpanzee lentivirus (simian immunodeficiency virus isolate cpz-ant) from a wild-captured chimpanzee. Aids, 1992. 6(5): p. 447-51.
75. Van Heuverswyn, F., et al., Human immunodeficiency viruses: SIV infection in wild gorillas. Nature, 2006. 444(7116): p. 164.
76. Bibollet-Ruche, F., et al., Complete genome analysis of one of the earliest SIVcpzPtt strains from Gabon (SIVcpzGAB2). AIDS Res Hum Retroviruses, 2004. 20(12): p. 1377-81.
77. Keele, B.F., et al., Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science, 2006. 313(5786): p. 523-6.
78. Nerrienet, E., et al., Simian immunodeficiency virus infection in wild-caught chimpanzees from Cameroon. J Virol, 2005. 79(2): p. 1312-9.
79. Fonjungo, P.N., et al., Molecular screening for HIV-1 group N and simian immunodeficiency virus cpz-like virus infections in Cameroon. Aids, 2000. 14(6): p. 750-2.
80. Roques, P., et al., Phylogenetic characteristics of three new HIV-1 N strains and implications for the origin of group N. Aids, 2004. 18(10): p. 1371-81.
81. Simon, F., et al., Identification of a new human immunodeficiency virus type 1 distinct from group M and group O. Nat Med, 1998. 4(9): p. 1032-7.
82. Jin, M.J., et al., Infection of a yellow baboon with simian immunodeficiency virus from African green monkeys: evidence for cross-species transmission in the wild. J Virol, 1994. 68(12): p. 8454-60.
83. Sharp, P.M., D.L. Robertson, and B.H. Hahn, Cross-species transmission and recombination of 'AIDS' viruses. Philos Trans R Soc Lond B Biol Sci, 1995. 349(1327): p. 41-7.
84. Souquiere, S., et al., Wild Mandrillus sphinx are carriers of two types of lentivirus. J Virol, 2001. 75(15): p. 7086-96.
85. Bailes, E., et al., Hybrid origin of SIV in chimpanzees. Science, 2003. 300(5626): p. 1713.
86. Courgnaud, V., et al., Characterization of a novel simian immunodeficiency virus with a vpu gene from greater spot-nosed monkeys (Cercopithecus nictitans) provides new insights into simian/human immunodeficiency virus phylogeny. J Virol, 2002. 76(16): p. 8298-309.
87. Beer, B.E., et al., Characterization of novel simian immunodeficiency viruses from red-capped mangabeys from Nigeria (SIVrcmNG409 and -NG411). J Virol, 2001. 75(24): p. 12014-27.
88. Mitani, J.C. and D.P. Watts, Demographic influences on the hunting behavior of chimpanzees. Am J Phys Anthropol, 1999. 109(4): p. 439-54.
89. Uehara, S., Predation on mammals by the chimpanzee (Pan troglodytes). Primates, 1997. 38: p. 198-214.
90. Leitner, T., et al., Sequence Diversity among Chimpanzee Simian Immunodeficiency Viruses (SIVcpz) Suggests that SIVcpzPts Was Derived from SIVcpzPtt through Additional Recombination Events. AIDS Res Hum Retroviruses, 2007. 23(9): p. 1114-8.
91. Paraskevis, D., et al., Analysis of the evolutionary relationships of HIV-1 and SIVcpz sequences using bayesian inference: implications for the origin of HIV-1. Mol Biol Evol, 2003. 20(12): p. 1986-96.
92. Arien, K.K., et al., The replicative fitness of primary human immunodeficiency virus type 1 (HIV-1) group M, HIV-1 group O, and HIV-2 isolates. J Virol, 2005. 79(14): p. 8979-90.
93. Webby, R., E. Hoffmann, and R. Webster, Molecular constraints to interspecies transmission of viral pathogens. Nat Med, 2004. 10(12 Suppl): p. S77-81.
94. Vanderford, T.H., et al., Adaptation of a diverse simian immunodeficiency virus population to a new host is revealed through a systematic approach to identify amino acid sites under selection. Mol Biol Evol, 2007. 24(3): p. 660-9.
95. Wain, L.V., et al., Adaptation of HIV-1 to its human host. Mol Biol Evol, 2007. 24(8): p. 1853-60.
96. Sharp, P.M., et al., Origins and evolution of AIDS viruses. Biol Bull, 1999. 196(3): p. 338-42.
97. Wolfe, N.D., et al., Naturally acquired simian retrovirus infections in central African hunters. Lancet, 2004. 363(9413): p. 932-7.
98. Temin, H.M., Sex and recombination in retroviruses. Trends Genet, 1991. 7(3): p. 71-4.
99. Rambaut, A., et al., The causes and consequences of HIV evolution. Nat Rev Genet, 2004. 5(1): p. 52-61.
100. van Rij, R.P., et al., Evolution of R5 and X4 human immunodeficiency virus type 1 gag sequences in vivo: evidence for recombination. Virology, 2003. 314(1): p. 451-9.
101. Boisier, P., et al., Nationwide HIV prevalence survey in general population in Niger. Trop Med Int Health, 2004. 9(11): p. 1161-6.
102. Diaz, R.S., et al., Dual human immunodeficiency virus type 1 infection and recombination in a dually exposed transfusion recipient. The Transfusion Safety Study Group. J Virol, 1995. 69(6): p. 3273-81.
103. Fang, G., et al., Recombination following superinfection by HIV-1. Aids, 2004. 18(2): p. 153-9.
104. Salminen, M.O., et al., Evolution and probable transmission of intersubtype recombinant human immunodeficiency virus type 1 in a Zambian couple. J Virol, 1997. 71(4): p. 2647-55.
105. Yerly, S., et al., HIV-1 co/super-infection in intravenous drug users. Aids, 2004. 18(10): p. 1413-21.
106. Troyer, J.L., et al., Patterns of feline immunodeficiency virus multiple infection and genome divergence in a free-ranging population of African lions. J Virol, 2004. 78(7): p. 3777-91.
107. Takehisa, J., et al., Various types of HIV mixed infections in Cameroon. Virology, 1998. 245(1): p. 1-10.
108. Goulder, P.J. and B.D. Walker, HIV-1 superinfection--a word of caution. N Engl J Med, 2002. 347(10): p. 756-8.
109. Galetto, R., et al., Dissection of a circumscribed recombination hot spot in HIV-1 after a single infectious cycle. J Biol Chem, 2006. 281(5): p. 2711-20.
110. Lemey, P. and D. Posada, Book Chapter - Introduction to recombination detection. 2007.
111. Negroni, M. and H. Buc, Retroviral recombination: what drives the switch? Nat Rev Mol Cell Biol, 2001. 2(2): p. 151-5.
112. Baird, H.A., et al., Sequence determinants of breakpoint location during HIV-1 intersubtype recombination. Nucleic Acids Res, 2006. 34(18): p. 5203-16.
113. Fan, J., M. Negroni, and D.L. Robertson, The distribution of HIV-1 recombination breakpoints. Infect Genet Evol, 2007. 7(6): p. 717-23.
114. Magiorkinis, G., et al., In vivo characteristics of human immunodeficiency virus type 1 intersubtype recombination: determination of hot spots and correlation with sequence similarity. J Gen Virol, 2003. 84(Pt 10): p. 2715-22.
115. Zhang, J. and H.M. Temin, Retrovirus recombination depends on the length of sequence identity and is not error prone. J Virol, 1994. 68(4): p. 2409-14.
116. Moumen, A., et al., The HIV-1 repeated sequence R as a robust hot-spot for copy-choice recombination. Nucleic Acids Res, 2001. 29(18): p. 3814-21.
117. Klarmann, G.J., C.A. Schauber, and B.D. Preston, Template-directed pausing of DNA synthesis by HIV-1 reverse transcriptase during polymerization of HIV-1 sequences in vitro. J Biol Chem, 1993. 268(13): p. 9793-802.
118. Derebail, S.S. and J.J. DeStefano, Mechanistic analysis of pause site-dependent and -independent recombinogenic strand transfer from structurally diverse regions of the HIV genome. J Biol Chem, 2004. 279(46): p. 47446-54.
119. Lanciault, C. and J.J. Champoux, Pausing during reverse transcription increases the rate of retroviral recombination. J Virol, 2006. 80(5): p. 2483-94.
120. Roda, R.H., et al., Strand transfer occurs in retroviruses by a pause-initiated two-step mechanism. J Biol Chem, 2002. 277(49): p. 46900-11.
121. Baird, H.A., et al., Influence of sequence identity and unique breakpoints on the frequency of intersubtype HIV-1 recombination. Retrovirology, 2006. 3: p. 91.
122. Margot, N.A., J.M. Waters, and M.D. Miller, In vitro human immunodeficiency virus type 1 resistance selections with combinations of tenofovir and emtricitabine or abacavir and lamivudine. Antimicrob Agents Chemother, 2006. 50(12): p. 4087-95.
123. Perno, C.F., V. Svicher, and F. Ceccherini-Silberstein, Novel drug resistance mutations in HIV: recognition and clinical relevance. AIDS Rev, 2006. 8(4): p. 179-90.
124. Guillon, C., et al., Evidence for CTL-mediated selection of Tat and Rev mutants after the onset of the asymptomatic period during HIV type 1 infection. AIDS Res Hum Retroviruses, 2006. 22(12): p. 1283-92.
125. Pillay, T., et al., Unique acquisition of cytotoxic T-lymphocyte escape mutants in infant human immunodeficiency virus type 1 infection. J Virol, 2005. 79(18): p. 12100-5.
60
126. Iversen, A.K., et al., Conflicting selective forces affect T cell receptor contacts in an immunodominant human immunodeficiency virus epitope. Nat Immunol, 2006. 7(2): p. 179-89.
127. Ciurea, A., et al., CD4+ T-cell-epitope escape mutant virus selected in vivo. Nat Med, 2001. 7(7): p. 795-800.
128. Draenert, R., et al., Immune selection for altered antigen processing leads to cytotoxic T lymphocyte escape in chronic HIV-1 infection. J Exp Med, 2004. 199(7): p. 905-15.
129. Li, Y., et al., Broad HIV-1 neutralization mediated by CD4-binding site antibodies. Nat Med, 2007. 13(9): p. 1032-4.
130. Mthunzi, P. and D. Meyer, Limited cross-reactivity between different HIV-1 clades. J Clin Virol, 2004. 31 Suppl 1: p. S88-91.
131. McMichael, A.J., HIV vaccines. Annu Rev Immunol, 2006. 24: p. 227-55.
132. Borrow, P., et al., Virus-specific CD8+ cytotoxic T-lymphocyte activity associated with control of viremia in primary human immunodeficiency virus type 1 infection. J Virol, 1994. 68(9): p. 6103-10.
133. Koup, R.A., et al., Temporal association of cellular immune responses with the initial control of viremia in primary human immunodeficiency virus type 1 syndrome. J Virol, 1994. 68(7): p. 4650-5.
134. Musey, L., et al., Cytotoxic-T-cell responses, viral load, and disease progression in early human immunodeficiency virus type 1 infection. N Engl J Med, 1997. 337(18): p. 1267-74.
135. Rinaldo, C., et al., High levels of anti-human immunodeficiency virus type 1 (HIV-1) memory cytotoxic T-lymphocyte activity and low viral load are associated with lack of disease in HIV-1-infected long-term nonprogressors. J Virol, 1995. 69(9): p. 5838-42.
136. Wang, S., et al., Association between HIV Type 1-specific T cell responses and CD4+ T cell counts or CD4+:CD8+ T cell ratios in HIV Type 1 subtype B infection in China. AIDS Res Hum Retroviruses, 2006. 22(8): p. 780-7.
61
137. Yusim, K., et al., Clustering patterns of cytotoxic T-lymphocyte epitopes in human immunodeficiency virus type 1 (HIV-1) proteins reveal imprints of immune evasion on HIV-1 global variation. J Virol, 2002. 76(17): p. 8757-68.
138. Nickle, D.C., et al., Consensus and ancestral state HIV vaccines. Science, 2003. 299(5612): p. 1515-8; author reply 1515-8.
139. Fischer, W., et al., Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nat Med, 2007. 13(1): p. 100-6.
140. Nickle, D.C., et al., Coping with viral diversity in HIV vaccine design. PLoS Comput Biol, 2007. 3(4): p. e75.
141. Koot, M., et al., HIV-1 biological phenotype in long-term infected individuals evaluated with an MT-2 cocultivation assay. Aids, 1992. 6(1): p. 49-54.
142. Shankarappa, R., et al., Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol, 1999. 73(12): p. 10489-502.
143. Connor, R.I. and D.D. Ho, Human immunodeficiency virus type 1 variants with increased replicative capacity develop during the asymptomatic stage before disease progression. J Virol, 1994. 68(7): p. 4400-8.
144. Koot, M., et al., Conversion rate towards a syncytium-inducing (SI) phenotype during different stages of human immunodeficiency virus type 1 infection and prognostic value of SI phenotype for survival after AIDS diagnosis. J Infect Dis, 1999. 179(1): p. 254-8.
145. Starcich, B.R., et al., Identification and characterization of conserved and variable regions in the envelope gene of HTLV-III/LAV, the retrovirus of AIDS. Cell, 1986. 45(5): p. 637-48.
146. Leonard, C.K., et al., Assignment of intrachain disulfide bonds and characterization of potential glycosylation sites of the type 1 recombinant human immunodeficiency virus envelope glycoprotein (gp120) expressed in Chinese hamster ovary cells. J Biol Chem, 1990. 265(18): p. 10373-82.
62
147. De Jong, J.J., et al., Minimal requirements for the human immunodeficiency virus type 1 V3 domain to support the syncytium-inducing phenotype: analysis by single amino acid substitution. J Virol, 1992. 66(11): p. 6777-80.
148. Fouchier, R.A., et al., Phenotype-associated sequence variation in the third variable domain of the human immunodeficiency virus type 1 gp120 molecule. J Virol, 1992. 66(5): p. 3183-7.
149. Fouchier, R.A., et al., Simple determination of human immunodeficiency virus type 1 syncytium-inducing V3 genotype by PCR. J Clin Microbiol, 1995. 33(4): p. 906-11.
150. Dimitrov, D.S., et al., HIV coreceptors. J Membr Biol, 1998. 166(2): p. 75-90.
151. Moore, J.P., et al., The CCR5 and CXCR4 coreceptors--central to understanding the transmission and pathogenesis of human immunodeficiency virus type 1 infection. AIDS Res Hum Retroviruses, 2004. 20(1): p. 111-26.
152. Bjorndal, A., et al., Coreceptor usage of primary human immunodeficiency virus type 1 isolates varies according to biological phenotype. J Virol, 1997. 71(10): p. 7478-87.
153. Connor, R.I., et al., Change in coreceptor use coreceptor use correlates with disease progression in HIV-1--infected individuals. J Exp Med, 1997. 185(4): p. 621-8.
154. Berger, E.A., et al., A new classification for HIV-1. Nature, 1998. 391(6664): p. 240.
155. Agace, W.W., et al., Constitutive expression of stromal derived factor-1 by mucosal epithelia and its role in HIV transmission and propagation. Curr Biol, 2000. 10(6): p. 325-8.
156. Meng, G., et al., Primary intestinal epithelial cells selectively transfer R5 HIV-1 to CCR5+ cells. Nat Med, 2002. 8(2): p. 150-6.
157. Reece, J.C., et al., HIV-1 selection by epidermal dendritic cells during transmission across human skin. J Exp Med, 1998. 187(10): p. 1623-31.
63
158. Rodrigo, A.G., Dynamics of syncytium-inducing and non-syncytium-inducing type 1 human immunodeficiency viruses during primary infection. AIDS Res Hum Retroviruses, 1997. 13(17): p. 1447-51.
159. Arien, K.K., et al., Replicative fitness of CCR5-using and CXCR4-using human immunodeficiency virus type 1 biological clones. Virology, 2006. 347(1): p. 65-74.
160. Regoes, R.R. and S. Bonhoeffer, The HIV coreceptor switch: a population dynamical perspective. Trends Microbiol, 2005. 13(6): p. 269-77.
161. Pierson, T.C., R.W. Doms, and S. Pohlmann, Prospects of HIV-1 entry inhibitors as novel therapeutics. Rev Med Virol, 2004. 14(4): p. 255-70.
162. Ickovics, J.R. and C.S. Meade, Adherence to HAART among patients with HIV: breakthroughs and barriers. AIDS Care, 2002. 14(3): p. 309-18.
163. Kutilek, V.D., et al., Is resistance futile? Curr Drug Targets Infect Disord, 2003. 3(4): p. 295-309.
164. Little, S., et al., Antiretroviral-drug resistance among patients recently infected with HIV. Engl J Med, 2002. 347(6): p. 385-94.
165. Doranz, B., et al., A small-molecule inhibitor directed against the chemokine receptor CXCR4 prevents its use as an HIV-1 coreceptor. Journal of Experimental Medicine, 1997. 186(8): p. 1395-400.
166. Murakami, T., et al., A small molecule CXCR4 inhibitor that blocks T cell line-tropic HIV-1 infection. Journal Experimental Medicine, 1997. 186(8): p. 1389-93.
167. Schols, D., et al., Inhibition of T-tropic HIV strains by selective antagonization of the chemokine receptor CXCR4. J Exp Med, 1997. 186(8): p. 1383-8.
168. Zou, Y.R., et al., Function of the chemokine receptor CXCR4 in haematopoiesis and in cerebellar development. Nature, 1998. 393(6685): p. 595-9.
64
169. Scozzafava, A., A. Mastrolorenzo, and C. Supuran, Non-peptidic chemokine receptors antagonists as emerging anti-HIV agents. J Enzyme Inhib Med Chem, 2002. 17(2): p. 69-76.
170. O'Brien, T., et al., HIV-1 infection in a man homozygous for CCR5Δ32. The Lancet, 1997. 349(9060): p. 1219-1219.
171. Dean, M., et al., Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science, 1996. 273(5283): p. 1856-62.
172. Eugen-Olsen, J., et al., Heterozygosity for a deletion in the CKR-5 gene leads to prolonged AIDS-free survival and slower CD4 T-cell decline in a cohort of HIV-seropositive individuals. Aids, 1997. 11(3): p. 305-10.
173. Huang, Y., et al., The role of a mutant CCR5 allele in HIV-1 transmission and disease progression. Nat Med, 1996. 2(11): p. 1240-3.
174. Meyer, L., et al., Early protective effect of CCR-5 delta 32 heterozygosity on HIV-1 disease progression: relationship with viral load. The SEROCO Study Group. Aids, 1997. 11(11): p. F73-8.
175. Dorr, P., et al., Maraviroc (UK-427,857), a potent, orally bioavailable, and selective small-molecule inhibitor of chemokine receptor CCR5 with broad-spectrum anti-human immunodeficiency virus type 1 activity. Antimicrob Agents Chemother, 2005. 49(11): p. 4721-32.
176. Fatkenheuer, G., et al., Efficacy of short-term monotherapy with maraviroc, a new CCR5 antagonist, in patients infected with HIV-1. Nat Med, 2005. 11(11): p. 1170-2.
177. Westby, M., et al., Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol, 2006. 80(10): p. 4909-20.
178. Cardozo, T., et al., Structural basis for coreceptor selectivity by the HIV type 1 V3 loop. AIDS Res Hum Retroviruses, 2007. 23(3): p. 415-26.
65
179. Rosen, O., et al., Molecular switch for alternative conformations of the HIV-1 V3 region: implications for phenotype conversion. Proc Natl Acad Sci U S A, 2006. 103(38): p. 13950-5.
180. Jensen, M.A., et al., Improved coreceptor usage prediction and genotypic monitoring of R5-to-X4 transition by motif analysis of human immunodeficiency virus type 1 env V3 loop sequences. J Virol, 2003. 77(24): p. 13376-88.
181. Jensen, M.A., et al., A reliable phenotype predictor for human immunodeficiency virus type 1 subtype C based on envelope V3 sequences. J Virol, 2006. 80(10): p. 4698-704.
182. Sing, T., Beerenwinkel, N., Lengauer, T., Learning mixtures of localized rules by maximizing the area under the ROC curve. 2004.
183. Poveda, E., et al., Correlation between a phenotypic assay and three bioinformatic tools for determining HIV co-receptor use. Aids, 2007. 21(11): p. 1487-90.
184. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.
185. Sanger, F. and A.R. Coulson, A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol, 1975. 94(3): p. 441-8.
186. Ronaghi, M., M. Uhlen, and P. Nyren, A sequencing method based on real-time pyrophosphate. Science, 1998. 281(5375): p. 363, 365.
187. Huse, S.M., et al., Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol, 2007. 8(7): p. R143.
188. O'Meara, D., et al., Monitoring Resistance to Human Immunodeficiency Virus Type 1 Protease Inhibitors by Pyrosequencing. J. Clin. Microbiol., 2001. 39(2): p. 464-473.
189. Wang, C., et al., Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res, 2007. 17(8): p. 1195-201.
66
190. Hoffmann, C., et al., DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.
191. Trombetti, G.A., et al., Data handling strategies for high throughput pyrosequencers. BMC Bioinformatics, 2007. 8 Suppl 1: p. S22.
192. Lewis, M., et al., Evaluation of an Ultra-Deep Sequencing Method to Identify Minority Sequence Variants in the HIV-1 env Gene from Clinical Samples, in 14th Conference on Retroviruses and Opportunistic Infections. 2007: Los Angeles, USA.
67
Chapter 2: Sequence Data and Phylogenetic Trees
Molecular Phylogeny
Understanding evolutionary relationships between different organisms is a
fundamental aspect of modern day biology. Tree structures are generally
used to depict these relationships. In the days of Charles Darwin rough tree
sketches were based on fossil records, morphology and geographical
distribution [1]. This is no longer the case. With the advent of sequencing
technologies [2] and the realization that both DNA and amino acid
sequences could be used to accurately determine the relationship between
different organisms [3] a plethora of tree producing algorithms have emerged
[4] along with a branch of science referred to as molecular phylogeny.
Molecular phylogeny is the science of estimating evolutionary histories using
DNA and amino acid sequences.
The first step in producing an evolutionary history is the identification of
homologous sequences. These are sequences that share a common
ancestry [5]. There are different types of homology which include orthology
and paralogy. Orthologous sequences share similarities because they
diverged from a common ancestor through speciation. Paralogous sequences on the other
hand share similarities due to gene duplication events within an individual
species. To infer the evolutionary history between different organisms
orthologous sequences are required. These can be aligned after which trees
representing the evolutionary relationships between the sequences can be
inferred. To improve the accuracy of the evolutionary relationships within the
tree, models of sequence evolution are incorporated. Once a tree has been
created there are many programs available for viewing and analysing the
tree topology. In this chapter a few of the many aspects of
molecular phylogeny will be discussed.
Global Alignments
Orthologous HIV sequences can be obtained from the Los Alamos HIV
sequence database using the search interface provided at
http://www.hiv.lanl.gov. Before a tree can be inferred the sequences must
be aligned. The accuracy of the alignment generated will directly affect the
quality of any inference of phylogenetic history. In 1970 Needleman and
Wunsch published an algorithm for performing a global
pairwise alignment on two sequences [6]. The algorithm matches together
as many characters as possible between two input sequences regardless of
their lengths. It uses a process referred to as dynamic programming and is
guaranteed to find the alignment with the highest score. The score between
two sequences provides information about their evolutionary relationship to
each other. When more than two sequences are present the scores
between all combinations of sequence pairs form the starting point for
producing a multiple alignment. The most famous programs implementing
this algorithm are the Clustal series of programs [7-10] and the more recent
Muscle [11]. In these programs the Needleman and Wunsch algorithm is
used to align all pairs of sequences within the input dataset in order to obtain
their pairwise scores. These scores are then used in the construction of a
rough guide tree which is in turn used to create a multiple sequence
alignment.
Given two unaligned nucleotide sequences, e.g. seq1 = ATGCTT and seq2 =
ATCTA, the first step to applying the Needleman and Wunsch algorithm is to
define a scoring system in order to score any two matched or unmatched
nucleotides. For example: if the nucleotides are identical the column scores
+1, if the nucleotides are different the column scores -1 and if a gap is
present the column scores -2. Usually there is a distinction between the gap
opening penalty and the extension penalty as from an evolutionary
perspective opening the first gap in a sequence is considered less likely
than extending an already existing gap. For nucleotides there is often a
distinction between transitions and transversions as transitions are more
frequent. Ideally at this point a model of nucleotide evolution would be
incorporated into the alignment process. However methods for incorporating
such models directly into the multiple alignment process are not available
and so models are incorporated during the tree inference process. For
amino acid alignments this scoring system is usually based on a substitution
matrix such as the BLOSUM (Blocks Amino Acid Substitution Matrices) [12]
or PAM (Percent Accepted Mutation) [13] series. These matrices generally
reflect the probabilities of one amino acid mutating to another based on large
datasets of pairwise alignments.
From left to right any partial alignment can now be scored by the sum of the
scores for each column so far. For example using the simple system
outlined above:
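\[
\mathrm{score}\begin{pmatrix}\mathtt{ATG}\\ \mathtt{ATC}\end{pmatrix} = 1
\qquad\text{and}\qquad
\mathrm{score}\begin{pmatrix}\mathtt{ATG}\\ \mathtt{AT-}\end{pmatrix} = 0
\]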
When a column is added to the right in order to score the alignment we only
need to look at the added column and use the knowledge of the alignment
score of everything that has gone before that column:
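\[
\mathrm{score}\begin{pmatrix}\mathtt{ATGC}\\ \mathtt{ATCT}\end{pmatrix}
= \mathrm{score}\begin{pmatrix}\mathtt{ATG}\\ \mathtt{ATC}\end{pmatrix}
+ \mathrm{score}\begin{pmatrix}\mathtt{C}\\ \mathtt{T}\end{pmatrix}
= 1 + (-1) = 0
\]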
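The column-by-column scoring described above can be sketched in Python. This is a minimal illustration that assumes the simple scoring system from the text (+1 match, -1 mismatch, -2 gap, no separate gap opening penalty); the function names are hypothetical and are not taken from any alignment package.

```python
# Simple scoring system from the text: match +1, mismatch -1, gap -2.
MATCH, MISMATCH, GAP = 1, -1, -2


def column_score(c1, c2):
    """Score a single alignment column; '-' denotes a gap."""
    if c1 == "-" or c2 == "-":
        return GAP
    return MATCH if c1 == c2 else MISMATCH


def alignment_score(row1, row2):
    """Score two equal-length aligned rows by summing the column scores."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(column_score(a, b) for a, b in zip(row1, row2))


# Extending an alignment only needs the previous score and the new column.
previous = alignment_score("ATG", "ATC")
extended = previous + column_score("C", "T")
```

Because the score of an extended alignment depends only on the previous score and the added column, partial scores can be stored and reused, which is what the scoring matrix does.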
Using this method of scoring we can define a scoring matrix (Fig. 2.1) in
which the best score at any given position, with the exception of the first row
and column, is the maximum of one of three values:
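\[
\mathrm{score}(\mathit{index1}, \mathit{index2}) = \max
\begin{cases}
\mathrm{score}(\mathit{index1} - 1,\ \mathit{index2} - 1) + p(c1, c2) \\
\mathrm{score}(\mathit{index1},\ \mathit{index2} - 1) - 2 \\
\mathrm{score}(\mathit{index1} - 1,\ \mathit{index2}) - 2
\end{cases}
\]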
Where p(c1, c2), based on the simple scoring system above, returns 1 if c1 =
c2 and -1 if c1 ≠ c2, c1 is the character at index1 on seq1 and c2 is the
character at index2 on seq2. The -2 represents the gap penalty.
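As a rough sketch, the scoring matrix can be populated as follows in Python. The text does not say how the first row and column are filled, so the cumulative gap penalties used here are an assumption (it is the usual choice for a global alignment); the names fill_matrix and p are illustrative only.

```python
GAP = -2  # gap penalty from the simple scoring system in the text


def p(c1, c2):
    """Return 1 for identical characters and -1 otherwise."""
    return 1 if c1 == c2 else -1


def fill_matrix(seq1, seq2):
    """Populate the global alignment scoring matrix.

    score[i][j] holds the best score for aligning the first i characters
    of seq1 with the first j characters of seq2.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Assumption: the first row and column hold cumulative gap penalties.
    for i in range(1, rows):
        score[i][0] = i * GAP
    for j in range(1, cols):
        score[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + p(seq1[i - 1], seq2[j - 1]),  # no gap
                score[i][j - 1] + GAP,  # gap in sequence 1
                score[i - 1][j] + GAP,  # gap in sequence 2
            )
    return score


matrix = fill_matrix("ATGCTT", "ATCTA")
```

The bottom right cell, matrix[-1][-1], holds the score of the best global alignment of the two example sequences.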
Figure 2.1 Scoring Matrix for Aligning Two Sequences
A scoring matrix generated for the two example sequences ATGCTT (sequence 1) and
ATCTA (sequence 2). Yellow indicates the score of the alignment so far by the addition of
the prefix C:C. This depends on one of 3 choices as described in the text. Red is the score
of everything that went before when no gap was introduced into the last position. Blue
indicates the previous score when a gap was introduced into the first sequence while green
indicates the previous score when a gap was introduced into sequence 2.
Once the scoring matrix has been populated it can be used by a recursive
algorithm to find the optimum alignment. The algorithm starts at the lower
right hand corner of the matrix and recursively follows the best path into the
matrix until the start has been reached (Fig. 2.2 - left). When the recursion
has ended, a process referred to as trace-back begins (Fig. 2.2 - right). This
is the return journey through the matrix for each of the previous recursive
calls. At each step a column is added to the alignment. In the addition of a
column one of three options is chosen:
i. No gap being inserted and both growing sequences being extended
by the next nucleotide (resulting in either a direct match or mismatch).
ii. A gap being inserted in sequence 1 while the next nucleotide is
inserted into sequence 2.
iii. A gap being inserted in sequence 2 while the next nucleotide is
inserted into sequence 1.
Once the trace-back is complete both sequences have been optimally
globally pairwise aligned and the relationship between the two sequences
can be assessed.
Figure 2.2 Recursion and Trace-back
Once recursion has been completed (left – red arrows) trace-back begins (right). During
trace-back the pairwise alignment between the two input sequences is constructed.
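The fill and trace-back steps can be combined into a short Python sketch. Two points are assumptions rather than details from the text: the first row and column are filled with cumulative gap penalties, and the recursive descent followed by trace-back is replaced by an equivalent loop that walks from the lower right hand corner back to the start. Where several paths score equally, this sketch arbitrarily prefers the diagonal step.

```python
GAP = -2


def p(c1, c2):
    """Return 1 for identical characters and -1 otherwise."""
    return 1 if c1 == c2 else -1


def needleman_wunsch(seq1, seq2):
    """Return (aligned seq1, aligned seq2, score) for a global alignment."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + p(seq1[i - 1], seq2[j - 1]),
                score[i][j - 1] + GAP,
                score[i - 1][j] + GAP,
            )

    # Walk back from the lower right corner, adding one column per step.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + p(seq1[i - 1], seq2[j - 1])):
            row1.append(seq1[i - 1])  # option i: no gap
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            row1.append(seq1[i - 1])  # option iii: gap in sequence 2
            row2.append("-")
            i -= 1
        else:
            row1.append("-")  # option ii: gap in sequence 1
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), score[n][m]


aligned1, aligned2, best = needleman_wunsch("ATGCTT", "ATCTA")
```

The columns are collected in reverse order during the walk and reversed at the end, which mirrors the way trace-back builds the alignment on the return journey through the matrix.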
Local Alignments
Often only small fragments of sequences are available. To find the
position of such fragments in relation to a larger sequence, or to pairwise
align them to each other, a local pairwise alignment is more desirable. This
is because it is not desirable to spread the smaller sequence out
along the full length of the larger sequence in an attempt to match as
many characters as possible, as the fragment is an already intact part of
some gene. During a local alignment an attempt is made to align sequences
in relation to positions with the highest density of matches. The Smith-
Waterman algorithm for locally aligning two sequences is a slight
modification to the Needleman and Wunsch pairwise alignment algorithm
[14]. Firstly when defining the scoring matrix negatively scoring cells are set
to zero. To do this the scoring formula score(index1, index2) becomes:
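\[
\mathrm{score}(\mathit{index1}, \mathit{index2}) = \max
\begin{cases}
\mathrm{score}(\mathit{index1} - 1,\ \mathit{index2} - 1) + p(c1, c2) \\
\mathrm{score}(\mathit{index1},\ \mathit{index2} - 1) - 2 \\
\mathrm{score}(\mathit{index1} - 1,\ \mathit{index2}) - 2 \\
0
\end{cases}
\]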
This renders all the possible local alignments visible. A second modification
is then made at the start of the recursion process. Instead of starting at the lower
right hand corner the recursive process starts at the highest scoring cell and
proceeds until a cell of score zero is encountered. The path back from this
recursion will produce the best local alignment. Local alignments form an
integral part of the analysis presented in chapter 6 where many thousands of
short sequences must be indexed against a consensus template sequence.
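The two modifications can be illustrated with a short Python sketch. It reuses the simple scoring system from the global alignment example (+1, -1, -2), replaces the recursion with a loop, and uses made-up example sequences; it is not the implementation used in chapter 6.

```python
GAP = -2


def p(c1, c2):
    """Return 1 for identical characters and -1 otherwise."""
    return 1 if c1 == c2 else -1


def smith_waterman(seq1, seq2):
    """Return (aligned part of seq1, aligned part of seq2, score)."""
    n, m = len(seq1), len(seq2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # First modification: negative scores are replaced by zero.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + p(seq1[i - 1], seq2[j - 1]),
                score[i][j - 1] + GAP,
                score[i - 1][j] + GAP,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Second modification: start at the highest scoring cell and stop
    # when a cell with score zero is reached.
    row1, row2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + p(seq1[i - 1], seq2[j - 1]):
            row1.append(seq1[i - 1])
            row2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + GAP:
            row1.append(seq1[i - 1])
            row2.append("-")
            i -= 1
        else:
            row1.append("-")
            row2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2)), best


# Hypothetical example: a fragment sharing the region ATGCAT with a template.
local1, local2, local_score = smith_waterman("CCCCATGCATCCCC", "GGATGCATGG")
```

Only the shared region is reported; the flanking characters, which would be forced into mismatches or gaps by a global alignment, are left out.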
Measuring Genetic Change
Once an alignment has been constructed the relationship between the
sequences can be derived from their evolutionary distances to each other. A
simple measure of evolutionary distance is a count of the number of sites
that differ between any two sequences, usually expressed as a proportion of
the sites compared. This is known as the observed distance or p-distance.
However the p-distance can underestimate
the number of substitutions that have actually taken place. This is a
result of individual sites potentially undergoing multiple substitutions which
cannot be detected by a simple count of observed differences. Such a
mutation at an individual site may include an A changing to a T and then to a
C. Here two mutations have occurred but the p-distance will only account for
one. There is also the possibility of back mutation where an A may change
to a T and then back to an A. In this case the p-distance will not take any
mutation into account. A general rule is that the larger the p-distance the more
room for error [15]. Over a large amount of time and/or when dealing with
rapidly evolving genomes, such as HIV, it is therefore important to attempt to
rectify this shortcoming of the p-distance to increase the accuracy of a
phylogenetic inference.
As a result many distance correction methods have been developed. These
take into account various parameters such as base frequencies and
transversion and transition change rates. The most common models to date
are Jukes-Cantor [16], Kimura 2 Parameter [17], Felsenstein 81 [18], HKY85
[19] and the General Time Reversible model [20]. These models
are all related to each other in a nested manner. Jukes-Cantor was the first
model proposed and assumes that the four bases have equal frequencies
and that all substitutions are equally likely. Kimura 2 Parameter extends the
Jukes-Cantor model by allowing for differences in the rates of transversions
and transitions. Felsenstein 81 extends the Jukes-Cantor model by allowing
for different base frequencies. The HKY85 model combines both the Kimura
2 Parameter and the Felsenstein 81 models to allow for different transition
and transversion rates as well as to allow for varying base frequencies while
the General Time Reversible model assumes that substitutions within each
pair of nucleotides are reversible, with a single exchange rate per pair,
although each of the six pairs can have a different rate from the others.
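To make the effect of a correction concrete, the sketch below computes the p-distance together with the Jukes-Cantor and Kimura 2 Parameter distances in Python. The closed-form expressions are the standard textbook ones rather than formulas given in this chapter, the function names are hypothetical, and the input is assumed to be two aligned, gap-free sequences of equal length.

```python
import math

PURINES = {"A", "G"}


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor distance: d = -(3/4) ln(1 - 4p/3)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(seq1, seq2):
    """Kimura 2 Parameter distance from the proportions of transitions (P)
    and transversions (Q): d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q))."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    P = transitions / len(seq1)
    Q = transversions / len(seq1)
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


# Hypothetical pair: 30 sites with one transition and two transversions.
s1 = "A" * 30
s2 = "GCT" + "A" * 27
observed = p_distance(s1, s2)
corrected = jukes_cantor(observed)
```

The corrected distance is always at least as large as the observed one, and the gap between them widens as the p-distance grows. When transversions are exactly twice as frequent as transitions, as in this example, the two corrections give the same value.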
In these models a simplifying assumption, at the expense of biological
realism, is that each position in the sequence is equally likely to undergo a
substitution. Due to varying functional constraints on different genes and
non-coding regions this assumption is known to be false. In an attempt to
overcome the problem models can be modified in order to allow for
variation in rate among different sites within an alignment e.g. HKY85 + Γ (where Γ
stands for rate variation) [21]. The most common distribution used to
describe this rate variation is the gamma distribution [22]. This distribution
has a parameter, α, that defines its shape. When α is small (<1) most sites
have a very low rate of variation but a few have a very large rate of variation.
This results in an L-shaped distribution. As α becomes larger (>1) the
distribution becomes more bell shaped resulting in sites having a smaller
range in rate variation. Estimates of α from various nuclear and
mitochondrial genes [21] range from 0.16 to 1.37 although when codon
positions are analyzed separately the value for the first and second position
is generally much smaller than that for the third position.
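The influence of α on the shape of the distribution can be explored with a few lines of Python. The sketch assumes that the mean rate is fixed at 1, a common convention that is not stated in the text, so that α is the only free parameter; the function name is hypothetical.

```python
import math


def gamma_rate_density(r, alpha):
    """Density at relative rate r of a gamma distribution with shape alpha
    and mean 1 (an assumed convention, giving a rate parameter of alpha)."""
    return (alpha ** alpha) * r ** (alpha - 1.0) * math.exp(-alpha * r) / math.gamma(alpha)


# Compare the extremes of the range quoted in the text with a larger value.
for alpha in (0.16, 1.37, 5.0):
    densities = [gamma_rate_density(r, alpha) for r in (0.01, 0.5, 1.0, 2.0)]
    print(alpha, [round(d, 3) for d in densities])
```

For α below 1 the density falls continuously as the rate increases, giving the L shape, while for larger α it rises to a peak below the mean rate before falling again.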
The larger α value of the third position of codons suggests that there is
generally a higher rate of variation at these sites. This is because many
nucleotide substitutions at these sites do not affect the translation of the
codon [22], i.e. they are synonymous substitutions, and thus they are largely
free from natural selection [23]. Using the ratio between the rates of
non-synonymous (dn) and synonymous (ds) substitutions within a genome (dn/ds)
regions of positive selection can easily be identified. Once a model of
nucleotide evolution has been selected it can be used in the inference of a
tree with increased accuracy in relation to the genetic distances derived from
the alignment. Such models are generally not used for amino acid
alignments due to the increased complexity in dealing with 20 characters
instead of 4. For amino acids rates of evolution are dependent on the
substitution matrix used.
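The statement that many third position substitutions are synonymous can be checked directly against the standard genetic code. The Python sketch below is an illustration only: it counts single nucleotide changes between sense codons and ignores any change to or from a stop codon, which is a simplifying assumption, and it does not estimate dn/ds.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that leave the encoded amino acid unchanged."""
    synonymous = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # skip stop codons
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            if CODON_TABLE[mutant] == "*":
                continue  # skip changes that create a stop codon
            total += 1
            synonymous += CODON_TABLE[mutant] == amino_acid
    return synonymous / total


fractions = [synonymous_fraction(position) for position in range(3)]
```

Under these assumptions the third position has by far the largest synonymous fraction, the first position a small one and the second position none at all.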
Phylogenetic Trees
Once an alignment has been generated and an appropriate model of
sequence evolution has been selected a phylogenetic tree can be inferred.
Trees can be used to graphically depict the relationship among sequences
within the alignment. A tree is a mathematical structure that can be used to
represent relationships between different objects. The tree itself consists
of internal nodes, external nodes and edges (Fig. 2.3, Panel A and B). From
a biological point of view the external nodes (green) are the input
sequences. These can also be called leaf nodes. The internal nodes (blue)
represent the ancestral relationships between these sequences. These
eventually converge to the root of the tree – the estimated most recent
common ancestor. Often a closely related sequence, that is not part of the
sampled dataset (red dotted line), will be added to the alignment so that it is
easier to determine where the true root (red) lies within the group. This is
referred to as an outgroup.
Edge lengths connecting the nodes represent the amount of change that
occurs between each node. In figure 2.3 (panels A and B) because the edge
lengths have been taken into account the tree is referred to as an additive
tree. Edge lengths are calculated directly from the alignment and can vary
depending on the model of evolution that is used. A cladogram is a type of
tree that would only show the relationships between strains relative to each
other without any information on evolutionary distances. Dendrograms are a
special kind of additive tree where all the strains are the same distance from
the root. These can be used to infer a molecular clock by determining the
amount of change that has taken place in relation to time.
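A tree of this kind is straightforward to represent in code. The Python sketch below is one possible representation, not the one used by any program mentioned in this thesis: each node stores the length of the edge to its parent and a list of children, and a helper checks whether all leaves are the same distance from the root, as required for the dendrograms described above. All names and edge lengths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""              # leaf label; empty for internal nodes
    edge_length: float = 0.0    # length of the edge leading to the parent
    children: list = field(default_factory=list)


def leaf_depths(node, depth=0.0):
    """Map each leaf name to its summed edge length from the given node."""
    if not node.children:
        return {node.name: depth}
    depths = {}
    for child in node.children:
        depths.update(leaf_depths(child, depth + child.edge_length))
    return depths


def all_leaves_equidistant(root, tolerance=1e-9):
    """True if every leaf is the same distance from the root."""
    depths = list(leaf_depths(root).values())
    return max(depths) - min(depths) <= tolerance


# Additive tree in which A, B and C are all 0.3 from the root.
clock_like = Node(children=[
    Node(edge_length=0.1, children=[Node("A", 0.2), Node("B", 0.2)]),
    Node("C", 0.3),
])
# The same topology with a longer edge leading to C.
not_clock_like = Node(children=[
    Node(edge_length=0.1, children=[Node("A", 0.2), Node("B", 0.2)]),
    Node("C", 0.5),
])
```

Discarding the edge_length values and keeping only the children lists would reduce either tree to a cladogram.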
During the course of this thesis two different tree inference methods,
neighbour joining [24] and maximum likelihood [25], were used. Neighbour
joining falls into a category of tree inference methods referred to as
distance-matrix methods while maximum likelihood comes under the category of
discrete data methods [21].
Figure 2.3 Example of a Phylogenetic Tree
(A) Randomly generated tree consisting of 8 hypothetical strains labelled A – H. Internal
nodes are marked in blue and represent the ancestral relationships between the outer leaf
nodes (green). The edge lengths represent the expected amount of evolutionary change
between each of the nodes. The red dotted line represents a hypothetical outgroup that
could be used to find the root of the tree (red dot). (B) Identical tree but represented using a
different trigonometric pattern. Displaying trees in various patterns is a task of the tree
viewing software and has nothing to do with the actual tree generation method used.
Neighbour Joining
Neighbour joining relies on having the evolutionary distances between all
pairs of sequences within the dataset. These can be simple p distances or
they can incorporate a model of sequence evolution. Sequences are
heuristically clustered two at a time to eventually produce a tree that
represents the phylogenetic relationships within the input dataset [26]. The
algorithm itself maintains an active node list, designated L, and a growing
tree, T. Initially all the leaf nodes are assigned to L and T. During the first
iteration of the algorithm if two nodes from L, say i and j, are found to be
closest to each other then a new node k is created that will be the direct
ancestor of both i and j. The distance between k and each of the other nodes
within L is then calculated. Once these distances have been obtained k is
added to the tree. The distances between i and k and between j and k are then
calculated. k is added to L while i and j are removed from L (but not from T).
The process of selecting nearest neighbours i and j from L is then repeated.
The algorithm ends when there are just two nodes left in L. These are joined
together by an edge and added to the complete tree.
One issue that must be accounted for when re-constructing a neighbour
joining tree is that the closest pair of leaves are not necessarily neighbours
from an evolutionary perspective. When selecting i and j from L the distance
between them is compensated by subtracting the mean distance of each of
them to all other nodes within L. According to a proof presented in [26] this
is guaranteed to select the nearest evolutionary neighbours. Programs and
packages that can be used to calculate neighbour joining trees include the
Phylogenetic Analysis Library (PAL) [27], Geneious [28], Clustal [8-10],
PAUP* [29] and Phylip [30].
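The procedure can be sketched in Python as follows. The pair selection uses the commonly quoted criterion (n - 2)d(i, j) - r(i) - r(j), where r is the summed distance from a node to all other active nodes; this is assumed here to correspond to the compensation described above, and the edge length formulas are the standard ones rather than ones given in the text. The function name, the internal node labels and the small distance matrix are all hypothetical.

```python
def neighbour_joining(names, dist):
    """Return a list of (node, joined_to, edge_length) tuples.

    names: leaf labels; dist: dict of dicts of pairwise distances.
    """
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    active = list(names)                   # the active node list L
    edges = []                             # the growing tree T
    next_id = 0
    while len(active) > 2:
        n = len(active)
        r = {a: sum(d[a][b] for b in active if b != a) for a in active}
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                i, j = active[x], active[y]
                q = (n - 2) * d[i][j] - r[i] - r[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        k = f"node{next_id}"  # new direct ancestor of i and j
        next_id += 1
        # Edge lengths from i and j to the new node k.
        d_ik = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        d_jk = d[i][j] - d_ik
        edges.append((i, k, d_ik))
        edges.append((j, k, d_jk))
        # Distances from k to every other node still in L.
        d[k] = {}
        for m in active:
            if m in (i, j):
                continue
            d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, d[i][j]))  # join the final two nodes
    return edges


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {a: {b: rows[x][y] for y, b in enumerate(names)} for x, a in enumerate(names)}
edges = neighbour_joining(names, dist)
```

In this example a and b are joined first even though d and e are the closest pair, which illustrates why the raw distances need to be compensated before neighbours are selected.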
The neighbour joining algorithm combines speed and accuracy. The tree
produced reflects the relationships defined in an input distance matrix [4, 31]
– where the distances reflect the expected change since any two paired
sequences diverged. However this makes the tree sensitive to the quality of
these distances and so care must be taken when choosing an appropriate
model of evolution. Because neighbour joining is a clustering algorithm it
does not explicitly optimize the fit between the tree and the data. A side
effect of this is that only one tree is produced and so there is no way of
viewing other potentially reasonable trees [21] nor can any statistical
information about the correctness of branches be generated. Techniques
such as bootstrapping are relied upon in order to infer the reliability of branch
order [21]. Bootstrapping is performed by randomly sampling, with replacement, the columns of
the input multiple alignment (Fig. 2.4). For the random sample a tree is then
inferred. This sampling-inference process is usually repeated 1000 times.
The branch order on the tree inferred from the original alignment is then compared to the trees inferred from the random samples.
In general the more random samples that support a given branch order on
the tree the more reliable that branch order is. The minimum cut off for
reliability is normally about 70%.
Figure 2.4 Bootstrap Analysis
A neighbour joining tree (red box) is inferred from the input alignment (blue box).
Columns on the input alignment are then randomly sampled (green boxes) and a tree
inferred. This sampling-inference process is repeated – usually 1000 times. Branches on
the correct tree are compared to branches on the trees from the random samples (circle).
Modified from [32].
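The sampling step and the final comparison can be sketched in Python. The tree inference itself is left out, so this is only an outline: it assumes that each inferred tree has already been reduced to a set of clades, each clade being a frozenset of leaf names, which is a representation chosen here for illustration. The alignment and clade sets are made up.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence names to aligned sequences of equal
    length; rng: a random.Random instance.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def branch_support(original_clades, replicate_clade_sets):
    """Fraction of replicate trees that contain each clade of the original tree."""
    total = len(replicate_clade_sets)
    return {
        clade: sum(clade in replicate for replicate in replicate_clade_sets) / total
        for clade in original_clades
    }


# Hypothetical three-sequence alignment.
alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGAAC"}
replicate = bootstrap_alignment(alignment, random.Random(1))

# Hypothetical clade sets standing in for four replicate trees.
ab = frozenset({"A", "B"})
cd = frozenset({"C", "D"})
support = branch_support([ab, cd], [{ab, cd}, {ab}, {ab, cd}, {cd}])
```

Each replicate has the same dimensions as the input alignment, but some columns appear several times and others not at all.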
Maximum Likelihood Trees
For the maximum likelihood approach to tree inference the tree selected is
the tree that gives the highest probability of producing the input multiple
sequence alignment [22]. This is slower than neighbour joining as many
tree topologies must be examined using an appropriate heuristic. Initially a
model of evolution is chosen. For a single tree the likelihood of producing
each site of the input alignment is then calculated by summing the
probabilities of every possible ancestral state. The likelihood for the full tree
is the product of the likelihood at each site. The Felsenstein algorithm can
be used for calculating these likelihood scores for individual trees [18]. This
is then repeated for all possible tree topologies and the tree with the highest
likelihood score is the tree with the highest probability of producing the input
alignment. However, the number of possible tree topologies increases
rapidly with the number of taxa. Generally phylogenetic trees are
bifurcating [22], meaning that two edges descend from each internal
node. For a bifurcating tree the number of rooted tree topologies
possible for n taxa is given by (2n – 3)!! = 1 × 3 × 5 × ... × (2n – 3) [26].
Thus for 2, 3, 4, 5, 6, 7, 8, 9 and 10 taxa the numbers of possible trees
that must be scored are 1, 3, 15, 105, 945, 10,395, 135,135, 2,027,025 and
34,459,425, respectively. With such large numbers of trees, heuristic
algorithms that traverse fitness peaks within tree space are required in
order to find maximum likelihood trees [33].
Programs and packages that implement such heuristics in order to calculate
maximum likelihood trees include PAL [27], Geneious [28], PAUP* [29] and
Phylip [30]. Maximum likelihood, unlike neighbour joining, does optimize the
fit between the tree and the data as it searches for the tree topology that
most likely gave rise to the input dataset. As a result the probability of
branches being correct can be assigned to the tree. Maximum likelihood is a
widely used inference method and is considered to produce the most
accurate result [25].
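The per-site likelihood calculation and the growth in the number of topologies can both be sketched in Python. The sketch below is illustrative only. It assumes the Jukes-Cantor model (chosen for brevity, with equal base frequencies) and a hypothetical tree representation in which a leaf is a taxon name and an internal node is a list of (child, branch length) pairs. The counts of rooted trees quoted above are the products of the odd numbers 1 × 3 × ... × (2n – 3).

```python
import math

BASES = "ACGT"

def jc69_prob(x, y, t):
    """Jukes-Cantor probability that base x becomes base y along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def partial_likelihoods(node, site):
    """Felsenstein pruning: likelihood of the data below `node` for each state at `node`.

    `site` maps a taxon name to the base observed at one alignment column.
    """
    if isinstance(node, str):
        return {x: 1.0 if x == site[node] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial_likelihoods(child, site)
        for x in BASES:
            # Sum over every possible state y at the child end of the branch.
            result[x] *= sum(jc69_prob(x, y, t) * below[y] for y in BASES)
    return result

def log_likelihood(tree, alignment):
    """Sum of per-site log likelihoods, i.e. the log of the product over sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        root = partial_likelihoods(tree, site)
        total += math.log(sum(0.25 * root[x] for x in BASES))
    return total

def rooted_tree_count(n):
    """Number of rooted bifurcating topologies for n labelled taxa: (2n - 3)!!."""
    if n < 2:
        raise ValueError("need at least two taxa")
    count = 1
    # Product of the odd numbers 3, 5, ..., 2n - 3 (empty product for n = 2).
    for k in range(3, 2 * n - 2, 2):
        count *= k
    return count
```

For example, `log_likelihood([("a", 0.1), ([("b", 0.05), ("c", 0.05)], 0.05)], {"a": "ACGT", "b": "ACGA", "c": "ACGA"})` scores one candidate topology; a heuristic search would repeat this over many topologies and branch lengths.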
Once a tree has been generated the next step is to use tree viewing
software to help view and analyze the tree. Many such programs are
available and the next chapter discusses one program, CTree [34], that was
developed for this project.
References
1. Darwin, C., On the origin of species, ed. L.J. Murray. 1859.
2. Sanger, F. and A.R. Coulson, A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol, 1975. 94(3): p. 441-8.
3. Zuckerkandl, E. and L. Pauling, Molecules as documents of evolutionary history. J Theor Biol, 1965. 8(2): p. 357-66.
4. Felsenstein, J., Inferring Phylogenies. 2003(2): p. 664.
5. Koonin, E.V., Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet, 2005. 39: p. 309-38.
6. Needleman, S.B. and C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 1970. 48(3): p. 443-53.
7. Chenna, R., et al., Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res, 2003. 31(13): p. 3497-500.
8. Higgins, D.G., J.D. Thompson, and T.J. Gibson, Using CLUSTAL for multiple sequence alignments. Methods Enzymol, 1996. 266: p. 383-402.
9. Larkin, M.A., et al., Clustal W and Clustal X version 2.0. Bioinformatics, 2007. 23(21): p. 2947-8.
10. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.
11. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.
12. Henikoff, S. and J.G. Henikoff, Automated assembly of protein blocks for database searching. Nucleic Acids Res, 1991. 19(23): p. 6565-72.
13. Dayhoff, M., Survey of new data and computer methods of analysis. Atlas of protein sequence and structure, 1978. 5.
14. Smith, T.F., Waterman, M.S., Comparison of biosequences. Advances in Applied Mathematics, 1981. 2.
15. Salemi, M., Vandamme, A.M., The Phylogenetic Handbook: A Practical Approach to DNA and Protein Phylogeny. 2004: Cambridge University Press. 406.
16. Jukes, T.H., Cantor, C., Evolution of Protein Molecules. In Mammalian Protein Metabolism. 1969: New York: Academic Press.
17. Kimura, M., A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol, 1980. 16(2): p. 111-20.
18. Felsenstein, J., Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol, 1981. 17(6): p. 368-76.
19. Hasegawa, M., H. Kishino, and T. Yano, Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol, 1985. 22(2): p. 160-74.
20. Rodriguez, F., et al., The general stochastic model of nucleotide substitution. J Theor Biol, 1990. 142(4): p. 485-501.
21. Page, R.D.M. and E.C. Holmes, Molecular Evolution: a Phylogenetic Approach. 1998.
22. Nei, M. and S. Kumar, Molecular Evolution and Phylogenetics. 1333 ed. 2000: Oxford University Press.
23. Miyata, T., T. Yasunaga, and T. Nishida, Nucleotide sequence divergence and functional constraint in mRNA evolution. Proc Natl Acad Sci U S A, 1980. 77(12): p. 7328-32.
24. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.
25. Steel, M. and D. Penny, Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol, 2000. 17(6): p. 839-50.
26. Durbin, R., Eddy, S., Krogh, A., Mitchison, G., Biological sequence analysis: Probabilistic models of proteins and nucleic acids. 1998: Cambridge University Press.
27. Drummond, A. and K. Strimmer, PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 2001. 17(7): p. 662-3.
28. Drummond, A.J., Ashton, B., Cheung, M., Heled, J., Kearse, M., Moir, R., Stones-Havas, S., Thierer, T., Wilson, A., Geneious. 2007. p. http://www.geneious.com/.
29. Swofford, D.L., PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). 2003, Sinauer Associates, Sunderland, Massachusetts.
30. Felsenstein, J., PHYLIP (Phylogeny Inference Package). 2005, Department of Genome Sciences, University of Washington, Seattle.
31. Takahashi, K. and M. Nei, Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol, 2000. 17(8): p. 1251-8.
32. Baldauf, S.L., Phylogeny for the faint of heart: a tutorial. Trends Genet, 2003. 19(6): p. 345-51.
33. Whelan, S., New approaches to phylogenetic tree search and their application to large numbers of protein alignments. Syst Biol, 2007. 56(5): p. 727-40.
34. Archer, J. and D.L. Robertson, CTree: comparison of clusters between phylogenetic trees made easy. Bioinformatics, 2007. 23(21): p. 2952-3.
Chapter 3: CTree - comparison of clusters between phylogenetic trees made easy
Abstract
CTree has been designed for the quantification of clusters within viral
phylogenetic tree topologies. Clusters are stored as individual data
structures from which statistical data, such as the Subtype Diversity Ratio
(SDR), Subtype Diversity Variance (SDV) and pairwise distances can be
extracted. This simplifies the quantification of tree topologies in relation to
inter and intra cluster diversity. Here the novel features incorporated within
CTree, including the implementation of a heuristic algorithm for identifying
clusters, are outlined along with the more usual features found within general
tree viewing software. CTree is available as an executable jar file from:
http://www.manchester.ac.uk/bioinformatics/ctree.
Introduction
There are many programs available for viewing phylogenetic trees. A
comprehensive list of such programs, maintained by Joe Felsenstein, can be
found at http://evolution.genetics.washington.edu/phylip/software.html. Few,
however, specialize in the quantification of the clustering present on such
trees. Quantifying clusters is biologically important in relation to both
vaccine design and epidemiology [1, 2]. Within CTree (Fig. 3.1) clusters of
strains can be treated as unique data structures – allowing for easy
quantification and comparison. Clusters can be manually populated by the
user or alternatively by use of a novel heuristic clustering algorithm. Use of
such an algorithm removes the subjectivity of the individual when allocating
strains to individual clusters e.g. on random trees or on trees with no
previously identified clusters. CTree also incorporates other useful features
for phylogenetic analysis that are rarely included within current tree viewing
software. The input for CTree is a tree string in NEWICK format (e.g. phylip
output .ph or .phb). Trees can be saved as a pdf, in NEWICK format or as
binary files that will maintain any edits made to them. Implementation was
done using the Java SDK and so CTree is platform independent.
Novel Features
(i) Heuristically Defining Clusters: If the number of taxa present on the tree is
less than 125 a heuristic algorithm can be used to allocate individual strains
to clusters on the tree topology. This feature is useful when no previous
clusters have been published for a particular dataset, as well as for finding
clusters on random trees in a systematic manner (such trees can be used as
a control).
(ii) Manually Defining Clusters: The user can manually assign strains to
clusters. This allows for the comparison of previously identified
clusters on different trees, e.g. trees representing HIV-1 groups M and O.
Figure 3.1 Interface for CTree
Screen shot of CTree containing a neighbour joining tree of HIV-1 group M envelope
reference strains. The colours indicate clusters that were manually selected based on the
LANL HIV Sequence database designations (http://hiv-web.lanl.gov). When the heuristic
clustering algorithm is used a similar cluster set for group M is produced. For each cluster
the centre cluster strain (as indicated by the labels) is the strain with the minimum mean
pairwise distance to all other strains within the cluster.
(iii) SDR and SDV: These two statistics are used to quantify the degree of
clustering present on a tree topology. The SDR is defined as the ratio of the
mean intra-cluster pairwise distance to the mean inter-cluster pairwise
distance [3]. Low intra-cluster pairwise distances relative to inter-cluster
pairwise distances imply the presence of more defined clusters. The SDV is a
measure of the
variation within the ratio of the mean intra-cluster pairwise distance to the
mean inter-cluster pairwise distance calculated for each cluster on the tree
[1]. The lower the SDV the more symmetrical the clusters present. Together
these two statistics quantify the presence of clustering within tree topologies.
(iv) Working with Random Trees: CTree provides a random birth model of
exponential population growth for the generation of random trees. The user
can generate a single random tree or specify a number of random trees to
be generated. Terminal nodes from trees generated can be randomly
sampled. Clustering analysis can be performed on the trees with the end
result being a null distribution for the various statistics (including SDR and
SDV).
(v) Finding the Center of the Tree (COT): The COT is the point on a tree with the
smallest average distance from each of the strains on the tree. As well as
having implications for vaccine design [2] it is a useful reference point on a
tree.
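The SDR and SDV statistics described above can be sketched in Python as follows. This is an illustrative reading of the definitions rather than the code used within CTree: the nested-dictionary distance matrix, the exclusion of single-strain clusters from the SDV, and the use of the population variance are all assumptions.

```python
from itertools import combinations
from statistics import mean, pvariance

def sdr(dist, clusters):
    """Mean intra-cluster pairwise distance divided by mean inter-cluster pairwise distance.

    `dist[a][b]` is the tree distance between strains a and b;
    `clusters` is a list of lists of strain names.
    """
    label = {s: i for i, members in enumerate(clusters) for s in members}
    intra, inter = [], []
    for a, b in combinations(label, 2):
        (intra if label[a] == label[b] else inter).append(dist[a][b])
    return mean(intra) / mean(inter)

def sdv(dist, clusters):
    """Variance of the intra/inter ratio calculated separately for each cluster."""
    strains = [s for members in clusters for s in members]
    ratios = []
    for members in clusters:
        if len(members) < 2:
            continue  # a single strain has no intra-cluster pairs (assumption: skip it)
        inside = set(members)
        intra = mean(dist[a][b] for a, b in combinations(members, 2))
        inter = mean(dist[a][b] for a in members for b in strains if b not in inside)
        ratios.append(intra / inter)
    return pvariance(ratios)
```

With two clusters whose internal distances are 0.1 and 0.2 and whose strains are all 0.4 apart from those of the other cluster, the per-cluster ratios are 0.25 and 0.5, so the SDV is greater than zero even though each cluster is well separated.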
The Heuristic Clustering Algorithm
The algorithm (Fig. 3.2) is based on finding the cluster set with the minimum
SDR value [3]. The runtime is of the order of n^3, where n is the total number
of vertices within the tree. There are two phases:
(i) The Explore Phase: Initially m cluster sets for the input tree are created
during m iterations of steps 1–6, where m depends on the number of
increments applied to the threshold distance (t) used in step 2.
Step 1: Each strain on the tree is designated as a potential central cluster
strain – each representing a potential cluster.
Step 2: Potential clusters are individually populated by adding strains falling
within t from the central cluster strain.
Step 3: Potential clusters are sorted (largest first). All strains within the
largest potential cluster are removed from all other potential clusters. The
remaining potential clusters are resorted. This step is repeated until there
are no duplicated strains between clusters.
Step 4: Strains belonging to each potential cluster are checked to see
whether or not they are more suited to a different potential cluster. Suitability
is based on an individual member’s proximity to its current potential central
cluster strain and its proximity to the central cluster strains representing other
potential clusters.
Step 5: The SDR value for the remaining populated potential clusters is
calculated and stored.
Step 6: t is incremented by a predefined amount.
(ii) The Selection Phase: In the selection phase the potential cluster set that
produced the lowest SDR value is selected as the cluster set to represent
the input tree.
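The two phases can be sketched in Python as below. This is a simplified illustration under stated assumptions and not the CTree implementation: distances are supplied as a nested dictionary, ties between equally large potential clusters are broken by strain name, a potential cluster whose central strain has been absorbed by a larger cluster is discarded, and thresholds that give an undefined SDR (a single cluster, or no cluster with two or more strains) are skipped.

```python
from itertools import combinations
from statistics import mean

def sdr(dist, clusters):
    """Mean intra-cluster distance divided by mean inter-cluster distance."""
    label = {s: i for i, members in enumerate(clusters) for s in members}
    intra, inter = [], []
    for a, b in combinations(label, 2):
        (intra if label[a] == label[b] else inter).append(dist[a][b])
    return mean(intra) / mean(inter)

def clusters_for_threshold(dist, t):
    """Steps 1 to 4 for a single threshold distance t."""
    strains = sorted(dist)
    # Steps 1-2: every strain seeds a potential cluster of the strains within t of it.
    potential = {c: {s for s in strains if s == c or dist[c][s] <= t} for c in strains}
    # Step 3: keep the largest potential cluster, strip its strains from the
    # others, and repeat until no potential clusters remain.
    centres = []
    while potential:
        centre = max(potential, key=lambda c: (len(potential[c]), c))
        members = potential.pop(centre)
        centres.append(centre)
        potential = {c: m - members for c, m in potential.items() if c not in members}
    # Step 4: move each strain to the nearest surviving central strain.
    final = {c: [] for c in centres}
    for s in strains:
        nearest = min(centres, key=lambda c: (0.0 if c == s else dist[c][s], c))
        final[nearest].append(s)
    return list(final.values())

def heuristic_clusters(dist, thresholds):
    """Explore every threshold (steps 5-6), then select the lowest-SDR cluster set."""
    best = None
    for t in thresholds:
        clusters = clusters_for_threshold(dist, t)
        if len(clusters) < 2 or all(len(c) < 2 for c in clusters):
            continue  # SDR is undefined for this cluster set
        score = sdr(dist, clusters)
        if best is None or score < best[0]:
            best = (score, clusters)
    return best
```

The function returns the lowest SDR found together with the corresponding cluster set, or `None` if no threshold produced a cluster set for which the SDR is defined.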
Figure 3.2 Clustering Sequences on a Tree
Diagrammatic representation of the algorithm implemented within CTree for automatically
finding clusters. The upper dashed box corresponds to phase 1 of the algorithm while the
lower dashed box corresponds to phase 2.
Standard Features
CTree provides the user with many standard tree viewing features including:
(i) re-rooting the tree, (ii) obtaining statistical information such as pairwise
distances between strains, (iii) swapping the order of sibling strains, (iv)
manually removing strains from the tree, (v) removing strains randomly/non-
randomly from the tree, (vi) an improved search interface that allows the
user to color strains based on search criteria, (vii) basic coloring of the tree,
(viii) loading multiple trees from a file containing more than one tree and
conveniently scrolling through them, (ix) allowing the user to obtain lists of
strains within a user specified proximity to each other and (x) allowing the
user to define the distance covered by the scale bar associated with the tree.
Typical Usage
With the range of tree viewing features available within CTree the user can
perform all of the operations required to create a
publishable phylogenetic tree. The novel features of CTree do not
complicate this process. A typical usage is illustrated in the study of the
clusters present within HIV-1 group M and O data [1]. Here CTree was used
to manually divide multiple trees from each of the group M and O datasets
into clusters. SDR and SDV values calculated from these clusters were then
compared to each other as well as to heuristically defined clusters present
on 1000 randomly generated trees. The end result was the production of a
definitive model of HIV-1 subtype emergence and the highlighting of the
misleading nature of the group M subtype classification system when used at
the center of the pandemic.
Acknowledgements
We are very grateful to Andy Rambaut for his insightful comments and his
advice concerning the generation of random trees. We would also like to
thank John Pinney for related discussions. JA is supported by a BBSRC
studentship.
References
1. Archer, J. and D.L. Robertson, Understanding the diversification of HIV-1 groups M and O. Aids, 2007. 21(13): p. 1693-700.
2. Nickle, D.C., et al., Consensus and ancestral state HIV vaccines. Science, 2003. 299(5612): p. 1515-8; author reply 1515-8.
3. Rambaut, A., et al., Human immunodeficiency virus. Phylogeny and the origin of HIV-1. Nature, 2001. 410(6832): p. 1047-8.
Chapter 4: Understanding the Diversification of HIV-1 groups M and O
Abstract
Objective: To quantify the similarity (or lack of) between the phylogenetic
substructure of HIV-1 groups O and M.
Methods: Two phylogenetic tree statistics, the subtype diversity ratio (SDR)
and the subtype diversity variance (SDV), were used in conjunction with
bootstrap replicates on gag, pol and env sequence alignments of group O
and M strains. Randomly generated phylogenetic trees were used as a
control.
Results: We show that as expected the established global-group M
subtypes have a high degree of phylogenetic symmetry in relation to each
other in terms of inter- and intra-subtype diversification. They are
significantly different from the substructure present amongst the random
trees. In contrast, the group O diversification does not display this highly
symmetrical substructure and is not significantly different from the
substructure present on randomly generated trees. Phylogenies comprised
of group M strains from the epicentre of the HIV/AIDS pandemic, the
Democratic Republic of Congo (DRC), exhibit a substructure more similar to
group O than to global-group M.
Conclusions: The substructure present within groups O and M is
quantifiably different. The well-defined clades, the subtypes that
characterize the group M diversification, are not present in group O or
amongst group M strains from the DRC. The group M subtypes are thus
unique and a signature of pandemic HIV-1.
Introduction
Three different cross species transmission events involving Simian
Immunodeficiency Virus (SIV) from apes have resulted in three distinct
phylogenetic lineages of HIV-1 in humans [1]. These lineages, termed
“groups”, are labeled M (main), O (Outlier) and N (non-M/O) [2]. Group M is
almost entirely responsible for the global HIV/AIDS pandemic and was
established in the human population as the result of a single cross species
transmission event of SIV from the chimpanzee subspecies Pan troglodytes
troglodytes [3, 4]. Group O has remained endemic to Cameroon and is also
probably the result of a single cross species transmission event involving P.
t. troglodytes. SIVs more closely related to HIV-1 group O have been
recently isolated from the gorilla subspecies Gorilla gorilla gorilla but, as with
humans, it is most likely that they have been acquired from P. t. troglodytes
[1]. HIV-1 group N also most probably originated from P. t. troglodytes [5]
and has been restricted to a few Cameroonians [6].
The HIV-1 strains that comprise group M cluster into distinct and strongly
supported clades in phylogenetic trees. These clusters are referred to as
“subtypes” and have been labeled A to D, F to H, J and K [2]. “E” and “I”
have been relabeled as “Circulating Recombinant Forms” CRF01 and
CRF04, respectively, of which 34 are now designated (LANL HIV Sequence
database, http://hiv-web.lanl.gov). These CRFs are intersubtype
recombinants descended from the same recombination event(s) that infect
multiple individuals and form distinct phylogenetic clusters. With the
exception of subtype K, which is restricted to Central Africa, the subtypes are
distributed within one or more risk groups on a global scale (LANL HIV
Sequence database). Contribution to the group M pandemic by each of the
subtypes varies greatly with C (47%), A (27%) and B (12%) being the most
prevalent [7].
It has been estimated that the most recent common ancestor for all group M
strains existed during the early 1930s [8] and that the DRC was the most
probable location of this strain [9]. Uniquely for a geographical region, strains
related to all nine subtypes have been identified in the DRC along with a
large number of unclassified strains. A high degree of intra-subtype diversity
has also been observed within the region with many strains falling basal to
the subtype ancestral node [10]. The latter study further showed that group
M strains from the DRC had very little organized substructure compared to
their global counterparts, and surmised that the highly structured global
subtypes were a result of chance exportations of DRC strains to new
susceptible global risk groups.
In contrast to group M, strains belonging to group O have been
geographically restricted to Cameroon where in the mid-nineties the group
was responsible for only 1% of HIV-1 infections in the far north of the
country, while in the capital it was responsible for about 6% [11]. Other
studies have shown that in certain areas of the country the prevalence is
lower. For example in the Northwest less than 1% of HIV infections were
observed to be caused by group O [12]. Very few isolated cases outside of
Cameroon have been documented [13] and all have been epidemiologically
linked to Cameroon. It has been estimated that group O’s most recent
common ancestor was present in the human population at about the same
time as group M’s (the 1930s with some considerable error) based on
comparable sequence divergence [14] and more detailed dating-analysis
[15].
The significance of the HIV-1 subtypes, and whether or not group O can be
subdivided into clusters that are biologically equivalent to group M subtypes,
is unclear [14, 16-18]. Resolution of this issue will have implications for any
future group O vaccine design and possibly for drug based intervention
strategies [19]. In this study we have addressed this question by using
bootstrapping and two phylogenetic tree metrics that quantitatively measure
the extent of strain clustering. We used these metrics to compare trees
constructed from HIV strains from groups M and O. Randomly constructed
phylogenetic trees were used in the analysis as a control. We extend our
comparative study to group M strains from the DRC and generate a model
for HIV-1 diversification. We do not analyze group N as very few strains, let
alone sequences, are available.
Methods
Tree Metrics. For a phylogenetic tree the subtype diversity ratio (Fig. 4.1,
Panel A), SDR, is defined as the ratio of the mean within-cluster (intra)
pairwise distance to the mean between-cluster (inter) pairwise distance [10].
The SDR is therefore a quantitative measure of the extent of clustering found
within a tree. Low intra-cluster pairwise distances relative to inter-cluster
pairwise distances imply more defined clustering in the tree. Thus trees
with lower SDR values are characterized by well defined clusters. An SDR
approaching one would indicate a lack of clustering in the tree. As
the SDR does not take into account the variability that can occur between
individual clusters the subtype diversity variance (Fig. 4.1, Panel B), SDV,
was devised. The SDV statistic is a measure of the variation within the ratio
of the mean intra-cluster pairwise distance to the mean inter-cluster pairwise
distance calculated for each cluster on the tree. The lower the SDV value
the more symmetrical, or equidistant, the clusters in a tree are relative to
each other.
Figure 4.1 Subtype Diversity Ratio and Subtype Diversity Variance
(A) The relationship between the SDR and the quality of clustering on a phylogenetic tree as
defined in [10]. (B) The relationship between the SDV and the variability of the quality of
clustering.
Datasets. To investigate group O and “global” group M phylogenies,
sequences from the p24 (gag), p32 (pol) and gp160 (env) genomic regions
were analysed. For each region all of the group O sequences available in
the LANL HIV Sequence Database were obtained. Sequences belonging to
each region were aligned separately using CLUSTAL W [20]. Gap-containing
sites in the alignments were excluded. To exclude clones and
sequences likely to be from the same host, sequences that were related by
more than 95% were identified and, for any such pairs or groups, only one
random representative was retained. The final group O datasets contained
54, 46 and 53 sequences, respectively. For the global group M data,
because there were far more sequences available, a sampling process was
implemented. Strains from the center of the pandemic (DRC) were removed
from this dataset. For each region 110 strains were sampled 100 times from
the LANL HIV Sequence Database. Each of these datasets was then
processed in a similar manner to the group O datasets.
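The removal of closely related sequences can be illustrated with the following Python sketch. It is an assumed reading of the procedure rather than the code actually used: relatedness is taken to be the proportion of identical positions in the gap-stripped alignment, and groups are formed by linking any two sequences above the threshold (single linkage), which the text does not specify.

```python
import random

def identity(a, b):
    """Fraction of identical positions between two equal-length aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def thin_alignment(alignment, threshold=0.95, rng=None):
    """Keep one random representative from each group of closely related sequences."""
    rng = rng or random.Random()
    names = list(alignment)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Link every pair of sequences that is related by more than the threshold.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(alignment[a], alignment[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    keep = {rng.choice(members) for members in groups.values()}
    return {n: alignment[n] for n in names if n in keep}
```

Passing a seeded `random.Random` instance as `rng` makes the choice of representatives reproducible.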
For the group M DRC data, 195 partial env sequences (V3-V5 region)
sampled in 1997 [9] along with 56 sequences sampled during the mid-1980s
[21] were obtained from the LANL HIV Sequence Database. After
designated recombinants were removed, leaving a total of 230 sequences,
these two datasets were aligned. Columns containing gaps were stripped
from the alignment. A phylogenetic tree was inferred as described below.
The global gp160 sequences from the group M representative tree (see next
paragraph), 288 sequences sampled in 2002 [22] and two sequences
described in [23] were then aligned with the mid-1980s and 1997 sequences
so that a phylogenetic tree that represented all of the available group M
diversity could be inferred.
Phylogenetic Analysis. The Phylogenetic Analysis Library [24] was used to
create neighbor joining trees for each of the datasets described. The HKY85
model of nucleotide substitution was used with a transition/transversion ratio
of 2. Each tree was divided into clusters from which SDR and SDV values
were calculated. The software implemented to do this analysis (CTree) is
available at http://www.manchester.ac.uk/bioinformatics/resources. The
identification of group O clusters was based on previous analysis [18], while
the group M subtypes were based on the designations that were present in
the LANL HIV sequence database. For each of the group M p24, p32 and
gp160 regions the tree with the SDR value closest to the overall mean SDR
for a given region was chosen as a representative tree for that region.
Bootstrap analysis using PAUP* [25] was performed and the number of
bootstrap replicates out of 1,000 supporting each of the subtypes in these
representative trees was recorded. For the three group O trees bootstrap values
for the defined clusters were obtained in a similar way.
Random Trees. 1,000 random trees each containing 1,000 terminal nodes
were generated using a random birth model of exponential population
growth. In order to simulate the random sampling of HIV strains from larger
host populations 50 strains were randomly selected from each of the trees.
The trees were divided into clusters using a heuristic algorithm that
minimized the SDR – a similar process to the one described in [10]. SDR
and SDV values were then calculated for each of the trees. CTree was
also used to generate the trees, perform the sampling and divide the trees
into clusters.
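A random birth process followed by tip sampling can be sketched in Python as below. The sketch assumes a pure-birth (Yule) process, in which every lineage splits at the same constant rate so that the expected number of lineages grows exponentially; whether this matches the model implemented in CTree in every detail is not stated in the text. The parent-pointer representation and the function names are illustrative.

```python
import random

def yule_tree(n_tips, birth_rate=1.0, rng=None):
    """Grow a pure-birth tree until it has n_tips lineages.

    Returns (parent, depth, tips): parent[node] is the parent id (None for
    the root), depth[node] is the time at which the node's branch ends,
    and tips lists the ids of the terminal nodes.
    """
    rng = rng or random.Random()
    parent = {0: None}
    depth = {}
    active, now, next_id = [0], 0.0, 1
    while len(active) < n_tips:
        # With k lineages splitting at the same rate, the wait is Exp(k * rate).
        now += rng.expovariate(len(active) * birth_rate)
        node = active.pop(rng.randrange(len(active)))
        depth[node] = now
        for _ in range(2):
            parent[next_id] = node
            active.append(next_id)
            next_id += 1
    # Assumption: tips grow until the moment the next split would have occurred.
    now += rng.expovariate(len(active) * birth_rate)
    for tip in active:
        depth[tip] = now
    return parent, depth, active

def tip_distance(parent, depth, a, b):
    """Path length between two nodes through their most recent common ancestor."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent[node]
    node = b
    while node not in ancestors:
        node = parent[node]
    return depth[a] + depth[b] - 2.0 * depth[node]
```

Sampling 50 strains from a 1,000-tip tree then amounts to `rng.sample(tips, 50)`, with `tip_distance` supplying the pairwise distances needed for the SDR and SDV calculations.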
Results
The representative tree (Fig. 4.2, Panel A) for the group M envelope region
displays the characteristic ‘starburst’ structure that apparently defines this
group’s phylogenetic diversification. This phylogenetic substructure is
characterized by a double-star phylogeny, i.e., a tendency for long branch
lengths within subtypes that coalesce near the ancestral node of the subtype
and long pre-subtype branches that coalesce near the root of the entire tree
[26]. As a result strains within any given subtype are always more closely
related to each other than they are to strains belonging to a different
subtype. Subtypes that were supported by a bootstrap value of 90% or
greater have been marked with an ‘*’. The trees for the p24 and p32 regions
(Appendix I, Panels A and B) also exhibit this starburst tendency and high
bootstrap support for the designated subtypes. The mean SDR values for
the group M p24, p32 and gp160 regions were 0.47 (±0.004), 0.50 (±0.003)
and 0.49 (±0.002), respectively. The global-group M SDR distribution is
significantly lower than the distribution produced by the randomly generated
trees (Fig. 4.3), indicating that the global-group M subtypes are significantly
better defined than would be expected under a random process. The group
M SDV values for the same regions were 0.014 (±0.001), 0.012 (±0.0005)
and 0.017 (±0.0002). These low variances reflect the highly symmetrical or
equidistant nature of the group M phylogenetic substructure.
Figure 4.2 Phylogenetic History of HIV-1 Group M and Group O
Inference of the evolutionary histories (phylogenetic trees) of HIV-1 groups M and O. Panel
A displays a tree that was re-constructed using global-group M envelope gp160 sequences.
In panel B the tree was re-constructed using group O gp160 sequences. In panel C the tree
was re-constructed using group M envelope V3-V5 sequences isolated from the DRC region
in 1997 [10]. The ‘*’ indicates bootstrap support greater than 90%. The scale bar
corresponds to nucleotide substitutions per site. In panels A and C the bold lettering
corresponds to the subtype designations from the LANL HIV Sequence Database. In panel
B the bold numbers represent the clusters that correspond to previously proposed clusters
[18].
Figure 4.3 Subtype Diversity Ratio Distributions
Frequency distribution of the SDR statistic for global group M and random phylogenies, and
individual group O and DRC phylogenies. The dark grey bars represent the distribution of
SDR values obtained from the global group M data across the p24, p32 and gp160 regions
of the genome. The light grey bars represent the distribution of SDR values that were
obtained from the randomly generated phylogenetic trees. “O” is the location of the mean
SDR value obtained from group O phylogenies for each of the p24, p32 and gp160 regions
of the genome. “M_DRC” is the SDR value obtained from a group M phylogeny from the DRC
region in 1997 for the V3-V5 region of the genome.
In the group O phylogenetic tree for the gp160 region (Fig. 4.2, Panel B) the
highly symmetrical starburst tendency is absent. The topology consists of
weakly defined clusters with the cluster ancestral nodes falling deep within
the tree. This weak clustering is maintained across the p24 and p32 regions
of the genome (Appendix II, Panels A and B). The SDR values for the p24,
p32 and gp160 regions were 0.58, 0.55 and 0.58 respectively. In each case
these are significantly larger than the corresponding group M values (t-test, p
< 0.001) confirming the weak clustering. Furthermore the mean group O
SDR value of 0.57 (labeled by “O”) falls inside the SDR distribution produced
by the random trees (Fig. 4.3). For group O the p24, p32 and gp160 regions
produced SDV values of 0.07, 0.03 and 0.03, respectively, indicating a lack
of symmetry between the defined clusters. These values are all significantly
larger than the corresponding group M values implying that the group O
clusters are quantifiably less symmetrical than their group M counterparts.
As a result of the weakly defined group O clusters it is not possible to be
confident that any two individual strains from a given cluster are more closely
related to each other than they are to strains from a different cluster. For
example, strains O.CM.98.98CMA123 and O.CM.96.96CMABB009 both
belong to cluster III and have an evolutionary distance between them of
0.188 nucleotide substitutions per site. However, the distance between
O.CM.98.98CMA123 and O.CM.98.98CMA307, which belongs to cluster II,
is 0.185 nucleotide substitutions per site. Bootstrap support for the group O
cluster ancestral node is also not as consistent as for the group M trees. For
example on the tree representing the p32 region only two clusters have a
bootstrap support of more than 90%.
We also performed the SDR/SDV analysis on phylogenies inferred from the
V3-V5 region of env. 100 truncated group M sequences (corresponding to
the strains in Fig. 4.2, Panel A) and all 97 group O sequences available in
this region were used. The SDR for group O was 0.632 while the SDV was
0.033, and for group M the values were 0.54 and 0.01, respectively. The
group O SDR value still falls within the random tree SDR distribution
presented in Fig. 4.3 while the group M SDR value fell outside 95% of the
random values. In the case of the group M tree (not shown), clusters
representing subtypes were still clearly defined (as in Fig. 4.2, Panel A).
This was not the case for the group O trees (not shown) where the topology
was similar to Fig. 4.2, Panel B. Note that we do not analyze complete genome
group O sequences as there are only 21 available. These are not a
representative sample of group O diversity and this number is also too low
for reliable testing.
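The comparison of an observed SDR value against the random-tree distribution can be sketched as below. This is a minimal illustration, not the procedure used in this study: it assumes that "outside 95% of the random values" means outside a central 95% interval, the function names are hypothetical, and the null values are evenly spaced placeholders rather than SDR values from real random trees.

```python
def central_interval(values, coverage=0.95):
    """Return (low, high) bounds enclosing the central `coverage` share of values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    low_index = int(round(tail * n))
    high_index = n - 1 - low_index
    return ordered[low_index], ordered[high_index]


def falls_outside(observed, null_values, coverage=0.95):
    """True if the observed statistic lies outside the central interval."""
    low, high = central_interval(null_values, coverage)
    return observed < low or observed > high


# Placeholder null distribution (NOT thesis data): 200 values from 0.600 to 0.799.
null_sdr = [0.60 + 0.001 * i for i in range(200)]

print(falls_outside(0.632, null_sdr))  # value inside the placeholder interval
print(falls_outside(0.54, null_sdr))   # value below the placeholder interval
```

With real data, `null_sdr` would hold one SDR value per randomly generated tree.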
Fig. 4.2, Panel C shows the tree re-constructed from the group M data for
the V3-V5 region sampled from the DRC [9, 21]. The characteristic well
defined starburst shape that defines the global-group M tree is absent. This
is due to the shorter evolutionary branch lengths that define the subtypes A,
D, J, G, F and K being less prominent. In Fig. 4.3 the SDR value of 0.68
(labeled “M_DRC”), obtained from the 1980’s and 1997 DRC strains, falls
inside the SDR distribution produced by the random trees and significantly
outside the SDR distribution produced by the global-group M, consistent with
Rambaut et al.’s previous study [10]. However the SDV value for the group
M DRC data (0.0045) was significantly lower than the SDV produced by the
global-group M data indicating the presence of older more established
endemic lineages. In a similar way to the group O clusters and unlike the
global-group M subtypes, two individual strains belonging to the same
subtype are not necessarily more closely related to each other than they are
to a strain belonging to a different subtype. For example strains
A.CD.97.MBS26 and A.CD.97.KP28 both labeled as subtype A have an
evolutionary distance between them of 0.193 nucleotide substitutions per
site. However the distance between A.CD.97.KP28 and D.CD.97.KS2,
which is labeled as subtype D is less (0.169 nucleotide substitutions per
site). In Fig. 4.4, the strains that make up Fig. 4.2, Panels A and C are
combined with the 288 DRC sequences sampled in 2002 [22] as well as the
2 strains described in [23]. The SDR value for this tree was 0.67 while the
SDV was 0.009. Strains that have been isolated from within the DRC region
(black) can be observed to fall deep within the tree in relation to the clusters
of globally isolated strains (red). Indeed the diversity within the subtypes
after the inclusion of the strains from the DRC has increased markedly.
Figure 4.4 The Center of the Group M Pandemic
Inference of the evolutionary history of global group M and DRC HIV-1 strains. The black
branches correspond to DRC data from [9, 21-23]. The red branches correspond to group
M global data seen in Fig. 4.2, Panel A.
Discussion
In this study we have demonstrated significant differences between strains
falling into HIV-1 groups M and O in relation to their pattern of clustering in
phylogenetic trees. In contrast to group M the significantly larger SDR
values (Fig. 4.3) produced by the previously proposed [17, 18] group O
clusters (Fig. 4.2, Panel B), defined for the gag, pol and env regions of the
genome, confirm and quantify their weaker structure relative to the distinct
group M subtypes (Fig. 4.2, Panel A). The lack of equidistant group O
clusters or lack of a ‘starburst’ structure is observed in Fig. 4.2, Panel B, and
is quantified by the significantly higher SDV values when compared to group
M. It would seem that the group O phylogeny reflects an epidemiology that
is dependent on host transmissions on a highly localized (endemic) scale
[26]. As a result the branches leading to the group O clusters tend to be
deeper within the trees resulting in shorter branch lengths leading to the
individual clusters. This also results in weaker bootstrap support for the
clusters.
In contrast, the group M global subtypes have emerged during the
course of the complicated epidemiological history arising from the HIV/AIDS
pandemic. The nature of the global subtype emergence is quantified by the
significant difference obtained between the group’s SDR distribution and the
SDR distribution produced from tree topologies generated under a random
model of exponential population growth (Fig. 4.3). The latter represents
phylogenetic topologies that would be expected within an exponentially
expanding endemic viral population where founder effects are absent [10].
The distinctness of the global-group M subtypes is thus a direct result of the
history of the pandemic and the spread of HIV/AIDS across the world. The
absence of strong founder effects followed by relatively isolated
diversification in relation to group O has dramatically reduced the extent of
strain clustering within this group’s phylogeny. As a result the group O
cluster SDR values fall significantly inside the values produced by clusters
defined on randomly generated trees (Fig. 4.3).
In light of these quantifiable differences that exist between the phylogenies
of HIV-1 groups M and O, we propose in agreement with earlier work [14]
that, despite both groups being in existence for a similar length of time, it is
not appropriate to draw direct parallels between the proposed clusters that
exist within the group O phylogeny and the well established global subtypes
that exist within the group M phylogeny. Group O clusters do exist but they
have evolved on a highly localized scale and are less distinct and less
epidemiologically informative than group M subtypes.
In a similar manner to group O, group M phylogenies including sequences
from Central Africa, the epicentre of the pandemic (reviewed in [27]), do not
exhibit the strong effect of strain exportation (founder effects). This has
resulted in the presence of weaker clusters within phylogenetic trees re-
constructed from strains sampled within the region and is observed in Fig.
4.2, Panel C where subtype ancestral nodes fall deep within the tree. This
weaker clustering of strains is quantified by the SDR value calculated from
the tree, which like the values for group O falls significantly within the SDR
distribution from the random trees (Fig. 4.3). When global-group M strains
are added to a phylogenetic tree, re-constructed from HIV strains from the
DRC region, they continue to form tight clusters amongst the more diverse
DRC strains (Fig. 4.4). Many of the DRC strains fall on the pre-subtype
branches that previously distinguished the global subtypes and thus the
clear distinction between the subtypes is lost. As a consequence within the
DRC region the current global subtype classification system is not
particularly meaningful.
Fig. 4.5 depicts our proposed model of how the current global group M
subtypes diversified. The importance of different globally isolated host sub-
populations (circled in grey) as well as recombination is emphasized. In the
model after the initial successful cross species transmission event, a huge
diversification occurred within the relatively isolated DRC region resulting in
loosely defined divergent lineages that reflect localized viral epidemiology.
The initial founder effects caused by the random exportation of strains out of
the epicentre into previously unexposed globally isolated host populations,
followed by subsequent rapid diversification within such populations, has
resulted in the very well defined global subtypes with a large genetic
distance between each. Note that at no point is recombination absent,
rather it occurs across a continuum of sequence divergence – between
identical strains, strains of the same subtype and from different subtypes –
with varying degrees of probability related to the sequence identity and other
factors [28-33]. Occasionally recombinants will form epidemiologically
significant clusters, for example the CRFs [2]. These can be the result of
relatively recent recombination such as the CRF10_CD, which is found in
Tanzania or as a result of much older recombinant events that can be traced
back to the early days of the pandemic, for example CRF02_AG, which can
be found in Central Africa. In such a case being a CRF is indistinguishable
from being a subtype, in that they merely represent an exportation event
from the DRC that involved a recombinant that happened to be relatively
more closely related to the strain(s) that formed subtypes A and G.
According to our model the global classification system cannot reflect the
extent of diversity present within the epicentre of the pandemic.
Figure 4.5 Model of HIV-1 Group M Subtype Emergence
Diagrammatic model representing the evolution and diversification of HIV-1 group M. Each
circle represents a defined phylogenetic grouping. The central circle represents the
epicentre of the group M pandemic focused on the DRC region. Note this is merely a
representation as the epicentre will not follow country boundaries. The grey circles
represent random strains that were exported from the epicentre. These strains are linked to
the founder effects in the new host populations (represented by the outer circles) in different
geographic regions. The bottleneck event that occurred during the initial colonization of
these new populations resulted in the apparently long evolutionary branches (dotted lines)
that lead to each of the current group M global subtypes. Inclusion of sequences from
Central Africa infections (particularly the DRC) blurs this distinction because so many strains
fall on the pre-subtype branches. Within each isolated host population the evolutionary
factors driving diversification will be similar and include the frequent occurrence of viral
mutation, replication recombination [28-33], and positive/diversifying selection [34, 35].
The subtype classification system has been devised through the sampling of
strains globally and presently strains from the DRC region are classified
according to the global subtype that they fall closest to. However the
distinctness of these “globally”-defined “pre-DRC” data subtypes is due to
individual strains being exported from the DRC region. They do not
accurately represent the diversity that is present within the centre of the
pandemic. Classifying strains from the DRC region according to the current
global classification system misrepresents this extensive diversity and is
potentially misleading. For example, a proposal for putative new subtypes
within the DRC region, for strains 83CD003 and 90CD.121E12 [23], should
be treated with caution and not labeled as a subtype [22] at this time. Such
divergent strains (Fig. 4.4) are a consequence of the diversity of HIV-1 in
Central Africa and as such are not epidemiologically significant in the same
way as the pandemic-associated clusters. The HIV-1 diversity in the DRC
represents a continuum of genetic variation, with the previous distinctness of
the subtypes an artifact of both (1) biased sampling, and (2) strong founder
events as viruses were exported outside of the DRC region, seeding the
HIV/AIDS pandemic (Fig. 4.5). In this context, the identification of 83CD003
and 90CD121E12 as a “new clade” (and as a potential subtype if a related
strain is identified) is of little consequence. The point is not that these DRC
infections are unimportant but that any lineage, such as the one identified by
Mokili et al. [23], should be recognized as a small part of the overall diversity
present in Central Africa, even when it forms an apparently unique lineage.
Crucially, conclusions concerning global-group M subtypes in relation to
vaccine programs using subtype consensus sequences and geographical
knowledge, cannot be directly transferred to the weakly supported group O
clusters or the poorly defined clusters present in the DRC region. When
choosing strains for the generation of a potential consensus vaccine [19] it
will be important to focus on sequences that are relevant to a particular
geographic region that are circulating at the present time, and not bias the
inference by the inclusion of highly divergent strains. Indeed subtypes given
their historical diversity will often be of limited relevance to the strains that
are likely to be circulating in the future and for which a hypothetical vaccine
will have to elicit immunity to. A more predictive approach needs to be taken
if rational vaccine design is to become a reality.
In conclusion, the significance of the HIV-1 group M subtypes has been a
puzzle for many years. Here we present a model that is a definitive
representation of HIV diversification and evolution. Any subtype specific
differences if they exist will have most probably arisen outside of the DRC
and will be the result of passaging of the virus in association with different
risk groups or the result of frequent passaging in association with relatively
large infected populations. When analysing HIV particularly in the case of
vaccine design it is crucial that the consideration of its diversity be grounded
in an explicit evolutionary framework.
Acknowledgements
We are very grateful to Andy Rambaut for his insightful comments and his
advice concerning the generation of random trees. We also wish to thank
Michael Worobey and John Pinney for helpful comments and discussion. JA
is supported by a BBSRC studentship.
References
1. Van Heuverswyn, F., et al., Human immunodeficiency viruses: SIV infection in wild gorillas. Nature, 2006. 444(7116): p. 164.
2. Robertson, D.L., et al., HIV-1 nomenclature proposal. Science, 2000. 288(5463): p. 55-6.
3. Bibollet-Ruche, F., et al., Complete genome analysis of one of the earliest SIVcpzPtt strains from Gabon (SIVcpzGAB2). AIDS Res Hum Retroviruses, 2004. 20(12): p. 1377-81.
4. Gao, F., et al., Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature, 1999. 397(6718): p. 436-41.
5. Keele, B.F., et al., Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science, 2006. 313(5786): p. 523-6.
6. Yamaguchi, J., et al., Identification of HIV type 1 group N infections in a husband and wife in Cameroon: viral genome sequences provide evidence for horizontal transmission. AIDS Res Hum Retroviruses, 2006. 22(1): p. 83-92.
7. Takebe, Y., S. Kusagawa, and K. Motomura, Molecular epidemiology of HIV: tracking AIDS pandemic. Pediatr Int, 2004. 46(2): p. 236-44.
8. Korber, B., et al., Timing the ancestor of the HIV-1 pandemic strains. Science, 2000. 288(5472): p. 1789-96.
9. Vidal, N., et al., Unprecedented degree of human immunodeficiency virus type 1 (HIV-1) group M genetic diversity in the Democratic Republic of Congo suggests that the HIV-1 pandemic originated in Central Africa. J Virol, 2000. 74(22): p. 10498-507.
10. Rambaut, A., et al., Human immunodeficiency virus. Phylogeny and the origin of HIV-1. Nature, 2001. 410(6832): p. 1047-8.
11. Mauclere, P., et al., Serological and virological characterization of HIV-1 group O infection in Cameroon. Aids, 1997. 11(4): p. 445-53.
12. Yamaguchi, J., et al., HIV infections in northwestern Cameroon: identification of HIV type 1 group O and dual HIV type 1 group M and group O infections. AIDS Res Hum Retroviruses, 2004. 20(9): p. 944-57.
13. Quinones-Mateu, M.E., S.C. Ball, and E.J. Arts, Role of Human Immunodeficiency Virus Type 1 Group O in the AIDS Pandemic. AIDS Rev, 2000. 2: p. 190–202.
14. Roques, P., et al., Phylogenetic analysis of 49 newly derived HIV-1 group O strains: high viral diversity but no group M-like subtype structure. Virology, 2002. 302(2): p. 259-73.
15. Lemey, P., et al., The molecular population genetics of HIV-1 group O. Genetics, 2004. 167(3): p. 1059-68.
16. Loussert-Ajaka, I., et al., Variability of human immunodeficiency virus type 1 group O strains isolated from Cameroonian patients living in France. J Virol, 1995. 69(9): p. 5640-9.
17. Yamaguchi, J., et al., Near full-length genomes of 15 HIV type 1 group O isolates. AIDS Res Hum Retroviruses, 2003. 19(11): p. 979-88.
18. Yamaguchi, J., et al., Evaluation of HIV type 1 group O isolates: identification of five phylogenetic clusters. AIDS Res Hum Retroviruses, 2002. 18(4): p. 269-82.
19. Heeney, J.L., A.G. Dalgleish, and R.A. Weiss, Origins of HIV and the evolution of resistance to AIDS. Science, 2006. 313(5786): p. 462-6.
20. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80.
21. Kalish, M.L., et al., Recombinant viruses and early global HIV-1 epidemic. Emerg Infect Dis, 2004. 10(7): p. 1227-34.
22. Vidal, N., et al., Distribution of HIV-1 variants in the Democratic Republic of Congo suggests increase of subtype C in Kinshasa between 1997 and 2002. J Acquir Immune Defic Syndr, 2005. 40(4): p. 456-62.
23. Mokili, J.L., et al., Identification of a novel clade of human immunodeficiency virus type 1 in Democratic Republic of Congo. AIDS Res Hum Retroviruses, 2002. 18(11): p. 817-23.
24. Drummond, A. and K. Strimmer, PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 2001. 17(7): p. 662-3.
25. Swofford, D., PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). 2003, Sinauer Associates, Sunderland, Massachusetts.
26. Grenfell, B.T., et al., Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 2004. 303(5656): p. 327-32.
27. Worobey, M., Global HIV/AIDS Medicine. The Origins and Diversification of HIV (2007), ed. M.L. Sande, J. Volberding, P. Greene, WC. 2006: Elsevier, Philadelphia.
28. Baird, H.A., et al., Sequence determinants of breakpoint location during HIV-1 intersubtype recombination. Nucleic Acids Res, 2006. 34(18): p. 5203-16.
29. Fang, G., et al., Recombination following superinfection by HIV-1. Aids, 2004. 18(2): p. 153-9.
30. Pernas, M., et al., A dual superinfection and recombination within HIV-1 subtype B 12 years after primoinfection. J Acquir Immune Defic Syndr, 2006. 42(1): p. 12-8.
31. Robertson, D.L., et al., Recombination in HIV-1. Nature, 1995. 374(6518): p. 124-6.
32. Shriner, D., et al., Pervasive genomic recombination of HIV-1 in vivo. Genetics, 2004. 167(4): p. 1573-83.
33. Taylor, J.E. and B.T. Korber, HIV-1 intra-subtype superinfection rates: estimates using a structured coalescent with recombination. Infect Genet Evol, 2005. 5(1): p. 85-95.
34. Choisy, M., et al., Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol, 2004. 78(4): p. 1962-70.
35. Yang, W., J.P. Bielawski, and Z. Yang, Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. J Mol Evol, 2003. 57(2): p. 212-21.
Chapter 5: Prediction of HIV-1 Recombination Breakpoint Location
Abstract
In retroviruses recombination occurs during negative strand DNA synthesis
as a result of switching between genomic RNA templates in a process
known as copy choice. By analyzing 149 previously described recombinant
sequences [1] it was observed that the frequency of breakpoint location was
2.7 times lower than the random expectation within regions of poor
sequence identity. We use this relationship to define a probabilistic model of
in vitro HIV-1 breakpoint prediction. Using previously described parental
sequences, breakpoint locations predicted by the model for the gp120 region
were not significantly different from the in vitro generated breakpoint
locations. The model was then used to generate a probabilistic expectation
for the distribution of breakpoints across the entire HIV-1 genome. It was
observed, with the exception of the regions either side of the envelope gene
and short stretches within the gag and pol genes, that the distribution of
breakpoints did not depart significantly from the distribution obtained from
recombinant strains isolated from individuals during the course of the global
HIV-1 group M pandemic. In genomic regions where the predicted and
observed distributions did not significantly depart from each other, we
propose that a purely probabilistic process is sufficient to explain breakpoint
distribution. Furthermore we propose that the key recombination events in
terms of HIV persistence are those that result in the envelope region being
recombined between strains in a viral population.
Introduction
Owing to a number of biological factors including a high mutation rate [2],
rapid viral turnover [3] and a high recombination rate [4], the diversity seen
between strains of HIV-1 group M (HIV Los Alamos Sequence database) is
astounding when compared to other rapidly evolving viral genomes such as
influenza [5]. Despite this extensive diversity, when represented on a
phylogenetic tree, HIV-1 group M can be divided into well defined clusters.
Nine of these clusters, termed subtypes - labelled A to D, F to H, J and K [6],
are consistent in their phylogenetic topology in relation to each other
regardless of the section of their genome being compared [7]. The
remaining clusters differ in their phylogenetic relationships in a manner that
is dependent on the region of the genome being examined [8]. This is due to
strains falling into these clusters, termed Circulating Recombinant Forms
(CRF), being comprised of mosaic genomes that originated from different
‘parent’ clusters. Such recombinant forms, along with Unique Recombinant
Forms (URFs), arise following dual infection by two (or more) divergent
strains of the virus [9-13] during (-) strand DNA synthesis in a process known
as copy choice recombination [4, 14, 15].
To date there are 34 CRFs published in the Los Alamos HIV Sequence
database. This number is steadily increasing however with newly emerging
CRFs frequently being discovered. It has been estimated that up to 15% of
viral genomes within an individual who has not been dually infected have
been altered by recombination events that occurred after the initial exposure
to the virus [16]. From the number of recombinant forms contributing to the
global pandemic at both an intra- and inter- host level it is clear that
recombination contributes a great deal to the phylogeny of HIV-1 group M
and thus is a major factor influencing diversity that must be considered when
developing any future vaccine.
We have previously observed that breakpoint positions were distributed
across the gp120 region non-randomly following copy choice recombination
within a single cycle system [1]. This non random distribution of breakpoints
has also been observed in vivo across the entire genome [17, 18].
Positioning of breakpoints is believed to be influenced by a number of
mechanistic factors including high sequence identity [1, 15], secondary RNA
structure [19, 20] and the location of runs of identical nucleotides referred to
as homopolymeric stretches (HPS’s) [1, 21]. The latter two are believed to
increase the probability of a breakpoint occurring by stalling the reverse
transcriptase complex during DNA synthesis [22-24], which in turn promotes
the induction of copy choice within regions of high sequence identity.
However, not all recombinant strains are viable.
Preliminary findings [25] suggest that in an in vitro multiple cycle system over
75% of HIV-1 recombinants generated in a dual infection are not able to
replicate. Additionally, if copy choice in vivo does result in a recombinant
genome with the ability to replicate, that mosaic genome must survive
selective pressure from the host’s immune response. Understanding this
selective pressure is vital to the understanding of HIV’s evolution in relation
to drug resistance [26, 27], generation of escape mutants [28, 29], disease
progression [30] and diversity. The ability to compare the locations of
breakpoints present within viable recombinant sequences derived from
established strains contributing to the global pandemic, to the mechanistic
probability of the distribution of breakpoints in the absence of natural
selection, would allow an insight into the selection pressures placed on the
HIV genome.
Here we create a probabilistic model that takes into account sequence
identity between two parental sequences in order to create an expected in
vitro distribution of breakpoints. Sequence identity is accounted for by
disallowing breakpoints to be created directly on mismatches and by
reducing the probability of a breakpoint occurring within a window of
constant size anchored to the left of each mismatch. Windows are anchored
on the left hand side of mismatches as reverse transcriptase traverses the
RNA from the 3’ end to the 5’ end during –ve strand DNA synthesis. An
overview of the model is found in figure 5.1 while a detailed description is
presented in the methods section. Breakpoint distributions generated by the
model, using partial gp120 gene sequences from subtypes A and D as
parentals, are compared to the in vitro breakpoint distributions that were
previously generated [1]. No significant difference between the model’s
predicted distribution and the in vitro distribution was found, confirming that
sequence identity is the major influencing factor in relation to the positioning
of breakpoints.
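The assignment of crossover probabilities described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used here: the function names are hypothetical, "left" is taken to mean lower alignment index, the window size of 5 and the factor 0.37 are the values reported in the Results, and treating 0.37 as the weight of a window position relative to an unconstrained position is an assumption of this sketch.

```python
def crossover_weights(donor, acceptor, window=5, reduced=0.37):
    """Relative crossover weight at each aligned position.

    Mismatch positions get 0 (p1), positions inside a window to the left
    of a mismatch get `reduced` (p2), all other positions get 1 (p3).
    """
    if len(donor) != len(acceptor):
        raise ValueError("sequences must be aligned to the same length")
    weights = [1.0] * len(donor)
    mismatches = [i for i, (a, b) in enumerate(zip(donor, acceptor)) if a != b]
    for m in mismatches:
        for i in range(max(0, m - window), m):
            weights[i] = reduced
    # Apply mismatches last so a window never overwrites a zero weight.
    for m in mismatches:
        weights[m] = 0.0
    return weights


def crossover_probabilities(donor, acceptor, window=5, reduced=0.37):
    """Normalise the weights so that they sum to 1 across the alignment."""
    weights = crossover_weights(donor, acceptor, window, reduced)
    total = sum(weights)
    if total == 0:
        raise ValueError("no position permits a crossover")
    return [w / total for w in weights]


# Toy 16-nucleotide alignment with one mismatch at index 8 (not real HIV data).
donor = "ACGTACGTACGTACGT"
acceptor = "ACGTACGTTCGTACGT"
for position, p in enumerate(crossover_probabilities(donor, acceptor)):
    print(position, round(p, 4))
```

Sampling a breakpoint position from these probabilities many times would give a predicted distribution for a given pair of parental sequences.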
Known breakpoint locations occurring within full length HIV-1 group M
sequences that have been sampled from the group M pandemic were then
compared to the distributions produced by the model using different
combinations of pairs of HIV-1 group M subtype reference strains as
parental sequences. Across much of the HIV genome, it was observed that
the model predicted distributions did not differ significantly from the in vivo
breakpoint distribution frequencies. In the regions where the distributions did
differ, two areas were identified on either side of the envelope gene where
more than the expected numbers of breakpoints were present within the
global data, suggesting a positive selection for breakpoints within these
regions.
Figure 5.1 Prediction of Recombinant Breakpoints
As the RT complex moves from the 3’ end of the donor RNA to the 5’ end it reverse
transcribes the RNA sequence. The RNAse activity of the RT complex is indicated by the
light grey nucleotides on the donor RNA. The nascent –ve DNA strand can be observed
tailing the RT complex. The acceptor RNA strand is aligned alongside the donor strand and
ready for a potential crossover event (known as copy choice) – indicated by the dotted
arrowed line. The dotted boxes indicate windows, of decreased probability of crossover,
that have been anchored on the mismatch at the 3’ end of the potential breakpoint zone.
The probability at each base on the acceptor strand of a crossover occurring is indicated by
p1-3. p1 is the probability of a crossover occurring on a mismatch, p2 is the probability of a
crossover occurring on a base within a window and p3 is the probability of a crossover
occurring away from a mismatch and outside of a window. p1 is set to 0. p2 and p3 depend
on the relationship between the frequency of breakpoints occurring within breakpoint zones
of size 5 or less and the frequency of breakpoints occurring within breakpoint zones greater
than size 5. The plot along the bottom is a representation of each of the probability values
along the sequence. For this stretch of 16 nucleotides the total probability of a crossover
occurring is given by the equation shown.
Results
Model Parameters: In figure 5.2, the normalized distribution of breakpoints
falling into breakpoint zones ranging in size from 0 to 25 is represented
(vertical grey bars). A breakpoint zone is the identical region within an
alignment between any two consecutive mismatches. It can be observed
that as the identity between parental sequences increases (larger zone
sizes) so too does the frequency of breakpoint occurrence. As the in vitro
data was limited to 149 recombinant sequences, there are zones where
breakpoints are under represented – most notably zones of size 12, 16, 22
and 25. In zones of size 19 no in vitro breakpoints were observed. The
expected (random) distribution of breakpoints within different zone sizes has
been added to the plot (black horizontal bars). In zone sizes 3, 4 and 5 the
in vitro breakpoint frequencies fall outside 1.645 standard deviations (90%)
of the expected distribution implying, in agreement with [1], that the lack of
sequence identity around these small zones is reducing the occurrence of
breakpoints. The first significant increase in the in vitro breakpoint
frequencies occurs between zones of size 5 and 6. From zone size 6 and
upwards, where there is more identity present between the parental
sequences the majority of in vitro breakpoint frequencies fall within or above
the expected frequencies. The exceptions to this are zones of size 10, 16,
19, 22 and 25 where the lack of data most probably disrupts the general
trend. No breakpoints were observed directly on mismatches in the in vitro
data.
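The breakpoint zone definition used above can be illustrated with a short sketch. The function names are hypothetical, only zones bounded by two mismatches are counted (the flanking regions before the first and after the last mismatch are ignored), and the random expectation assumes that a randomly placed breakpoint is equally likely to fall on any identical position; the thesis does not spell out these details in this section.

```python
from collections import Counter


def zone_sizes(seq_a, seq_b):
    """Sizes of the identical stretches between consecutive mismatches."""
    mismatches = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    return [right - left - 1 for left, right in zip(mismatches, mismatches[1:])]


def random_expectation(sizes):
    """Expected share of uniformly placed breakpoints for each zone size."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("no identical positions between mismatches")
    positions_per_size = Counter()
    for size in sizes:
        positions_per_size[size] += size
    return {size: n / total for size, n in sorted(positions_per_size.items())}


# Toy alignment (not real HIV data) with mismatches at indices 1, 5, 6 and 11.
sizes = zone_sizes("AAAAAAAAAAAA", "ATAAATTAAAAT")
print(sizes)
print(random_expectation(sizes))
```

Under this assumption a zone of size 0 can never receive a random breakpoint, and larger zones receive proportionally more.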
When the per nucleotide frequency of both the in vitro breakpoints and the
random expectations are organized into groups of size 5 and compared (Fig.
5.2 inset), the significant decrease in the frequency of in vitro breakpoints
(vertical bars) within zones of smaller size (1 - 5) in relation to the random
expectations is confirmed. For the next two groups representing zone sizes
from 6 to 15 no significant difference exists between the in vitro breakpoints
and the random expectations. In group 16 to 20 the in vitro breakpoints are
significantly below the random expectations but here this is as a result of the
sparsity of the in vitro data within these zones. In the last group the in vitro
breakpoint frequencies are significantly higher than the expected breakpoint
frequencies. This is possibly due to a combination of factors including
increased sequence identity, HPS positioning (the larger the zone size the
more chance of HPS’s being present), secondary RNA structure and the
lower frequency of in vitro breakpoints within smaller zones.
Since the first significant increase in frequency occurs between zone sizes of
5 and 6 the window size parameter for our model is set to 5. To estimate the
reduction in the probability of a breakpoint occurring within windows of size 5 a
line of best fit was drawn through the in vitro breakpoint frequency data
falling within zones of size 5 or less. A second line of best fit was drawn
through the in vitro frequency data falling within zones of greater than size 5.
The ratio between the slopes of the two lines indicated that the gradient for
zones of size 5 or less was 0.37 times that for zones of size greater than
5 (Appendix III). This ratio was used to estimate the reduced
probability of a breakpoint occurring within windows of size 5.
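The slope comparison can be sketched as an ordinary least-squares fit on each side of the window size. This is a minimal sketch assuming simple linear regression; the fitting method behind Appendix III is not restated here, the function names are hypothetical, and the frequencies below are synthetic values chosen only to exercise the code.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_ratio(zones, frequencies, window=5):
    """Slope for zones of size <= window divided by slope for larger zones."""
    small = [(z, f) for z, f in zip(zones, frequencies) if z <= window]
    large = [(z, f) for z, f in zip(zones, frequencies) if z > window]
    small_slope = slope([z for z, _ in small], [f for _, f in small])
    large_slope = slope([z for z, _ in large], [f for _, f in large])
    return small_slope / large_slope


# Synthetic frequencies (NOT thesis data): slope 0.002 up to size 5, 0.008 above.
zones = list(range(11))
frequencies = [0.002 * z for z in range(6)] + [0.008 * z - 0.03 for z in range(6, 11)]
print(slope_ratio(zones, frequencies))
```

A ratio below 1 indicates that breakpoint frequency rises more slowly with zone size inside the window than outside it.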
Figure 5.2 Effects of Sequence Identity on Breakpoint Location
The main plot displays the normalized distribution of in vitro breakpoints falling within zones
ranging from size 0 to 25 (grey vertical bars). The horizontal bars indicate the expected
random distribution of breakpoints within each individual zone. The inset plot shows the per
nucleotide frequency of both the in vitro breakpoints and randomly generated breakpoints
for zones up to size 25 (arranged in groups of 5).
In Vitro Breakpoint Predictions: Figure 5.3a compares the in vitro generated
breakpoint distributions (vertical grey bars) to the random expected
distribution (horizontal black bars). The random probabilities display a
flat distribution across the lengths of the parental sequences. A chi squared
test indicates that there is a significant difference (p<0.001) between the
randomly expected distributions and the in vitro distributions confirming that
the in vitro breakpoints are not randomly positioned. In figure 5.3b the
probability distributions obtained when taking only mismatches into account
(i.e. the probability of a breakpoint occurring on a mismatch is 0) are
observed. The in vitro distributions can be observed to fall within 1.645
standard deviations for 5 out of the 10 predicted frequencies. However
statistically there is still a significant difference between the expected
frequencies within each grouping and the in vitro frequencies (p<0.01). In
figure 5.3c, where the probability distributions produced by the full model are
displayed, 8 out of the 10 real breakpoint frequencies fall within 1.645
standard deviations from the predicted values. There is no significant
difference (p>0.05) between the expected frequencies and the in vitro
frequencies. This suggests that our model works well for predicting in vitro
breakpoint positions. In the regions (200 – 300 and 800 – 900) where the in
vitro data are only just within reach of the model predictions, other factors
influencing breakpoint positioning that our model does not take into account
may have a stronger influence.
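The two comparisons used above (a chi-squared statistic over the ten location groups, and a per-group check against 1.645 standard deviations) can be sketched as follows. This is a minimal illustration only: the counts are hypothetical, not the thesis data, and the use of a binomial standard deviation for each group is an assumption, since the text does not state how the standard deviations were obtained. The value 1.645 corresponds to a two-sided 90% interval of the normal distribution.

```python
import math


def chi_squared_statistic(observed, expected_probs):
    """Pearson chi-squared statistic for observed counts against expected probabilities."""
    total = sum(observed)
    return sum((obs - total * q) ** 2 / (total * q)
               for obs, q in zip(observed, expected_probs))


def within_tolerance(observed, expected_probs, z=1.645):
    """For each group, report whether the observed count lies within z SDs of expectation.

    Assumption: the SD of each group is the binomial SD sqrt(N * q * (1 - q)).
    """
    total = sum(observed)
    flags = []
    for obs, q in zip(observed, expected_probs):
        mean = total * q
        sd = math.sqrt(total * q * (1.0 - q))
        flags.append(abs(obs - mean) <= z * sd)
    return flags


# Hypothetical counts for ten groups of 100 sites (illustrative values only).
observed = [30, 12, 8, 20, 25, 15, 10, 18, 22, 15]
flat = [0.1] * 10  # flat (random) expectation across the alignment

stat = chi_squared_statistic(observed, flat)
flags = within_tolerance(observed, flat)
# With 10 groups there are 9 degrees of freedom; the 5% critical value is 16.919.
print(round(stat, 3), sum(flags), "of", len(flags), "groups within 1.645 SD")
```

Replacing `flat` with the summed per-group probabilities from the mismatch-only or full model gives the corresponding comparison for panels B and C.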
Figure 5.3 Predicted Breakpoint Distributions for gp120
In panel A the predicted locations of random breakpoints are displayed as horizontal black
bars. The in vitro distributions are represented by the vertical grey bars. A significant
difference (p<0.001) existed between the in vitro breakpoints and the randomly
generated ones. In panel B the predicted distributions obtained by the simulation taking
only mismatch locations into account are represented by the horizontal black bars. Once
again a significant difference (p<0.01) existed between the predicted frequencies and the in
vitro data. The horizontal black bars in panel C represent the predicted frequency values
produced when using the full model. No significant difference existed between these values
and the observed frequencies.
Global (in vivo) Breakpoints: In figure 5.4, model-predicted breakpoint
distributions for the entire HIV-1 genome are compared to the known in vivo
global breakpoint distribution frequencies. Because of the limited number
of persistent global breakpoints (324) available, as well as the length of the
HIV genome (approx. 10,000 bp before gap stripping), the data are arranged
into windows of size 400. The relative region of the genome can be seen
along the x-axis. With the exception of short regions within the gag and pol
genes, as well as within env, all the global frequencies fall
within 1.645 standard deviations of the predicted values. The global
breakpoint frequencies at the end of the gag and pol genes as well as within
the centre of the env gene are significantly below the model predictions
indicating that there is a suppression of breakpoint persistence within these
areas of the genome. At both the start and end of the env gene the global
breakpoint frequencies are above the model predicted frequencies,
suggesting an increased rate of breakpoint persistence.
Figure 5.4 Predicted Breakpoint Distributions for the Entire Genome
Model-predicted breakpoint locations using full-length HIV-1 genomic sequences, as
described in the materials and methods section, are displayed by the horizontal black bars.
The distribution of in vivo breakpoints (as published in [17]) is represented by the vertical
grey bars. Dark grey indicates where the global data fall significantly below the predicted
distribution for the region in question, light grey indicates where the global data fall within
the predicted distributions, while white indicates where the global data fall above the
model-predicted values. The frequency data have been divided into windows of size 400. Along
the x-axis the various genomic regions of the HIV-1 genome are displayed.
Discussion
Previously we observed that the in vitro mechanics of HIV-1 recombination
appear to be determined by quantifiable mechanical characteristics of the
parental sequences involved [1]. Here we demonstrate that sequence
identity is a major influencing factor on in vitro breakpoint positioning. We
use this to define a probabilistic model that can accurately predict the
location of in vitro recombinant breakpoints. From our initial data (Fig. 5.2)
we calculated that within breakpoint zones of size 5 nucleotides or less there
were 2.7 times fewer breakpoints than within all other zone sizes. Using this
ratio within our model, frequency predictions were obtained that show no
significant differences (p>0.05) to in vitro generated breakpoint distributions
across the partial gp120 genomic region (Fig. 5.3c). These predictions could
be further improved by incorporating other known influences on breakpoint
positioning, such as RNA secondary structure and HPS location, into the
mechanics of the model presented. This is a viable option for future
research as, like that of sequence identity, the influence of RNA secondary
structure and HPS’s on the positioning of breakpoints in vitro is mechanistic
[31].
Our model works well for predicting the locations of in vitro recombinant
breakpoints. However it cannot be directly applied to the group M global
pandemic. This is because, within an individual host, a newly generated
mosaic genome will not necessarily be viable [25]. Natural selection will
further result in many recombinant sequences not being able to contribute to
the pandemic in any significant way thus influencing the persistence of
observed in vivo breakpoints. The production of ‘fit’ recombinant sequences
in vivo goes beyond simple mechanical factors determined by the parental
sequences. For both in vitro breakpoints and model predictions, such
‘fitness’ is not tested. For our probabilistic model to be applied to the global
pandemic, regions where natural selection has a strong influence on
breakpoint persistence must be identified and accounted for.
Conversely our model can be very useful in identifying such regions. With a
plausible method of predicting recombinant breakpoints in the absence of
natural selection, a comparison can be made with breakpoints present within
the globally sampled data. Areas that significantly deviate from the model
predicted breakpoint distributions are potentially those where the effects of
selection pressures due to the host immune response are strongest. This
can be observed in figure 5.4, a representation of both real and predicted
breakpoints across the entire HIV-1 genome where, with the exception of the
env region and a short stretch in each of the gag and pol regions, all the
group M in vivo breakpoint location frequencies fall inside 1.645 standard
deviations from the predicted breakpoint frequencies.
The exceptions, with particular emphasis on the env gene, fall within regions
known to be under high selective pressure [32]. By comparison to our model
predictions it can be observed that on either side of the env gene there is a
severe under-prediction of in vivo breakpoints. This indicates a tendency for
the in vivo shuttling of the entire envelope region as an intact unit between
strains. Whether this is a result of coincidental or sequential recombination
events, it indicates that selection is promoting env’s transfer from one
genetic background into another effectively as an integral cassette. This
must be directly related to the envelope protein’s functional significance in
relation to viral fitness determinants, its propensity to be subject to high
levels of positive selection, and the importance of the action of the immune
response on HIV’s envelope gene [33, 34].
The paucity of recombination breakpoints within the envelope gene itself (but
also in parts of gag and pol) indicates that selection is suppressing
recombination within these regions, which is presumably due to
inter-dependencies within gene regions, for example, for the maintenance of
structural and functional integrity in the context of high viral diversity. Within
the envelope region, such inter-dependencies could be partially related to
the involvement in coreceptor binding of the V1, V2, V3 and C4 regions [35,
36].
That the frequency of recombination in other genomic regions does
not depart significantly from the model predicted expectation (based simply
on the sequence identity within the parental sequences) indicates that purely
mechanistic processes are sufficient to explain breakpoint locations within
these regions. More importantly, in such regions the role of recombination in
relation to immune evasion seems to be limited. Our results emphasise that
detailed mapping of individual recombinant structures, while important,
should be considered in the context of a probabilistic expectation generated
by purely mechanistic processes. The recombination breakpoints that are of
most importance to understanding the HIV/AIDS pandemic are those that
are over-represented against this expectation.
Here we have presented how sequence identity can be used to create a
model that accurately simulates recombination in the absence of natural
selection. We have discussed the usefulness of such a model and how it
needs to be modified before it can be directly applied to the group M
pandemic. These modifications are very tractable and the result would be a
model that could robustly predict the dynamic future of the emerging group
M recombinant strains. This would have far-reaching implications for
future prophylactic strategies aimed at anticipating and preventing the
continued emergence of virulent clusters.
Methods
In Vitro Recombinant Breakpoints
The in vitro recombinant gp120 sequences used are described in [1]. In
summary five parental sequences were used. Two belonged to subtype A
(115A and 120A), while the other three belonged to subtype D (89D, 122D
and 126D). From these isolates the seven pairs of parentals were:
126D/120A, 115A/126D, 115A/89D, 120A/89D, 89D/120A, 126D/122D and
115A/120A. The number of recombinant sequences generated for each
pair, using a single cycle infection system, was 39, 23, 25, 21, 22, 23 and 22
respectively.
In Vitro Breakpoint Distributions
Location Distributions: The frequency of breakpoints occurring at different
locations across the recombinant sequences was previously obtained [1].
For each of the parental pairs these location frequencies were organized into
groups of 100 (e.g. 1 – 100, 101 – 200 etc). Corresponding locations were
pooled across each of the parental pairs.
Distribution within Breakpoint Zones of Different Sizes: The data from each
parental pair was pooled together and the frequency of breakpoints falling
within breakpoint zones of particular sizes was calculated. Normalization
was done for a zone of particular size by dividing the frequency of breakpoint
occurrence by the total number of potential breakpoint zones of that size.
The maximum zone size used was 25 as for larger zone sizes the limited
number and length of sequences meant that data was sparse. The per
nucleotide frequency of breakpoints occurring within zones of each size up
to size 25 was calculated.
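The normalisation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used for the analysis: a breakpoint zone is taken here to be the run of identical sites lying between two consecutive mismatches in the parental alignment (so adjacent mismatches give a zone of size 0), and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def zone_sizes(seq_a, seq_b):
    """Sizes of the identical stretches lying between consecutive mismatches.

    Assumption: a breakpoint zone is the run of identical sites flanked by two
    mismatches; stretches before the first or after the last mismatch are ignored.
    """
    mismatches = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    return [j - i - 1 for i, j in zip(mismatches, mismatches[1:])]


def normalised_zone_frequencies(breakpoint_zone_sizes, all_zone_sizes, max_size=25):
    """Breakpoints per zone size divided by the number of potential zones of that size."""
    observed = Counter(breakpoint_zone_sizes)
    available = Counter(all_zone_sizes)
    return {size: observed[size] / available[size]
            for size in range(max_size + 1) if available[size] > 0}


# Toy parental pair (hypothetical sequences): mismatches at positions 1, 4 and 7.
sizes = zone_sizes("ACGTACGTAC", "AGGTTCGAAC")
# One hypothetical breakpoint mapped to a zone of size 2, out of two such zones.
freqs = normalised_zone_frequencies([2], sizes)
print(sizes, freqs)
```

Dividing each normalised value by the zone size would give a per-nucleotide frequency of the kind plotted in the inset of Figure 5.2.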
Predicted Recombinant Breakpoints
Three independent methods were used to generate predicted breakpoint
frequencies for the parentals described. These were (i) Random breakpoint
prediction, (ii) Breakpoint prediction based on mismatch locations only and
(iii) breakpoint prediction using our full model.
Random breakpoint prediction: The probability of creating a breakpoint, $p_b$,
on any site within the parental alignment is given by
$$p_b = \frac{1}{n}, \tag{1}$$
where n is the length of the alignment.
Breakpoint Prediction Based on Mismatch Locations: To take sequence
identity into account, the probability of a breakpoint being located on a
mismatch is reduced to zero. The probability of creating a breakpoint on any
site that is not a mismatch, $p_b'$, becomes
$$p_b' = \frac{1}{n - m}, \tag{2}$$
where m is the number of mismatches in the alignment.
Breakpoint prediction based on full model: In the full model (Fig. 5.1) there
are three different categories of site. These are: (i) sites located on
mismatches, (ii) sites located within windows of size five nucleotides
downstream of a mismatch and (iii) sites located neither on a mismatch nor
within a window. At each type of site, the probability of a breakpoint
occurring is given by p1, p2 or p3 respectively. Across the full alignment, the
sum of probabilities over all sites is 1. The model can therefore be
summarized as
$$m p_1 + w p_2 + (n - (m + w)) p_3 = 1, \tag{3}$$
where w is the number of nucleotides falling within a window. Since
breakpoints are not allowed on mismatches, p1 is set to zero. We further
define the ratio
$$\alpha = \frac{p_2}{p_3} \tag{4}$$
to represent the factor by which the probability of recombination is reduced
within a window. From equation (3), the model parameters can therefore be
expressed as
$$p_2 = \frac{\alpha}{n - m - w(1 - \alpha)} \tag{5}$$
and
$$p_3 = \frac{1}{n - m - w(1 - \alpha)}. \tag{6}$$
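For completeness, the intermediate algebra leading from equations (3) and (4) to equations (5) and (6) can be written out as below. The steps use only the symbols already defined (n, m, w, α and the three site probabilities); the closing check against equation (2) is added here as a sanity test.

```latex
\begin{align*}
  % Equation (3) with p_1 = 0 (no breakpoints on mismatches):
  w\,p_2 + (n - m - w)\,p_3 &= 1 \\
  % Substitute p_2 = \alpha p_3 from equation (4):
  \alpha w\,p_3 + (n - m - w)\,p_3 &= 1 \\
  p_3\,\bigl(n - m - w(1 - \alpha)\bigr) &= 1 \\
  p_3 = \frac{1}{n - m - w(1 - \alpha)}, \qquad
  p_2 = \alpha\,p_3 &= \frac{\alpha}{n - m - w(1 - \alpha)}
\end{align*}
% Sanity check: for \alpha = 1 (no reduction within windows),
% p_2 = p_3 = 1/(n - m), which recovers equation (2).
```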
To estimate the value of α, a line of best fit was drawn through the
normalised in vitro breakpoint frequency data for zones of size 5 or less. A
second line of best fit was drawn through the data for zones of greater than
size 5. Since the gradient of such a line corresponds to the average
recombination frequency associated with a single nucleotide falling in a
specific category (window / non-window), the ratio of the gradients can be
used to give a value of α = 0.37 (Appendix III).
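A minimal sketch of how the three site categories and equations (5) and (6) translate into per-site probabilities is given below. Several details are assumptions made for illustration: "downstream" is taken to mean increasing alignment position, a window covers the five positions immediately following a mismatch, and a position that is itself a mismatch is not counted as a window site. The function name and the toy alignment are hypothetical. Note that α = 0.37 is consistent with the 2.7-fold reduction reported in the Discussion, since 1/2.7 ≈ 0.37.

```python
def site_probabilities(seq_a, seq_b, alpha=0.37, window=5):
    """Per-site breakpoint probabilities under the full model (equations 5 and 6).

    Assumptions: windows extend `window` positions towards higher indices from
    each mismatch, and mismatch positions are excluded from the window count w.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("parental sequences must be aligned to the same length")
    n = len(seq_a)
    mismatch = [a != b for a, b in zip(seq_a, seq_b)]

    in_window = [False] * n
    for i, is_mismatch in enumerate(mismatch):
        if is_mismatch:
            for j in range(i + 1, min(i + 1 + window, n)):
                if not mismatch[j]:
                    in_window[j] = True

    m = sum(mismatch)    # number of mismatch sites
    w = sum(in_window)   # number of sites falling within a window
    denominator = n - m - w * (1.0 - alpha)
    if denominator <= 0:
        raise ValueError("no sites available for a breakpoint")
    p3 = 1.0 / denominator   # equation (6)
    p2 = alpha * p3          # equation (5)
    return [0.0 if mm else (p2 if win else p3)
            for mm, win in zip(mismatch, in_window)]


# Toy alignment (hypothetical): 20 sites with a single mismatch at index 4.
probs = site_probabilities("A" * 20, "A" * 4 + "C" + "A" * 15)
print(sum(probs))            # the probabilities sum to 1, as required by equation (3)
print(probs[5] / probs[0])   # window site relative to non-window site, i.e. alpha
```

Setting `alpha=1.0` reduces the output to the mismatch-only model of equation (2).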
Random Breakpoint Distributions
For each site in a parental alignment the random probability of a breakpoint
occurring was given by equation (1). The parentals used were the same as
those described for the in vitro data.
Location Distributions: For each parental pair sites were organized into
groups of 100 for which the individual probabilities were summed (e.g. 1 –
100, 101 – 200 etc). These summed probabilities were weighted according
to the number of in vitro breakpoints that were present for the same parental
pair. Corresponding groups across each of the parentals were then summed
to give the expected distribution.
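The grouping, weighting and pooling steps above can be sketched as follows. The interpretation of "weighted" as multiplication by the number of breakpoints observed for the pair is an assumption, and the function names and flat toy probabilities are hypothetical; the counts 39 and 23 are simply the first two values listed for the in vitro parental pairs.

```python
def binned_expectation(site_probs, n_breakpoints, bin_size=100):
    """Sum per-site probabilities within consecutive groups and scale by breakpoint count."""
    return [n_breakpoints * sum(site_probs[start:start + bin_size])
            for start in range(0, len(site_probs), bin_size)]


def pooled_expectation(per_pair, bin_size=100):
    """Pool corresponding groups across parental pairs.

    per_pair is a list of (site_probs, n_breakpoints) tuples, one per parental pair.
    """
    binned = [binned_expectation(probs, count, bin_size) for probs, count in per_pair]
    n_bins = max(len(b) for b in binned)
    return [sum(b[i] for b in binned if i < len(b)) for i in range(n_bins)]


# Toy example: two pairs of length 200 with flat (random) site probabilities.
flat_probs = [1 / 200] * 200
expected = pooled_expectation([(flat_probs, 39), (flat_probs, 23)])
print(expected)
```

The same functions apply unchanged to the mismatch-only and full-model probabilities, and to the genome-wide comparison by setting `bin_size=400`.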
Distribution within Breakpoint Zones of Different Sizes: The total number of
sites falling into breakpoint zones of a particular size was calculated by
multiplying the size by the total number of occurrences of the zone. The
random probability of a breakpoint falling on a site within any zone was then
calculated by dividing 1 by the total number of sites within all potential zones.
For potential zones of a particular size the expected probability of a
breakpoint occurring was calculated by multiplying the probability of a
breakpoint falling on a site within any zone by the length of the potential
zone. The expected probability was then weighted accordingly by
multiplying by the total number of in vitro breakpoints observed within that
zone size.
Breakpoint Distributions incorporating Mismatch Influence
This is similar to the “Location Distributions” from the previous section except
that the probability of a breakpoint occurring on a mismatch was 0. The
probabilities for the remaining sites were calculated using equation (2).
Model Breakpoint distributions
As above, the probability of a breakpoint occurring on a mismatch
was 0. Here, however, equations (5) and (6) were used to calculate the
probabilities across the remaining sites.
Global CRFs
In Vivo Breakpoint distributions: 324 breakpoints from circulating CRFs, over
the full length of the HIV-1 genome, with known locations and parental
subtypes [17] were used to obtain the distribution of breakpoints across the
genome in groups of 400.
Global CRF Model Expectations: The probabilities of breakpoints occurring
at individual sites were calculated from equations (5) and (6). The
probability of a breakpoint occurring on a site where there was a mismatch
between the two parental sequences was 0. The parentals used were the
global group M reference strains that represented the various subtypes seen
to be present within the CRFs [17]. The accession numbers of the parental
strains used were – AF069670 (subtype A), K03455 (subtype B), AF067155
(subtype C), U88824 (subtype D), AF005494 (subtype F), AF061641
(subtype G), AF190128 (subtype H), AF082394 (subtype J), and AJ249235
(subtype K). For each parental pair sites were organized into groups of 400
for which the individual probabilities were summed (e.g. 200 – 600, 601 –
1001 etc). These summed probabilities were weighted according to the
number of in vivo breakpoints that were observed for the same parental pair.
These numbers were: GH (5), HJ (1), JG (10), GK (6), AB (2), DF (7), BF
(66), JA (8), AG (35), DA (83), DC (19), CG (5), CA (50), AK (4), FK (7), CB
(14) and GB (2). Corresponding groups across each of the parentals were
then summed to give the expected distribution that could be directly
compared to the in vivo data.
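As a simple consistency check, the per-pair counts listed above can be tallied and expressed as fractional weights. The counts are taken directly from the text; expressing the weights as fractions of the total is an illustrative choice rather than a statement of how the weighting was implemented.

```python
# In vivo breakpoint counts per parental subtype pair, as listed in the text.
pair_counts = {
    "GH": 5, "HJ": 1, "JG": 10, "GK": 6, "AB": 2, "DF": 7, "BF": 66,
    "JA": 8, "AG": 35, "DA": 83, "DC": 19, "CG": 5, "CA": 50, "AK": 4,
    "FK": 7, "CB": 14, "GB": 2,
}

total = sum(pair_counts.values())
weights = {pair: count / total for pair, count in pair_counts.items()}
largest = max(weights, key=weights.get)

print(total)    # total number of in vivo breakpoints across all pairs
print(largest, round(weights[largest], 3))
```

The total agrees with the 324 breakpoints used for the in vivo distribution.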
References
1. Baird, H.A., et al., Sequence determinants of breakpoint location during HIV-1 intersubtype recombination. Nucleic Acids Res, 2006. 34(18): p. 5203-16.
2. Malim, M.H. and M. Emerman, HIV-1 sequence variation: drift, shift, and attenuation. Cell, 2001. 104(4): p. 469-72.
3. Wei, X., et al., Viral dynamics in human immunodeficiency virus type 1 infection. Nature, 1995. 373(6510): p. 117-22.
4. Jetzt, A.E., et al., High rate of recombination throughout the human immunodeficiency virus type 1 genome. J Virol, 2000. 74(3): p. 1234-40.
5. Korber, B., et al., Evolutionary and immunological implications of contemporary HIV-1 variation. Br Med Bull, 2001. 58: p. 19-42.
6. Robertson, D.L., et al., HIV-1 nomenclature proposal. Science, 2000. 288(5463): p. 55-6.
7. Archer, J. and D.L. Robertson, Understanding the diversification of HIV-1 groups M and O. Aids, 2007. 21(13): p. 1693-700.
8. Robertson, D.L., et al., Recombination in HIV-1. Nature, 1995. 374(6518): p. 124-6.
9. Boisier, P., et al., Nationwide HIV prevalence survey in general population in Niger. Trop Med Int Health, 2004. 9(11): p. 1161-6.
10. Diaz, R.S., et al., Dual human immunodeficiency virus type 1 infection and recombination in a dually exposed transfusion recipient. The Transfusion Safety Study Group. J Virol, 1995. 69(6): p. 3273-81.
11. Fang, G., et al., Recombination following superinfection by HIV-1. Aids, 2004. 18(2): p. 153-9.
12. Salminen, M.O., et al., Evolution and probable transmission of intersubtype recombinant human immunodeficiency virus type 1 in a Zambian couple. J Virol, 1997. 71(4): p. 2647-55.
13. Yerly, S., et al., HIV-1 co/super-infection in intravenous drug users. Aids, 2004. 18(10): p. 1413-21.
14. Yu, H., et al., The nature of human immunodeficiency virus type 1 strand transfers. J Biol Chem, 1998. 273(43): p. 28384-91.
15. Zhang, J. and H.M. Temin, Retrovirus recombination depends on the length of sequence identity and is not error prone. J Virol, 1994. 68(4): p. 2409-14.
16. Taylor, J.E. and B.T. Korber, HIV-1 intra-subtype superinfection rates: estimates using a structured coalescent with recombination. Infect Genet Evol, 2005. 5(1): p. 85-95.
17. Fan, J., M. Negroni, and D.L. Robertson, The distribution of HIV-1 recombination breakpoints. Infect Genet Evol, 2007. 7(6): p. 717-23.
18. Magiorkinis, G., et al., In vivo characteristics of human immunodeficiency virus type 1 intersubtype recombination: determination of hot spots and correlation with sequence similarity. J Gen Virol, 2003. 84(Pt 10): p. 2715-22.
19. Galetto, R., et al., Dissection of a circumscribed recombination hot spot in HIV-1 after a single infectious cycle. J Biol Chem, 2006. 281(5): p. 2711-20.
20. Moumen, A., et al., The HIV-1 repeated sequence R as a robust hot-spot for copy-choice recombination. Nucleic Acids Res, 2001. 29(18): p. 3814-21.
21. Klarmann, G.J., C.A. Schauber, and B.D. Preston, Template-directed pausing of DNA synthesis by HIV-1 reverse transcriptase during polymerization of HIV-1 sequences in vitro. J Biol Chem, 1993. 268(13): p. 9793-802.
22. Derebail, S.S. and J.J. DeStefano, Mechanistic analysis of pause site-dependent and -independent recombinogenic strand transfer from structurally diverse regions of the HIV genome. J Biol Chem, 2004. 279(46): p. 47446-54.
23. Lanciault, C. and J.J. Champoux, Pausing during reverse transcription increases the rate of retroviral recombination. J Virol, 2006. 80(5): p. 2483-94.
24. Roda, R.H., et al., Strand transfer occurs in retroviruses by a pause-initiated two-step mechanism. J Biol Chem, 2002. 277(49): p. 46900-11.
25. Baird, H.A., et al., Influence of sequence identity and unique breakpoints on the frequency of intersubtype HIV-1 recombination. Retrovirology, 2006. 3: p. 91.
26. Margot, N.A., J.M. Waters, and M.D. Miller, In vitro human immunodeficiency virus type 1 resistance selections with combinations of tenofovir and emtricitabine or abacavir and lamivudine. Antimicrob Agents Chemother, 2006. 50(12): p. 4087-95.
27. Perno, C.F., V. Svicher, and F. Ceccherini-Silberstein, Novel drug resistance mutations in HIV: recognition and clinical relevance. AIDS Rev, 2006. 8(4): p. 179-90.
28. Guillon, C., et al., Evidence for CTL-mediated selection of Tat and Rev mutants after the onset of the asymptomatic period during HIV type 1 infection. AIDS Res Hum Retroviruses, 2006. 22(12): p. 1283-92.
29. Pillay, T., et al., Unique acquisition of cytotoxic T-lymphocyte escape mutants in infant human immunodeficiency virus type 1 infection. J Virol, 2005. 79(18): p. 12100-5.
30. Iversen, A.K., et al., Conflicting selective forces affect T cell receptor contacts in an immunodominant human immunodeficiency virus epitope. Nat Immunol, 2006. 7(2): p. 179-89.
31. Konstantinova, P., et al., Hairpin-induced tRNA-mediated (HITME) recombination in HIV-1. Nucleic Acids Res, 2006. 34(8): p. 2206-18.
32. Seibert, S.A., et al., Natural selection on the gag, pol, and env genes of human immunodeficiency virus 1 (HIV-1). Mol Biol Evol, 1995. 12(5): p. 803-13.
33. Choisy, M., et al., Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol, 2004. 78(4): p. 1962-70.
34. Marozsan, A.J., et al., Differences in the fitness of two diverse wild-type human immunodeficiency virus type 1 isolates are related to the efficiency of cell binding and entry. J Virol, 2005. 79(11): p. 7121-34.
35. Cardozo, T., et al., Structural basis for coreceptor selectivity by the HIV type 1 V3 loop. AIDS Res Hum Retroviruses, 2007. 23(3): p. 415-26.
36. Hoffman, N.G., et al., Variability in the human immunodeficiency virus type 1 gp120 Env protein linked to phenotype-associated changes in the V3 loop. J Virol, 2002. 76(8): p. 3852-64.
Chapter 6: A Strategy for Identifying Significant HIV Sequence Diversity Using Structural Constraints
Abstract
We present a strategy for identifying meaningful sequence diversity to aid
HIV vaccine design. The approach is based on the identification of likely
amino acid replacements in the context of physical constraints imposed by
three-dimensional protein structure. We apply our model to HIV-1 Gag and
demonstrate that prediction of amino acids present in 1,100 group M
sequences is statistically significant. The structure-based model is, thus,
accurately identifying the important diversity at specific sites. Our approach
has a tendency to under-predict residues, mainly due to the presence of
improbable amino acids in sequence data; residues in the observed data
that are of limited relevance since they are either present in non-viable
viruses or result from sequencing artefacts. These residues can be identified
because they lack shape complementarity with the rest of the protein. Were
the virus to synthesize proteins with these replacements, an unfolded
non-functional protein would likely result. These improbable replacements can,
thus, be considered functionally and evolutionarily irrelevant and so do not
need to be considered in vaccine design. To assess the effect of removing
these unlikely residues we use a novel algorithm to quantify the number of
optimised sequence-constructs required to “cover” observed HIV-1 P17
diversity. Applying this strategy to two large HIV-1 data sets (>1,000
sequences) we achieve a marked improvement in mean coverage estimates.
For example, the number of constructs required for 90% coverage of P17
alignments decreases from 133 to 71 for group M and for subtype B from 60
to 30. Thus, removal of improbable amino acids is a rational approach to
reducing the diversity present in HIV sequence alignments. Incorporation of
further functional and immunological constraints will permit prediction of viral
evolution that is both significant to the immune response and increasingly
accurate.
Introduction
The greatest impediment to producing an HIV vaccine is the high rate of viral
evolution. This is a consequence of HIV’s high rate of mutation [1],
propensity to recombine [2, 3], high rate of viral turnover [4, 5] and the
actions of the immune response promoting positive selection [6-8]. The
ability of HIV to change rapidly leads to high levels of sequence diversity
both within and between infected individuals, enabling a persistent infection
to be maintained despite the actions of the immune response [9-11]. To be
effective in such a situation, a preventative vaccine needs to be protective in
the face of the many divergent viral variants present in the infected human
population. Additionally a therapeutic vaccine needs to protect against the
generation of new variants in the context of one individual’s on-going
infection. For this reason extreme sequence diversity severely complicates
the choice of candidates for a potential HIV vaccine.
A significant body of work exists on the appropriate consideration of HIV
variation in vaccine design strategies and various methods have been
discussed. These include explicit consideration of HIV-1 group M’s
evolutionary history to choose “central” candidates, for example, ancestral
and consensus sequences, and “centre of tree” sequences based on single
or multiple subtypes [12-15]. These approaches assume that all of the
diversity observed in the sequence data sampled from the pandemic is
meaningful. However, a subset of HIV’s diversity will be of limited
importance and so does not need to be considered in vaccine preparations,
be it a preventative or therapeutic approach. This “irrelevant” variation will
either be present in non-viable viruses or result from sequencing artefacts.
Here we present a strategy for identifying the non-significant changes.
These residues can be identified because they lack shape complementarity
with the rest of the protein. When we model these replacements, there are
substantial unfeasible overlaps between atoms (see Fig. 6.1A for an
example). Were the virus to synthesize proteins with these replacements an
unfolded and non-functional protein would likely result. These residues will
be present in real sequence data but predicted to be highly improbable by
our model. If real they would produce non-functional viruses, and so will not
contribute to escape from the immune response. Thus, they are
evolutionarily irrelevant.
Figure 6.1 Predicting HIV Evolution at Individual Sites
(A) Matrix protein P17’s structure and illustration of the “goodness-of-fit” aspect of the
model. Each site in P17 has every amino acid substituted computationally (panel B). In the
inset, position 85 (leucine) is replaced with methionine (dark blue). In this conformation the
methionine has substantial van der Waals overlaps, as illustrated by the pink and yellow
spikes, and so would be considered an improbable replacement. (B) The algorithm used for
the predictive model. All low energy side chain conformations (“rotamers”) are tested, and
the molecular interactions between this residue and the rest of the protein are calculated
with PROBE to determine the “goodness-of-fit” at a specific site. Residues that pass this
test are subsequently filtered using a substitution matrix to determine whether the
substitution/replacement is likely. See methods for further details.
Our predictive strategy can be combined with “coverage” approaches which
involve the identification of multiple candidates for inclusion in artificial
sequence-constructs and a potential polyvalent (or “cocktail”) vaccine [16,
17]. The objective is to build mosaic sequences that will optimally cover the
diversity of HIV sequences encountered by the immune system allowing
recognition of the whole diverse set of viruses, or at the very least viruses
corresponding to a particular subtype of epidemiological significance. There
are, however, practical problems with including too many sequence-constructs
in the cocktail [15]. The challenge, therefore, is to describe the
maximum amount of observed HIV sequence diversity with the fewest
number of artificial sequence-constructs. This involves the identification of
multiple vaccine antigen-constructs that are optimised to maximise coverage
of circulating variants and that include high-frequency epitopes [18, 19].
As a proof of concept we apply our strategy to HIV-1’s Matrix protein, P17.
P17 is processed from the Gag polyprotein, is associated with the inner viral
membrane, and has a number of functions essential for viral assembly and
entry into the infected cell [20]. We have chosen this protein primarily
because of its known immunogenic importance; for example see Iversen et
al. [21]. We demonstrate that (i) amino acid usage at individual sites is
significantly constrained, (ii) our model accurately predicts real viral diversity
and (iii) removal of improbable amino acids leads to a marked increase in
coverage estimates. To increase the accuracy of the predictions of viral
evolution further structural and functional information can be incorporated
into the model.
Results
Predicting viral evolution
We first use our model (Fig. 6.1B) to predict the probable and improbable
replacements at each site in HIV-1’s Matrix protein, P17 (Fig. 6.2A). The
improbable amino acids are of two types.
Figure 6.2 True Positives, False Positives and True Negatives
(A) Venn diagram showing the relationship between observed and predicted amino acids at
individual sites in a sequence alignment. The intersect between the predicted and observed
amino acids corresponds to our correct predictions (true positives, TP; blue). Improbable
amino acids are those that have not been observed to be used by HIV (true negatives, TN;
green) or have been observed in sequence data but are not predicted to be important (false
negatives, FN; red). “Other constraints” accounts for those predictions not observed in the
sequence data (false positives, FP; yellow); if these constraints were included in the model
this over-predicting would not occur. Note, the areas are not to scale and are for illustrative
purposes only. (B) The number of residues predicted by the model at each site in P17 (total
bar height). The number of these that are observed in the alignment (TPs) and the number
predicted but not observed (FPs) are shown. The black squares denote sites that are
significantly different from random (P<0.05). (C) Bar chart depicting the frequencies of
residues predicted (black) and observed (light grey) in P17. The number observed per site
is taken from the alignment of 1,091 unique HIV-1 group M sequences. (D) The number of
residues observed (total bar height). The proportion of these that are predicted (TPs) and
the proportion observed but not predicted (FNs) are shown. Stars indicate sites inferred to
be under the influence of positive selection [7]. The black squares denote sites that are
significantly different from random (P<0.05). (E) The number of residues not observed in
the alignment (total bar height). The number of these not predicted and not observed (TNs),
and the number predicted but not observed (FPs) are shown. The black squares denote
sites that are significantly different from random (P<0.05).
The first comprises residues never observed to have been used by HIV. The
second comprises residues observed in the available sequence data that are evolutionarily
irrelevant because they give rise to non-functional viruses. In all we find 110
positions out of 115 are predicted to change, i.e., exhibit more than one
probable residue per site (Appendix IV and total bar height in Fig. 6.2B).
The distribution of these predicted replacements is shown in Fig. 6.2C (black
bars). This strikingly demonstrates the high degree of constraint at individual
sites such that the mean number of predicted amino acids per site is only
five, with the maximum being nine. Note, the consensus residue was not
predicted for only six sites (Fig. 6.3). These rare erroneous predictions are
presumably the result of the P17 structure having been solved for one
sequence only, a problem that would be resolved with the availability of more
P17 structures.
Figure 6.3 Amino Acid Frequencies within P17
The amino acid frequencies of all true positives (blue) versus all false negatives (red) at
each site in P17.
To determine the accuracy of the predictions we compare them to the
number of residues observed at each site in the group M P17 sequence
alignment (indicated by total bar height in Fig. 6.2D). Of the 115 sites in the
alignment, only one site was 100% conserved within all the sequences. The
mean number of amino acids per site was observed to be seven while the
maximum was 15. Comparing the distribution of these observed residues to
those predicted (Fig. 6.2B) we find they are not significantly different
(P=0.052, Mann-Whitney test). The result is marginal as there is a clear
tendency for the real data to include more amino acids than predicted (Fig.
6.2C and D). If this under-prediction reflects the presence of mostly
irrelevant amino acids and not the failing of the model, then the residues not
predicted (FNs) should occur at significantly lower frequencies than those
that are predicted (TPs). We find this is the case with FNs occurring at
significantly lower frequencies, 2% on average versus 25% for TPs (Fig. 6.3;
P < 0.0001, Wilcoxon rank-sum test). Of the 207 amino acids that only occur
once in the group M data, 70% are FNs. In addition, six of the eight
positively selected sites that have been identified in the P17 region [7]
correspond significantly to sites with high numbers of observed amino acids
(Fig. 6.2D, P < 0.05, Wilcoxon rank-sum test). We under-predict at these
sites due, presumably, to co-variation effects not accounted for by the model
(see Discussion).
To investigate the statistical significance of our predictions we calculate
positive predictive values, sensitivity and specificity (Appendix V), and
compare these to 10,000 random simulations. A positive predictive value
quantifies how often our predictions are correct, i.e., present in the real data
(TP), out of all the predicted residues at each site (blue bars relative to total
bar height in Fig. 6.2B) compared with random predictions of the same
number of amino acids at each site. The mean of each random simulation
was never greater than the mean positive predictive value (0.71; P < 0.0001)
and these values were statistically significant at 58% of the sites (P < 0.05,
Appendix V and indicated with black squares in Fig. 6.2B).
In terms of the sensitivity of our method, i.e., whether we predict all the
amino acids at a site, we have established the model under-predicts at the
majority of sites, which will lead to lowered sensitivity. This is certainly the
case as indicated by blue bars relative to total bar height in Fig. 6.2D.
Nonetheless the mean of each random simulation was never greater than
the mean sensitivity (0.61; P < 0.0001) and these values were statistically
significant at 63% of sites (P < 0.05, Appendix V and indicated with black
squares in Fig. 6.2E). In terms of specificity, i.e., how often we predict a
residue not to be present, the results were also significant (mean = 0.89; P <
0.0001) with 65% of sites being statistically significant (P < 0.05, Appendix V
and indicated with black squares in Fig. 6.2E). This confirms that our
predictions of probable amino acid replacements are highly accurate as the
model is making very few incorrect predictions.
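The three per-site statistics can be sketched in a few lines. This is a minimal illustration that assumes the standard definitions over the 20 amino acids (the exact formulation used is given in Appendix V); the function names and the fixed random seed are hypothetical choices made for this example only.

```python
import random

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def site_metrics(predicted, observed):
    """Return (PPV, sensitivity, specificity) for one alignment site."""
    predicted, observed = set(predicted), set(observed)
    tp = len(predicted & observed)                # predicted and observed
    fp = len(predicted - observed)                # predicted but not observed
    fn = len(observed - predicted)                # observed but not predicted
    tn = len(AMINO_ACIDS - predicted - observed)  # neither predicted nor observed
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return ppv, sensitivity, specificity

def random_ppv_p_value(predicted, observed, n_sim=10_000, seed=0):
    """Fraction of random predictions of the same size whose PPV is at
    least as high as that of the real prediction (assumed test form)."""
    rng = random.Random(seed)
    actual = site_metrics(predicted, observed)[0]
    pool = sorted(AMINO_ACIDS)
    size = len(set(predicted))
    hits = sum(
        site_metrics(rng.sample(pool, size), observed)[0] >= actual
        for _ in range(n_sim)
    )
    return hits / n_sim
```

For a site where A, G and S are predicted and A, G and T are observed, `site_metrics("AGS", "AGT")` gives two TPs, one FP, one FN and 16 TNs.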
Using predictions to inform vaccine design
In order to quantify coverage requirements, we first determine the
distribution of the number of unique nine-mers present in HIV-1 sequences
across P17 for the global group M and for subtype B data using large
sequence alignments: 1,091 and 1,405 unique sequences, respectively (Fig.
6.4A and B, respectively). We focus on nine-residue fragments because
there has been much discussion of the possibility of a T-cell based vaccine
[22]. A potential coverage-approach is, thus, to produce a polyvalent T-cell
based vaccine comprised of optimised nine-residue peptide fragments
(“nine-mers”; the average length of relevant epitopes) that occur in the
observed HIV-1 sequences [16, 17]. The graphs (Fig. 6.4A and B) show the
highly heterogeneous nature of all nine-mers across P17. Next we exclude
the improbable amino acid residues (specifically FNs) from the P17
alignments, as these do not need to be considered in coverage estimates.
This dramatically reduces the numbers of nine-mer combinations (Fig. 6.4A
and B) with the average falling from 142 to 98 for group M and from 109 to
73 for subtype B. So as to be conservative, in all analyses, all residues were
retained at the five sites for which the consensus residue was not predicted
and the sites identified as being positively selected because the model is
clearly not predicting accurately at these positions.
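The nine-mer counts per position can be illustrated as follows. This is a sketch only: it assumes an alignment held as a list of equal-length strings, and it assumes that a nine-mer is discarded when it contains a residue outside the predicted set for its site, which is one plausible reading of "excluding improbable residues". The `probable` argument (one set of residues per column) is a hypothetical input.

```python
def unique_kmer_counts(alignment, k=9, probable=None):
    """Number of distinct k-mers starting at each alignment column.

    If `probable` is given, windows containing a residue that is not in
    the predicted set for its column are skipped.
    """
    length = len(alignment[0])
    counts = []
    for start in range(length - k + 1):
        kmers = set()
        for seq in alignment:
            window = seq[start:start + k]
            if probable is not None and any(
                residue not in probable[start + offset]
                for offset, residue in enumerate(window)
            ):
                continue  # window contains a residue the model calls improbable
            kmers.add(window)
        counts.append(len(kmers))
    return counts
```

Calling the function with and without `probable` on the same alignment gives the two distributions being compared.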
To determine the impact of removing improbable amino acids in this way, a
novel algorithm (Fig. 6.5) that selects high-frequency nine-mers in a
sequence alignment was developed. The algorithm optimizes the process of
sequence construction such that the initial constructs provide the most
coverage. Unlike previous approaches, a given number of sequence-
constructs guarantees a certain percentage of cover. For example, to
achieve 75% coverage of the group M sequences, 26 sequence constructs
are required (Fig. 6.4C).
Figure 6.4 Distribution of Nine-mers in Relation to Coverage
The distribution of nine-mers across P17 alignments for (A) 1,091 unique HIV-1 group M
and (B) 1,405 unique subtype B sequences for all data (blue) and after removal of
improbable amino acids (orange). The coverage provided by sequence-constructs for the
same (C) HIV-1 group M and (D) subtype B alignments for: all data/no model (blue), after
removal of improbable amino acids as determined by our model for both all of P17 (orange)
and, for subtype B only, an optimal epitope region [23] from sites 11 to 44 (green). Note, the
latter CTL rich region corresponds to sites 8 to 41 in our alignment (transparent green box).
Figure 6.5 Generation of Optimised Sequence Constructs
Schematic representation of the algorithm used to make the optimised sequence-constructs.
Straight arrows represent transitions between the steps of the algorithm. Circular arrows
indicate where multiple iterations of a step occur. Boxes represent list objects.
When we apply our model to exclude improbable residues we require only
16 sequence constructs for 75% coverage, while for 90% coverage the number of sequences
falls from 133 to 71. Focusing just on subtype B (Fig. 6.4D), the number of
optimised sequence-constructs is even lower still: 30 as opposed to 60 to
give 90% coverage with and without the model, respectively. Applying our
model to a CTL-dense region in P17 (Fig. 6.4B), for 75% and 90% coverage
only 3 and 18 sequence-constructs are required (Fig. 6.4D), while for 95% and
98% coverage 41 and 95 sequence-constructs are required, respectively.
Although these numbers of sequence-constructs are still high for current
vaccine technology, these results demonstrate how HIV’s seemingly
enormous diversity can be reduced to potentially manageable proportions by
focusing on the prediction of probable versus improbable evolution at
individual sites.
Application of the replacement model, thus, leads to a dramatic decrease in
the number of sequence constructs required at all levels of coverage and
performs markedly better than consensus sequences (dashed lines in Fig.
6.4C and D). The reduction in the number of nine-mers required for a given
level of coverage, after use of the model, is approximately equivalent to
removing those amino acid residues that occur at a frequency of 1% and
0.5% for the group M and subtype B alignments, respectively (Fig. 6.6). A
straight-forward amino acid frequency cut-off will inevitably lead to a
reduction in the number of nine-mers required because it artificially reduces
the sequence diversity from that actually observed in the HIV population.
However, although there is a significant tendency for the identified
improbable residues (FNs) to be at low frequencies a minority are not (19%
exceed a frequency of 1% in the group M alignment), while a high proportion
of probable residues (TPs) are at low frequencies (47% are at a frequency
lower than 1% in the group M alignment). Thus, although in terms of
coverage the model and specific percentage threshold are similar, the model
reduces diversity in a justifiable manner.
Figure 6.6 Sequence Coverage
The coverage provided by sequence-constructs for (A) the HIV-1 group M and (B) subtype B
alignments for: all data/no model (dark blue), after removal of improbable amino acids as
determined by our model (orange), removal of amino acids occurring at a frequency of less
than 0.5% (green), 1% (purple) and 2% (light blue). The mean coverage for a random
control is shown (black); the vertical error bars include 95% of the coverage scores. The
dashed black line indicates coverage provided by the alignment’s consensus sequence.
Note, these are the same distributions as in Fig. 6.4C and D with the inclusion of percentage
cut-offs.
Discussion
Our model, despite being relatively simple, accurately predicts the subset of
amino acids that are present in the majority of P17’s variable sites (Fig. 6.2).
The approach works because there are strong structural constraints acting
on viral proteins such that the amino acids that can occur at individual sites
are restricted. This link between sequence and structure imposes physical
restrictions both on which sites will change and the nature of the change that
can occur; consequently structure affects the distribution of amino acids
observed. Inclusion of additional constraints at functional sites [20], the
requirement to maintain binding interfaces [24], and intra- and inter-
molecular interactions [25] will place further limits on change. The model
can also be improved by the introduction of structural plasticity [26, 27]. The
quantification of these additional constraints on viral evolution will permit,
with even greater accuracy than achieved here, the prediction of the
evolutionary trajectory of HIV at specific sites.
Protein structural and functional constraints determine whether an amino
acid altering change is neutral (of little or no fitness-cost), prone to purifying
selection (a high cost-of-fitness to the virus in terms of structure and/or
function) or positively selected (immediately beneficial to viral fitness). As a
consequence the so-called high “cost-of-fitness” residue replacements
associated with escape mutations [28] are likely to be improbable according
to our model (Fig. 6.2A). Thus, at these highly specific sites our results are
very probably underestimates; although some of these may correspond to
the positively selected sites for which all residues were retained in our
analysis. Nonetheless our approach based on prediction of probable amino
acid replacements, in the context of protein structure, is the first attempt to
understand the limits of viral evolution. Crucially, the high cost-of-fitness
mutations associated with immune escape require compensatory mutation(s)
[29-31]. Indeed it is the occurrence of compensatory mutation that changes
an amino acid’s replacement from improbable to probable. Thus,
understanding in greater detail the nature of compensatory co-variation both
in terms of intra- and inter-molecular interactions will permit the design of
increasingly sophisticated and representative sequence-constructs.
Ultimately a detailed understanding of the limits of HIV’s ability to change will
permit the accurate deduction of the extent of HIV variation that any vaccine
must protect against. The key to success in realising a viable vaccine will be
the identification of low numbers of optimised (and immunogenic) sequence-
constructs, while simultaneously increasing percentage coverage. The
difficulty in achieving this is dramatically highlighted by the non-linear
relationship between increasing percentage coverage and numbers of
sequence-constructs (Fig. 6.4C and D). This means that coverage of 75% of
the diversity in a large subtype B sequence alignment can be provided by
just six optimised sequence-constructs but further increases in coverage
require disproportionately more artificial sequences, such that 50 constructs
are required for 93% coverage and 100 for 97% coverage. This is because
achievement of very high levels must provide coverage of the more variable
regions of P17 (Fig. 6.4A and B).
In addition to dealing with HIV’s diversity, and probably the most difficult task
in vaccine design, the model-based sequence-constructs will have to be
further improved by manipulating them to be immunogenic [14, 18, 19, 32,
33]. Notwithstanding complexities such as immunodominance and cross-
reactivity [22, 33, 34], this will require the identification of a subset of mosaic-
construct sequences that are optimised for processing and immune
recognition. The importance of our approach is that it is a strategy for
reducing the amount of HIV variation that will need to be considered in this
endeavor. Previous approaches, because they ignore structural and
functional constraints, will have over-estimated the diversity that the immune
response has to deal with.
In conclusion, we can make rational predictions about significant viral
variation based on what are possible, probable, and importantly improbable
amino acid replacements at specific locations. Thus, a predictive strategy is
a more effective paradigm for vaccine design than trying to cope with all
global HIV-1 sequence diversity. This is primarily because we are focussing
on the requirement of both infecting strains and escape mutants to be viable,
as opposed to the often error-laden and functionally, let alone evolutionarily,
irrelevant sequence data that has been sampled from the HIV/AIDS
pandemic. Rational vaccine design, if it is to succeed, must pay greater
attention to quantifying the subset of HIV variation that contributes to on-
going evolution. It is this diversity that will include the variants of
immunological significance in the future.
Methods
The model
The protein structure 1HIW was selected from the Protein Data Bank [35] as
being representative of HIV-1’s Matrix protein, P17 [36]. Hydrogen atoms
were added to the structure in optimum positions using REDUCE [37]. Each
position within the protein structure was analyzed by replacement of the
existing side-chain with all other amino acid side-chains using PREKIN [38].
The goodness-of-fit of these substituted amino acids was then assessed
[39]. The side-chains of each amino acid were rotated in steps of 5° around
each of their χ angles to known rotamer boundaries using a rotamer library
[40] in order to optimise efficiency and minimize computational time. After
each rotation step, all-atom contact measurements were carried out using
PROBE [41] to assess how well the residue would fit within the local
structure of the molecule in the given conformation. PROBE uses the rolling
probe algorithm [42] to recognize regions where steric clashes between
atoms occur. Once this had been done for all amino acids at a particular
position, those residues with a PROBE score > -1 were selected. These were
amino acids with at least one rotameric conformation that did not cause
steric clashes within the existing local protein structure. We predict that
replacements of these residues into the structure would not cause local
structural disruption and are, therefore, more likely to occur than
replacements requiring the local structural environment to shift in order to
accommodate them.
Following the goodness-of-fit analysis, replacement tables were used to
further filter the predicted amino acids at each position in the structure.
Replacement tables were used in order to account for propensity issues
such as the likelihood of buried charges and exposed hydrophobic side-
chains, which would not be taken into account in predictions using
goodness-of-fit alone. The PAM10 [43] replacement matrix was selected to
account for the degree of sequence diversity within the protein family. The
likelihood of each of the amino acids occurring at this position was
assessed using the replacement table and an empirically derived cut-off of
0.005 was used to reduce the predicted set of amino acids further.
Amino acids which exceeded the cut-offs for goodness-of-fit and
replacement likelihood formed our predicted set of “probable” residues for
each site in the protein structure (Fig. 6.1B). These are amino acids which, if
substituted into the protein, would cause minimal structural disruption.
Amino acids not predicted by the model are those that can be discounted
from inclusion in coverage analysis. We do not consider false positive
predictions in coverage estimates (amino acids predicted but not observed),
as these are presumed to be residues that are absent from the real data as a
result of other constraints not quantified by the model (Fig. 6.2A).
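The two-stage filter can be summarised in a short sketch. It assumes that, for one site, the best PROBE score over all rotamers and the replacement-table likelihood of each candidate amino acid are already available as dictionaries; both inputs, and the function name, are hypothetical, and only the two cut-off values are taken from the text.

```python
PROBE_CUTOFF = -1.0        # goodness-of-fit threshold quoted in the text
LIKELIHOOD_CUTOFF = 0.005  # replacement-table threshold quoted in the text

def probable_residues(best_probe_score, replacement_likelihood):
    """Amino acids passing both the goodness-of-fit and replacement filters.

    best_probe_score: amino acid -> best PROBE score over its rotamers.
    replacement_likelihood: amino acid -> replacement-table likelihood.
    """
    return {
        aa
        for aa, score in best_probe_score.items()
        if score > PROBE_CUTOFF
        and replacement_likelihood.get(aa, 0.0) > LIKELIHOOD_CUTOFF
    }
```

With made-up scores, a residue that fits sterically but falls below the replacement cut-off is excluded, as is a residue with an acceptable replacement likelihood that clashes in every rotamer.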
Data sets
All HIV-1 sequences for which a near full-length genome is available were
downloaded from the LANL HIV Sequence Database (www.hiv.lanl.gov).
This resulted in a data set of 1,179 aligned sequences after strains with
identical names were removed. This data set included 1,091 unique strains.
Sequences corresponding to P17 were extracted, translated to amino acids
and aligned using Muscle [44]. For subtype B all available P17 sequences
were downloaded from the HIV Sequence Database. This resulted in a data
set of 1,405 unique strains. These were translated to amino acids and
aligned using Muscle. Sites that included more than 95% gaps were
removed from all amino acid sequence alignments.
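The gap filter is straightforward to express. The sketch below assumes the alignment is a list of equal-length strings with "-" as the gap character; the function name is illustrative.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.95, gap="-"):
    """Remove columns in which more than `max_gap_fraction` of sequences are gaps."""
    n_sequences = len(alignment)
    keep = [
        col
        for col in range(len(alignment[0]))
        if sum(seq[col] == gap for seq in alignment) / n_sequences <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]
```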
The coverage algorithm
Our algorithm maximizes the coverage across an alignment by finding the
best nine-mers that can be pieced together to form multiple sequence-
constructs (see Fig. 6.5 for further details). Nine-mers that provide the
highest coverage at specific locations are chosen preferentially. The first
sequence-construct generated will, thus, provide the highest coverage.
Subsequent constructs each provide less coverage, as lower frequency
nine-mers are included. Combined, the different constructs provide maximal
coverage across HIV’s sequence diversity.
The local coverage provided by an individual nine-mer within a sequence-
construct is the coverage that is provided across all of the sequences within
an alignment at the location where the nine-mer occurs. The mean
coverage provided by a sequence-construct is the average of all the local
covering scores provided by each nine-mer across the sequence. Note, this
approach is not independent of location and so is distinct from a previous
definition of covering score [17] and discussed by Fischer et al. [45].
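The two coverage scores defined above can be written down directly. This sketch assumes an ungapped view of the alignment as equal-length strings and a construct of the same length; the function names are illustrative.

```python
def local_coverage(alignment, kmer, start):
    """Fraction of aligned sequences carrying `kmer` at column `start`."""
    k = len(kmer)
    return sum(seq[start:start + k] == kmer for seq in alignment) / len(alignment)

def mean_coverage(alignment, construct, k=9):
    """Average of the local coverage scores of every k-mer in `construct`."""
    starts = range(len(construct) - k + 1)
    scores = [local_coverage(alignment, construct[s:s + k], s) for s in starts]
    return sum(scores) / len(scores)
```

Because each nine-mer is scored only at the location where it sits in the construct, the score depends on location, which is the distinction drawn in the text.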
When constructing sequences our algorithm optimises the mean coverage
score within each construct. It also maximizes the number of unique nine-
mers in the sequence-construct set. Maximizing local coverage scores
ensures the inclusion of nine-mers that provide a high level of coverage of
the diversity present among HIV sequences. Nine-mers are added to the
construct in the order of their local coverage score at their optimal location
(Fig. 6.5). The optimal location is the location within the alignment where the
individual nine-mer provides the optimal achievable coverage. After a nine-
mer is included within a construct it is excluded from any future selection at
that location. Once a full sequence has been constructed overlapping nine-
mers that have not previously been removed are excluded from future
selection. This maximises the coverage by ensuring that we are not
generating redundancy due to nine-mer repetition between the different
sequence-constructs.
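A simplified greedy version of this procedure is sketched below. It is not the algorithm of Fig. 6.5: the tie-breaking order, the compatibility check between overlapping nine-mers, and the column-wise fallback used for positions that no remaining nine-mer can fill are all assumptions made for this example.

```python
from collections import Counter

def build_constructs(alignment, n_constructs, k=9):
    """Greedily assemble sequence-constructs from high-coverage k-mers."""
    length = len(alignment[0])
    # How often each k-mer occurs at each start position (local coverage count).
    counts = [Counter(seq[s:s + k] for seq in alignment) for s in range(length - k + 1)]
    # Column-wise most common residue, used only to fill positions left empty.
    fallback = [Counter(seq[c] for seq in alignment).most_common(1)[0][0]
                for c in range(length)]
    used = set()  # (start, k-mer) pairs already present in an earlier construct
    constructs = []
    for _ in range(n_constructs):
        # Remaining k-mers, highest local coverage first.
        candidates = sorted(
            ((count, start, kmer)
             for start, counter in enumerate(counts)
             for kmer, count in counter.items()
             if (start, kmer) not in used),
            key=lambda item: (-item[0], item[1], item[2]),
        )
        construct = [None] * length
        for _count, start, kmer in candidates:
            window = construct[start:start + k]
            # Place a k-mer only if it agrees with residues already fixed
            # and fills at least one empty position.
            compatible = all(r is None or r == a for r, a in zip(window, kmer))
            if compatible and None in window:
                construct[start:start + k] = list(kmer)
        filled = "".join(r if r is not None else fallback[c]
                         for c, r in enumerate(construct))
        # Exclude every k-mer of this construct from later selection at its location.
        used.update((s, filled[s:s + k]) for s in range(length - k + 1))
        constructs.append(filled)
    return constructs
```

On a toy alignment the first construct collects the most frequent fragments and the second picks up the remaining variants, mirroring the decreasing coverage of successive constructs described above.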
Random controls for Fig. 6.4 were generated after each iteration of the
algorithm by randomly sampling a number of sequences from the input data
set 500 times. For each sample the mean coverage provided was
calculated. The number of sequences sampled corresponded to the number
of sequence-constructs present at that point during the algorithm’s progress.
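The random control can be sketched as follows. The way coverage is scored for a set of sampled sequences (a nine-mer counts as covered if any sampled sequence carries it at that location) is an assumption, as are the function names and the fixed seed; the 500 samples and the 95% interval follow the text and the legend of Fig. 6.6.

```python
import random

def set_coverage(alignment, chosen, k=9):
    """Mean, over start positions, of the fraction of aligned sequences whose
    k-mer at that position matches that of at least one chosen sequence."""
    starts = range(len(alignment[0]) - k + 1)
    total = 0.0
    for s in starts:
        available = {seq[s:s + k] for seq in chosen}
        total += sum(seq[s:s + k] in available for seq in alignment) / len(alignment)
    return total / len(starts)

def random_control(alignment, n_sequences, k=9, n_samples=500, seed=0):
    """Mean coverage of random samples, with the central 95% of scores."""
    rng = random.Random(seed)
    scores = sorted(
        set_coverage(alignment, rng.sample(alignment, n_sequences), k)
        for _ in range(n_samples)
    )
    mean = sum(scores) / n_samples
    low = scores[int(0.025 * n_samples)]
    high = scores[int(0.975 * n_samples) - 1]
    return mean, low, high
```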
Acknowledgements
We thank Astrid Iversen and Andrew McMichael for helpful comments and
discussion. We also thank Fred Bibollet-Ruche, Kathryn Else, Kathryn
Hentges and Katie Finegan for critical reading of the manuscript. JA and
SGW were supported by BBSRC and EPSRC studentships, respectively.
References
1. Gao, F., et al., Unselected mutations in the human immunodeficiency virus type 1 genome are mostly nonsynonymous and often deleterious. J Virol, 2004. 78(5): p. 2426-33.
2. Robertson, D.L., B.H. Hahn, and P.M. Sharp, Recombination in AIDS viruses. J Mol Evol, 1995. 40(3): p. 249-59.
3. Jetzt, A.E., et al., High rate of recombination throughout the human immunodeficiency virus type 1 genome. J Virol, 2000. 74(3): p. 1234-40.
4. Ho, D.D., et al., Rapid turnover of plasma virions and CD4 lymphocytes in HIV-1 infection. Nature, 1995. 373(6510): p. 123-6.
5. Wei, X., et al., Viral dynamics in human immunodeficiency virus type 1 infection. Nature, 1995. 373(6510): p. 117-22.
6. Wolinsky, S.M., et al., Adaptive evolution of human immunodeficiency virus-type 1 during the natural course of infection. Science, 1996. 272(5261): p. 537-42.
7. Yang, W., J.P. Bielawski, and Z. Yang, Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. J Mol Evol, 2003. 57(2): p. 212-21.
8. Choisy, M., et al., Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol, 2004. 78(4): p. 1962-70.
9. Ciurea, A., et al., CD4+ T-cell-epitope escape mutant virus selected in vivo. Nat Med, 2001. 7(7): p. 795-800.
10. Draenert, R., et al., Immune selection for altered antigen processing leads to cytotoxic T lymphocyte escape in chronic HIV-1 infection. J Exp Med, 2004. 199(7): p. 905-15.
11. Li, Y., et al., Broad HIV-1 neutralization mediated by CD4-binding site antibodies. Nat Med, 2007. 13(9): p. 1032-4.
12. Gaschen, B., et al., Diversity considerations in HIV-1 vaccine selection. Science, 2002. 296(5577): p. 2354-60.
13. Nickle, D.C., et al., Consensus and ancestral state HIV vaccines. Science, 2003. 299(5612): p. 1515-8; author reply 1515-8.
14. Letourneau, S., et al., Design and Pre-Clinical Evaluation of a Universal HIV-1 Vaccine. PLoS ONE, 2007. 2(10): p. e984.
15. Catanzaro, A.T., et al., Phase 1 safety and immunogenicity evaluation of a multiclade HIV-1 candidate vaccine delivered by a replication-defective recombinant adenovirus vector. J Infect Dis, 2006. 194(12): p. 1638-49.
16. Fischer, W., et al., Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nat Med, 2007. 13(1): p. 100-6.
17. Nickle, D.C., et al., Coping with viral diversity in HIV vaccine design. PLoS Comput Biol, 2007. 3(4): p. e75.
18. De Groot, A.S., et al., HIV vaccine development by computer assisted design: the GAIA vaccine. Vaccine, 2005. 23(17-18): p. 2136-48.
19. De Groot, A.S., et al., Engineering immunogenic consensus T helper epitopes for a cross-clade HIV vaccine. Methods, 2004. 34(4): p. 476-87.
20. Fiorentini, S., et al., Functions of the HIV-1 matrix protein p17. New Microbiol, 2006. 29(1): p. 1-10.
21. Iversen, A.K., et al., Conflicting selective forces affect T cell receptor contacts in an immunodominant human immunodeficiency virus epitope. Nat Immunol, 2006. 7(2): p. 179-89.
22. McMichael, A.J., HIV vaccines. Annu Rev Immunol, 2006. 24: p. 227-55.
23. Frahm N, L.C., Brander C, Identification of HIV-Derived, HLA Class I Restricted CTL Epitopes: Insights into TCR Repertoire, CTL Escape and Viral Fitness, in HIV Sequence Compendium, F.B. Thomas Leitner, Hahn B, Marx P, McCutchan F, Mellors J, Wolinsky S, Korber B, Editor. 2007, Theoretical Biology and Biophysics Group, Los Alamos National Laboratory: Los Alamos.
24. Chelliah, V., et al., Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol, 2004. 342(5): p. 1487-504.
25. Overington, J., et al., Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc R Soc Lond B Biol Sci, 1990. 241(1301): p. 132-45.
26. DePristo, M.A., et al., Ab initio construction of polypeptide fragments: efficient generation of accurate, representative ensembles. Proteins, 2003. 51(1): p. 41-55.
27. de Bakker, P.I., et al., Conformer generation under restraints. Curr Opin Struct Biol, 2006. 16(2): p. 160-5.
28. Kent, S.J., et al., Reversion of immune escape HIV variants upon transmission: insights into effective viral immunity. Trends Microbiol, 2005. 13(6): p. 243-6.
29. Allen, T.M., et al., Selection, transmission, and reversion of an antigen-processing cytotoxic T-lymphocyte escape mutation in human immunodeficiency virus type 1 infection. J Virol, 2004. 78(13): p. 7069-78.
30. Peyerl, F.W., et al., Fitness costs limit viral escape from cytotoxic T lymphocytes at a structurally constrained epitope. J Virol, 2004. 78(24): p. 13901-10.
31. Crawford, H., et al., Compensatory mutation partially restores fitness and delays reversion of escape mutation within the immunodominant HLA-B*5703-restricted Gag epitope in chronic human immunodeficiency virus type 1 infection. J Virol, 2007. 81(15): p. 8346-51.
32. Frahm, N., et al., Increased sequence diversity coverage improves detection of HIV-specific T cell responses. J Immunol, 2007. 179(10): p. 6638-50.
33. Rolland, M., D.C. Nickle, and J.I. Mullins, HIV-1 group M conserved elements vaccine. PLoS Pathog, 2007. 3(11): p. e157.
34. Welsh, R.M. and R.S. Fujinami, Pathogenic epitopes, heterologous immunity and vaccine design. Nat Rev Microbiol, 2007. 5(7): p. 555-63.
35. Berman, H.M., et al., The Protein Data Bank. Nucleic Acids Research, 2000. 28(1): p. 235-242.
36. Hill, C.P., et al., Crystal structures of the trimeric human immunodeficiency virus type 1 matrix protein: implications for membrane association and assembly. Proc Natl Acad Sci U S A, 1996. 93(7): p. 3099-104.
37. Word, J.M., et al., Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J Mol Biol, 1999. 285: p. 1735-1747.
38. Richardson, D.C. and J.S. Richardson, The kinemage: a tool for scientific illustration. Protein Science, 1992. 1: p. 3-9.
39. Word, J.M., et al., Exploring steric constraints on protein mutations using MAGE/PROBE. Protein Science, 2000. 9: p. 2251-2259.
40. Lovell, S.C., et al., The penultimate rotamer library. Proteins: Structure, Function and Genetics, 2000. 40: p. 389-408.
41. Word, J.M., et al., Visualizing and quantifying molecular goodness of fit: small-probe contact dots with explicit hydrogen atoms. J Mol Biol, 1999. 285: p. 1711-1733.
42. Connolly, M.L., Solvent-Accessible Surfaces of Proteins and Nucleic-Acids. Science, 1983. 221: p. 709-713.
43. Dayhoff, M.O., W.C. Barker, and L.T. Hunt, Establishing homologies in protein sequences. Methods Enzymol, 1983. 91: p. 524-45.
44. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.
45. Fischer, W., et al., Coping with viral diversity in HIV vaccine design: a response to Nickle et al. PLoS Comput Biol, 2008. 4(1): p. e15; author reply e25.
Chapter 7: Detection of Low Frequency CXCR4-Using HIV-1 with Ultra-deep Pyrosequencing
Abstract
Pyrosequencing produces unprecedented quantities of genomic sequence
data that can be used for ultra-deep monitoring of HIV’s intra-patient viral
population. Identification of low-frequency variants is of particular
importance for identifying potential pre-existing drug resistance.
Here we address the detection of potential resistance to the CCR5
antagonist maraviroc due to the pre-treatment presence of CXCR4-using
virus. To do this we present a novel protocol, implemented using the Java
programming language, for the management of pyrosequencing data and
genotypic-identification of probable CXCR4-using virus. This includes
alignment without the use of an external reference sequence (critically
important due to the high levels of variation in the variable regions),
translation of reads while maintaining inter-read alignment, determination of
per site diversity and detection of the specific amino acids in V3 associated
with phenotypic variation. We apply the protocol to two extremely large 454
data sets containing 105,000 (pre-treatment, day 1) and 192,000 (post-
treatment, day 11) reads from HIV-1’s envelope [1] in order to determine the
phenotypes present within the intra-patient population. We use the charge
rule, PSSM and geno2pheno to determine phenotype based on sequence
variation. CXCR4-using virus can be detected using all tests at a very low
frequency prior to maraviroc treatment. We then re-construct phylogenetic
trees from reads that span the V3 region. This permits a detailed
visualisation of ultra-deep intra-patient evolution through time involving 9,000
sequences (corresponding to 1,800 unique V3 variants). This phylogeny
confirms the CXCR4-using viruses at day 11 most probably emerged from
pre-existing day 1 viruses, and have not evolved directly from CCR5-using
virus.
Introduction
During the early nineteen nineties it was observed that HIV-1 viruses could
be characterized into two phenotypes referred to as syncytium-inducing (SI)
and non-syncytium-inducing (NSI) [2]. These phenotypes have different
cellular tropisms due to differences in co-receptor usage [3] and appear
during different stages of infection [4]. The macrophage-tropic NSI
phenotype requires the CCR5 co-receptor [5] (referred to as R5) and is
predominant during the early stages of infection [6], while the T-cell-tropic SI
phenotype uses the CXCR4 co-receptor (referred to as X4) [7] and often
emerges later, around the time of progression to AIDS [8]. The
early dominance of the CCR5-using phenotype may be due to a number of
factors including selection at the point of transmission [9-11], a higher
cytopathicity of SI variants resulting in host cells with a shorter life span [12]
as well as differing fitness levels of the two variants at different stages of
disease progression [13].
Co-receptor usage can be tested for with an experimental assay [14] or
detected by computational analysis based on specific amino acid changes
within the V3 loop of the gp120 gene [15-18]. Amino acid sequence
variations giving rise to a more positive charge within the gp120 gene at
sites 306 and 320 (sites 11 and 25 of the V3 loop) have been strongly linked
with the more virulent CXCR4-using viruses [15-18]. Site 319 has also been
observed to contribute significantly [19]. These sites oppose each other
within the beta-hairpin structure and when all negatively charged residues
are present the electrostatic interaction with the CCR5 co-receptor is thought
to be stabilized [19, 20]. These sites can be used to predict viral phenotype
with up to 94% accuracy [19]. There are other less well understood
interactions between sites within the V3 loop where charge contributes
significantly to the presence of a particular phenotype (Figure 7.1). Both
Web PSSM [21] and geno2pheno [22] are successful genotyping algorithms
that attempt to account for more subtle changes that can take place [23].
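The simplest form of the charge rule can be illustrated in code. This sketch assumes an ungapped 35-residue V3 loop, treats only lysine (K) and arginine (R) as positively charged, and calls a sequence CXCR4-using if either site 11 or site 25 carries such a residue; the function name and the example sequence are illustrative, and the sketch does not reproduce PSSM or geno2pheno.

```python
POSITIVE = set("KR")  # assumed definition of "positively charged"

def charge_rule_predicts_cxcr4(v3):
    """Apply the site 11/25 charge rule to an ungapped 35-residue V3 loop."""
    if len(v3) != 35:
        raise ValueError("expected an ungapped 35-residue V3 loop")
    # Sites 11 and 25 of V3 are indices 10 and 24 (zero-based).
    return v3[10] in POSITIVE or v3[24] in POSITIVE

example_r5 = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"   # S at site 11, E at site 25
example_x4 = example_r5[:10] + "R" + example_r5[11:]  # S to R at site 11
```

Here `charge_rule_predicts_cxcr4(example_r5)` is false and `charge_rule_predicts_cxcr4(example_x4)` is true.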
Figure 7.1 Sequence logos of the CCR5 and CXCR4-using viruses
All V3 sequences were extracted from the HIV Los Alamos Sequence database and aligned
using ClustalW. Full arrowed lines represent sites used in the charge rule where positively
charged residues are only found within the CXCR4-using phenotype. Dashed arrowed lines
represent additional sites where positively charged residues were observed with the
CXCR4-using phenotype but not in the CCR5-using phenotype.
Recently Pfizer has developed a small-molecule drug, maraviroc, that binds
to the CCR5 receptor making it unavailable for HIV-1 cell entry [24, 25]. Co-
receptor blockage is a novel way of attempting to control the progression of
HIV within the host (Figure 7.2). Targeting the CCR5 co-receptor as with the
natural polymorphism (CCR5Δ32) has few deleterious effects [26].
Individuals that are heterozygous for this mutation were found to have a
slower disease progression to AIDS while individuals that are homozygous
for the mutation show strong resistance to HIV-1 [26-30].
Treatment with maraviroc effectively mimics the CCR5Δ32 phenotype within
an individual. However tests indicate that in patients with treatment failure
the presence of CXCR4-using variants prior to treatment is predictive of this
outcome [1] as such variants will be strongly positively selected for. It is
therefore vitally important to detect low frequency CXCR4-using variants
before treatment. The detection of low frequency variants is also a more
generic problem that is associated with determining the success of other
antiretroviral drug treatment regimes [31, 32]. With traditional Sanger
sequencing the reliable detection of such variants is not practical [33].
Figure 7.2 Cell Entry Inhibition
A co-receptor antagonist such as maraviroc (red circle) blocking HIV cell entry.
“Ultra-deep” [34, 35] sequencing permits the quantification of the range of
sequence variants present within an HIV-1 sample [36]. Pyrosequencing
technology produces very high numbers of short sequence fragments that
can increase the sensitivity of the detection of low frequency variants but
which introduces unique problems for the computational analysis of
sequence data [37]. The process differs fundamentally from the now
traditional Sanger sequencing chemistry. For example in the Roche (454)
GS FLX system [34, 35] individual bases are not read directly. Instead the
lengths of homopolymeric runs are determined for each base at each
position of the sequence in a predetermined cyclic order. This produces 4
flowgrams, one corresponding to each base, that provide information about
which base is present at individual sites (Figure 7.3). The signal intensity
produced at individual positions on each of the four flowgrams is then
rounded to the nearest integer and used to determine how many bases of a
particular type must be inserted.
Figure 7.3 Generation of a Flowgram
Within an individual well bases are supplied in a specific flow order. Four flows,
corresponding to the four bases, are referred to as a complete cycle. During a single flow if
the specific base type is incorporated into the growing sequence a detectable light signal is
emitted as described in the text. Multiple cycles are run with the end result being a
flowgram that can be interpreted as a nucleotide sequence.
For example the results of two complete cycles could be T:1.8, A:0.1, C:0.9,
G:0.1, T:1.6, A:0.0, C:0.4, G:1.0 [36, 38] (Figure 7.3). This would
correspond to the sequence TTCTTG. Ambiguities in signal intensity
however can lead to errors [39]. The closer the signal is to the half way
mark between two integers (e.g. T:1.6 is rounded to TT but if it was T:1.4 it
would just be T) the more likely a read error will occur in the form of either an
insertion or deletion (indel). As a result the majority of errors are due to
overcalls and undercalls [36, 38]. Noise that reduces the quality of the
emitted signal, and so increases such errors, can be caused by
signal contamination from nearby wells, multiple templates on a bead and a
loss of synchrony [36, 38]. So far the error rates within the data produced by
the Roche (454) GS FLX system are poorly characterized.
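The rounding step described above can be sketched in a few lines of Python. This is only an illustration and not the base-calling software of the GS FLX system: the cyclic flow order T, A, C, G is taken from the worked example, and the function names decode_flowgram and ambiguity are hypothetical.

```python
FLOW_ORDER = "TACG"  # assumed cyclic flow order, taken from the worked example


def decode_flowgram(intensities, flow_order=FLOW_ORDER):
    """Convert a list of flow signal intensities into a nucleotide string."""
    bases = []
    for flow_index, signal in enumerate(intensities):
        base = flow_order[flow_index % len(flow_order)]
        run_length = int(signal + 0.5)  # round half up to a homopolymer length
        bases.append(base * run_length)
    return "".join(bases)


def ambiguity(signal):
    """Distance of a signal from the nearest half-way mark (0 = most ambiguous)."""
    return abs((signal % 1.0) - 0.5)


flows = [1.8, 0.1, 0.9, 0.1, 1.6, 0.0, 0.4, 1.0]
print(decode_flowgram(flows))  # TTCTTG
```

A signal of exactly 1.5 gives an ambiguity of 0, which is the situation in which an overcall or undercall is most likely.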
In a recent study by Wang et al. [36], where 6827 HIV-1 reads were
analyzed (8% were removed due to low sequence identity with the
templates), a mean error rate of 0.98% was observed within pyrosequenced
data produced according to [35]. This was subdivided into insertion rates
(0.73%), deletion rates (0.16%) and mismatches (0.12%). It was further
observed that the error rate within homopolymeric regions of size three or
greater was 6.2 times higher than outside of these regions (0.07%). Due to
the large amounts of data produced however such error rates can potentially
have an effect on the results of an analysis and further efforts are required in
order to incorporate them into the processing of the 454 data.
Pyrosequencing technology can be used to track the emergence or
presence of low frequency viral variants early on during an HIV infection [40,
41]. Automated protocols are necessary due to the huge volume of data
generated. Despite this there are currently no user friendly desktop
applications available for handling and analyzing the large numbers of short
reads produced [42]. Here we present a prototype of a protocol for the generic
analysis of viral population pyrosequenced data (Fig. 7.4). We demonstrate
the use of the prototype by applying it to two extremely large 454 data sets
containing 105,000 (pre-treatment, day 1) and 192,000 (post-treatment, day
11) reads from HIV-1’s envelope region [24] and use it in conjunction with
the charge rule, PSSM and geno2pheno to identify the phenotypes present
within a single patient at the two individual time points. We also compare the
indel rates within our data to the indel errors calculated by Wang et al. [36].
This unprecedented dataset is a definitive snapshot of intra patient sequence
diversity, in particular the identification of low frequency variants related to
co-receptor usage.
Figure 7.4 Protocol for Handling Pyrosequenced Data
Protocol for the handling of data produced by the 454 Life Science pyrosequencer. Green
circles represent steps that can be automated while orange circles represent steps that
potentially require external programs such as alignment tools and phenotype tests. The red
line represents the consensus template sequence generated from the individual templates
which are constructed from the input reads. In step 5 the yellow dots represent gaps
inserted into the reads during the alignment process while the purple dots represent
insertions that have been removed from the reads during processing. The sequence logo
on the bottom left hand side is a representation of the residue composition of sites 11, 24
and 25 for sequences representing each of the CCR5 and CXCR4-using phenotypes (taken
from Fig. 7.1). The webpage is the interface for the PSSM test.
Methods
Datasets
Pyrosequencing data: Two pyrosequencing datasets from patient A [24]
were provided by 454 Life Sciences. gp160 amplicons of the full-length
envelope gene from two plasma samples (days 1 and 11) were randomly
fragmented and sequenced using a 200 nucleotide average read length
protocol to a depth of greater than 10,000 reads. In total 104,628 and
191,637 nucleotide sequence reads were generated for the day 1 and day
11 sample, respectively. Patient A was infected with a variant classified as
subtype B and was one of two patients in whom CXCR4-using viruses
became dominant post maraviroc treatment. 11 clones from day 1 and 12
from day 11 were also available [24].
Database gp160 sequences: All near full-length gp160 subtype B amino acid
sequences were downloaded from the Los Alamos HIV sequence database.
After removal of sequences with identical titles this left 1,986 for analysis.
These were aligned using Muscle [43] following which columns containing
more than 90% gaps were removed.
Day 1 and Day 11 datasets
A general overview of the protocol developed to handle the pyrosequenced
data is presented in Figure 7.4. The steps of the protocol are:
Template Construction: Due to the high level of divergence between HIV-1
sequences, particularly in the variable regions of the envelope, a data-
specific reference sequence was constructed from each of the two
pyrosequenced datasets (Figure 7.4). Use of a consensus template
minimized the number of insertions removed from the reads during the
subsequent pairwise alignment steps. The construction of a data-specific
reference sequence involved the generation of a population consensus
sequence based on 100 template sequences constructed from concatenated
reads. These templates were each constructed from overlapping high-
identity reads. The parameters used were: maximum overlap (200
nucleotides), minimum overlap (100 nucleotides), mismatches (4) and final
sequence length (2,500 nucleotides). The mismatches are the number of
non-identical sites in the overlapping regions; permitting up to 4 ensured that
the concatenated sequences spanned most of gp160. These parameter values
were based on the average read length and approximate length of gp160
(Figure 7.5).
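The concatenation of overlapping reads into a template can be illustrated with a short Python sketch. It is a simplified greedy version written under stated assumptions: the text does not specify how the seed read or the next read is chosen, so this sketch takes the first read that overlaps and extends the template. The function names overlap_length and build_template are hypothetical; the default parameter values are those listed above.

```python
def overlap_length(template, read, min_overlap=100, max_overlap=200,
                   max_mismatches=4):
    """Longest overlap between the template's 3' end and the read's 5' end
    that stays within the mismatch limit; 0 if there is none."""
    longest = min(max_overlap, len(template), len(read))
    for k in range(longest, min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(template[-k:], read[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def build_template(seed, reads, final_length=2500, **overlap_params):
    """Greedily concatenate overlapping reads onto a seed read (assumed strategy)."""
    template = seed
    remaining = list(reads)
    extended = True
    while len(template) < final_length and extended:
        extended = False
        for i, read in enumerate(remaining):
            k = overlap_length(template, read, **overlap_params)
            if k and len(read) > k:
                template += read[k:]  # append only the non-overlapping tail
                del remaining[i]
                extended = True
                break
    return template[:final_length]


# Toy example with short reads and small parameter values
toy = build_template("ACGTAC", ["GGATTT", "TACGGA"], final_length=12,
                     min_overlap=3, max_overlap=5, max_mismatches=0)
print(toy)  # ACGTACGGATTT
```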
In total 150 concatenated template sequences were constructed for each of
the two datasets. As the reads are in both directions, the concatenated
sequences not in the direction 5’ to 3’ (74 and 86 for days 1 and 11,
respectively) were reverse complemented using HXB2 as a reference. The
consensus template sequence was constructed from the 100 of these that
maximised the coverage of gp160. These were aligned with Muscle [43].
Sites containing >90% gaps were removed from the alignment and a
majority-rule consensus sequence constructed.
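The gap filtering and majority-rule step can be sketched as follows. This is a minimal illustration, assuming that the aligned templates are equal-length strings with "-" as the gap character and that the majority is taken over the non-gap characters of each retained column; the function name majority_consensus is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned, max_gap_fraction=0.9, gap="-"):
    """Majority-rule consensus of equal-length aligned sequences."""
    consensus = []
    for column in zip(*aligned):
        gap_fraction = column.count(gap) / len(column)
        if gap_fraction > max_gap_fraction:
            continue  # drop columns that are almost entirely gaps
        residues = Counter(c for c in column if c != gap)
        consensus.append(residues.most_common(1)[0][0])
    return "".join(consensus)


alignment = ["AC-T", "AG-T", "AC-T", "TC-A"]
print(majority_consensus(alignment))  # ACT
```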
Pairwise Alignments: All reads were then aligned in a pairwise manner to the
consensus template sequence using the Smith-Waterman algorithm [44] with
the parameters: gap opening penalty (-4), gap extension penalty (-2), match
(+2), transition (-1) and transversion (-2). To maintain site compatibility
between reads and ensure the removal of all pyrosequencing based errors
that cause deleterious frame shifts in relation to the consensus template
sequence any insertions within the reads in relation to the template
sequence were removed. A more precise method of dealing with these
insertions is described under Future Developments. The length and
frequencies of the insertions (along with any deletions) were recorded.
During the pairwise alignment any reads producing an exceptionally poor
alignment (Figure 7.5) were removed from the dataset.
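The scoring scheme used for the pairwise alignments can be made concrete with a small score-only Smith-Waterman sketch. It is not the implementation used in the protocol. Two assumptions are made here: a gap of length L is charged the opening penalty once plus the extension penalty for each further position (-4 - 2(L - 1)), and the raw local alignment score is returned without the normalisation that underlies the scores plotted in Figure 7.5. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=2, transition=-1, transversion=-2):
    """Score a pair of nucleotides using the parameters given in the text."""
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def smith_waterman_score(read, template, gap_open=-4, gap_extend=-2):
    """Best local alignment score with affine gaps (assumed gap model)."""
    neg_inf = float("-inf")
    m = len(template)
    h_prev = [0] * (m + 1)        # best score ending at the previous row
    f_prev = [neg_inf] * (m + 1)  # gap-in-template state for the previous row
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # gap-in-read state along the current row
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            diag = h_prev[j - 1] + substitution_score(read[i - 1], template[j - 1])
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


print(smith_waterman_score("ACGT", "ACGT"))           # 8
print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))  # 12
```

In the second call the read lacks one base relative to the template, so the best alignment pays a single gap opening penalty (16 - 4 = 12).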
Translations: Each of the aligned reads was then translated into all three
reading frames and the highest scoring translation after realignment with the
corresponding region on the translated template sequence selected.
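The choice of reading frame can be illustrated with the sketch below. The standard genetic code is used for translation. The scoring is a deliberate simplification: instead of realigning each translation to the translated template, the frame whose translation shares the longest common substring with the template protein is selected. The function names translate and best_frame, and the toy template protein, are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(nucleotides, frame=0):
    """Translate from the given frame; unknown or gapped codons become X."""
    codons = (nucleotides[i:i + 3]
              for i in range(frame, len(nucleotides) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def best_frame(read, template_protein):
    """Return (frame, translation) for the frame that best matches the template."""
    def score(frame):
        protein = translate(read, frame)
        matcher = SequenceMatcher(None, protein, template_protein, autojunk=False)
        match = matcher.find_longest_match(0, len(protein), 0, len(template_protein))
        return match.size
    frame = max(range(3), key=score)
    return frame, translate(read, frame)


print(best_frame("ACGTATTCATATTGGTCCTGGT", "MKRIHIGPGRAFY"))  # (1, 'RIHIGPG')
```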
Detection of CXCR4-using phenotype: Genotypic detection of CXCR4-using
virus in the V3 region was carried out on all pyrosequenced reads for the two
datasets:
(1) Partial V3 reads: Sites 11, 24 and 25 are predictive of CXCR4-using
variants [19, 20] so were extracted from the datasets maintaining site
compatibility between reads and used for the charge rule test. In addition,
sites 7, 8, 22 and 30 were used as these are also predictive of co-receptor
usage (Figure 7.1). For a read to be designated CXCR4-using a positive charge
had to be present in at least one of these sites; for a read to be designated
CCR5-using no positive charge could be present. Thus, if only two sites were
present and found to be negatively charged the phenotype could still not be
guaranteed as being CCR5-using as the remaining unrepresented sites
could contain a positively charged residue. For consistency the charge at all
sites had to be determined.
(2) Complete V3 reads: Translated reads that span the entire V3 region were
also extracted from the datasets. The phenotype of the reads was predicted
using the charge rule as with the partial fragments, PSSM [21] and geno2pheno
[22]. Fewer sequences could be tested in this manner due to the
requirement of full length V3 regions as opposed to the individual site
requirement above.
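The charge rule test, with its requirement that every site be observed, can be written as a small function. This is a sketch under stated assumptions: residues are supplied as a mapping from 1-based V3 site number to amino acid, and histidine, lysine and arginine are treated as the positively charged residues. The function name charge_rule is hypothetical.

```python
POSITIVE = set("KRH")  # lysine, arginine, histidine


def charge_rule(v3_residues, sites=(11, 24, 25)):
    """Classify a read from the residues observed at the given V3 sites.

    Returns None when any site is missing, because the phenotype cannot
    then be assigned with certainty.
    """
    observed = [v3_residues.get(site) for site in sites]
    if any(residue is None for residue in observed):
        return None
    if any(residue in POSITIVE for residue in observed):
        return "CXCR4-using"
    return "CCR5-using"


print(charge_rule({11: "S", 24: "G", 25: "E"}))  # CCR5-using
print(charge_rule({11: "R", 24: "G", 25: "E"}))  # CXCR4-using
print(charge_rule({11: "S", 25: "E"}))           # None
```

The extended test corresponds to calling the same function with sites=(7, 8, 11, 22, 24, 25, 30).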
Phylogenetic analysis: The corresponding nucleotide sequences for the
complete V3 regions were aligned using Muscle. PhyML [45] was used to
re-construct maximum likelihood trees using the HKY model of nucleotide
substitution with the ratio between transversions to transitions being set to 2.
Database entropy and amino acid usage
The per site Shannon entropy was calculated using the Shannon entropy
formula as described on the Los Alamos HIV database website from the
sequences that were extracted from the database. Site compatibility was
maintained with the templates constructed for the day 1 and day 11 data by
aligning the consensus template sequences with the database data and
only maintaining columns that did not have gaps for these sequences.
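For reference, the per site Shannon entropy is H = -sum(p_i * log(p_i)) over the residues i observed at a site, where p_i is the frequency of residue i. A minimal sketch follows. The use of the natural logarithm and the exclusion of gap characters are assumptions made for this illustration, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def site_entropy(column, gap="-"):
    """Shannon entropy (natural log, assumed) of one alignment column."""
    residues = [c for c in column if c != gap]
    total = len(residues)
    if total == 0:
        return 0.0
    counts = Counter(residues)
    return sum(-(n / total) * log(n / total) for n in counts.values())


def alignment_entropy(aligned):
    """Entropy at every column of a list of equal-length aligned sequences."""
    return [site_entropy(column) for column in zip(*aligned)]


# A fully conserved site has entropy 0; an even two-way split gives log(2)
print(alignment_entropy(["AC", "AG"]))
```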
Results
The majority of pyrosequenced reads aligned to the template sequence with
high alignment scores, >0.8, for both datasets (Figure 7.5 B and D). The
peak of low quality pairwise alignments, <0.2, mainly corresponds to longer
sequence fragments (>250 nucleotides). Low identity reads, <0.6, were
excluded from further analysis leaving 94,029 and 177,459 reads,
corresponding to 18,944,164 and 35,724,117 bases, for days 1 and 11
datasets respectively. This was a removal of 10.1 and 7.4% of the data
respectively. These numbers of low identity reads are similar to those
observed in Wang et al. [36] where a first generation pyrosequencer (454
Life Science GS20) was used. From Figure 7.5, Panels A and C, it can be
observed that the majority of reads within these lower peaks are greater than
250 nucleotides in length.
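The filtering step and the percentages quoted above can be checked with a short sketch. The read counts are those given in the Methods and Results; the representation of scored reads as (read, score) pairs and the function names are assumptions made for this illustration.

```python
def filter_reads(scored_reads, min_score=0.6):
    """Keep reads whose alignment score is at least the threshold."""
    return [read for read, score in scored_reads if score >= min_score]


def percent_removed(reads_before, reads_after):
    """Percentage of reads discarded by the filter."""
    return 100.0 * (reads_before - reads_after) / reads_before


print(round(percent_removed(104628, 94029), 1))   # 10.1 (day 1)
print(round(percent_removed(191637, 177459), 1))  # 7.4 (day 11)
```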
Figure 7.5 Segment Length, Alignment Score and Frequency
(A and C) The relationship between read length and score for day 1 and day 11 data
respectively, (B and D) the relationship between the score and frequency of occurrence for
both datasets respectively. The blue shaded area represents reads that showed good
sequence identity with localized regions of the template. The red shaded area represents
reads that had very little or no sequence identity with the template sequence.
The coverage obtained across the gp160 region for the day 1 and day 11
data is displayed in Figure 7.6. All regions of gp160 are well represented.
Note, the coverage peak towards the 3’ end of gp160 is an artefact of the
sequencing process that has been previously observed [46].
Figure 7.6 Nucleotide Coverage Across gp160
Coverage provided across the gp160 region by the day 1 (blue) and day 11 (red) datasets
after sequences of low identity have been removed. Genomic regions are depicted along
the x-axis. Conserved regions are shaded black while the variable regions are shaded grey.
The inset displays the frequency of occurrence of individual read lengths.
Indel frequencies in relation to the consensus template are described in Table
7.1. These are the frequencies after low identity sequences have been
removed. The mean insertion frequency for the day 1 dataset was
0.8549% while the mean deletion frequency was 0.4796%. For the day 11
dataset these numbers were 1.2107 and 0.7423% respectively. The
removal of indels using the reference sequence as a guide conservatively
maintains the correct reading frames within individual reads. However
potentially viable indels that occur will also be removed during this process
and thus a small proportion of the data is lost. If viable these would be in
multiples of 3 nucleotides corresponding to codon lengths. For insertions
and deletions these constituted 0.0054 and 0.00057% for day 1 and 0.0995
and 0.0763% for day 11 respectively (highlighted green in table). When these
are removed from the analysis the potential error frequencies become
0.8495 and 0.4790% for day 1 and 1.1112 and 0.6659% for day 11
respectively.
Table 7.1 Rates of Insertion and Deletion
Frequencies of the length of indel regions stored during the pairwise alignment step of the
protocol displayed in Figure 7.4.
In order to display per site diversity, Shannon entropy was calculated for
each site across gp160 for the two datasets and compared to subtype B
sequences from the Los Alamos HIV sequence database (Figure 7.7). On
day 1 the mean entropy across the gp120 region is 0.2 with the mean for the
variable regions being 0.32 and for the conserved regions being 0.18. For
gp41 the mean entropy is 0.13. On day 11 the mean entropy across gp120
is 0.18 with the mean for the variable regions being 0.26 and for the
conserved regions being 0.15. For gp41 the mean entropy is 0.14. The entropy
values at individual sites 7, 8, 11, 22, 24, 25, and 30 within the V3 loop are
highlighted in the figure (panels B and C).
Figure 7.7 Shannon Entropy Across gp160
Entropy plot for (A) the subtype B sequences extracted from the HIV Los Alamos sequence
database, (B) the day 1 dataset, (C) the day 11 dataset and (D) the difference between the
day 1 and day 11 datasets. In (D) the shaded grey region covers 90% of the data. Various
locations within the gp160 gene are displayed along the x-axis.
The outcomes of the phenotype tests used on the day 1 dataset are displayed
in Table 7.2 Panel A. The charge rule only required the presence of sites
11, 24 and 25 within the V3 and so many more partial V3 reads could be
tested (5986) when compared to the PSSM test and geno2pheno (3384) as
both required complete V3 regions. Of the 5986 reads only 18 reads
(0.30%) of the CXCR4-using phenotype were detected using the complete
charge rule.
Table 7.2 HIV-1 Phenotype Counts
Phenotype counts for day 1 (A) and day 11 (B) data using the three tests described in the
text.
Of the 3384 full-length V3 reads in the day 1 data set, CCR5-using strains
were the predominant phenotype. Only 7 reads were classified according to
PSSM as being CXCR4-using. Geno2pheno classified 33 reads as CXCR4-
using. 11 CXCR4-using reads were observed using the charge rule on
these full-length V3 reads. In this case the number of potentially falsely
predicted CXCR4-using reads using the charge rule drops to 4 (fewer total).
For the day 11 dataset 12,086 V3 fragments were available for testing using
the charge rule, while 6687 complete V3 reads were available for
geno2pheno and PSSM (Table 7.2 Panel B). Of the full-length V3 reads, 5383
were classified as CXCR4-using according to PSSM while 5479 were
classified as CXCR4-using according to geno2pheno. 5,423 of the complete
V3 reads were classified as CXCR4-using according to the charge rule.
When the charge rule was run on the 12,086 reads containing both full and
partial V3 fragments 9780 could be classified as CXCR4-using. Individually all
three tests displayed very little difference between the percentages of the
CXCR4-using phenotype constituting the dataset.
For both datasets when the charge rule is performed with the inclusion of
sites 7, 8, 22 and 30 very little alteration of the ratios between the
CXCR4-using and CCR5-using viruses was observed (Table 7.2 Panels A and B).
The extra positive charge detected at these sites within the CXCR4-using
viruses did not seem to play a significant role in determining CXCR4-usage.
In relation to the charge rule for individual sites, and for complete V3 regions,
the presence of a residue at each of the sites was a requirement. This is
necessary in order to reduce the number of reads being falsely classified as
CCR5-using. However in relation to the detection of the CXCR4-using
phenotype every residue does not need to be present. This is because as
long as a single positive charge exists at any of the sites the CXCR4-using
viruses can be determined.
Table 7.3 Panels A and B is an updated version of Table 7.2 where the
CXCR4-using phenotype is determined based on a positive charge being
present at any one of the sites 11, 24 and 25 regardless of whether one or
more residues are missing. The loss of the condition that a residue has to
be present at each of the sites tested allows for a greater number of reads to
be used.
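The relaxed test can be sketched as a variant of the charge rule in which a single observed positive charge is sufficient for a CXCR4-using call, while a CCR5-using call still requires every site to be present. As before, the mapping from V3 site number to residue, the treatment of histidine, lysine and arginine as positively charged, and the function name relaxed_charge_rule are assumptions made for this illustration.

```python
POSITIVE = set("KRH")  # lysine, arginine, histidine


def relaxed_charge_rule(v3_residues, sites=(11, 24, 25)):
    """Charge rule that tolerates missing sites when calling CXCR4-using."""
    observed = [v3_residues.get(site) for site in sites]
    if any(residue in POSITIVE for residue in observed if residue is not None):
        return "CXCR4-using"  # one positive charge is enough
    if all(residue is not None for residue in observed):
        return "CCR5-using"   # every site seen and none is positive
    return None               # undetermined: a missing site could be positive


print(relaxed_charge_rule({11: "R"}))           # CXCR4-using
print(relaxed_charge_rule({11: "S", 24: "G"}))  # None
```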
Table 7.3 Alternate Phenotype Counts
CXCR4-using phenotype counts for day 1 (A) and day 11 (B) data when a single positive
charge is present at individual sites regardless of whether residues are present at all other
sites or not.
Figure 7.8 displays a maximum likelihood tree consisting of all the V3
segment data from days 1 and 11. In total there are 1799 unique fragments
representing 10,071 variants (3384 from day 1 and 6687 from day 11). The 23
cloned sequences described in Westby et al. [1] from both days have been
truncated and added to the alignment (not shown). It was observed that
the cloned sequences make up only a small proportion of the diversity that is
present on the tree. Clusters of the day 1 CCR5-using (green), day 11
CXCR4-using (orange) and day 11 CCR5-using (blue) V3 reads appear to
be strong with a couple of exceptions.
Figure 7.8 Phylogenetic Tree of Day 1 and Day 11 V3 Segment Data
(A) Maximum likelihood tree containing all the full length V3 nucleotide reads isolated from
both the day 1 and day 11 datasets. Phenotype classifications (based on both PSSM and
the charge rule) are represented according to the colours described in the figure key. (B) Identical tree to that of (A). Colours describe the number of variants present at the tips
instead of phenotype.
Day 1 CCR5-using strains are not present within any of the day 11 clusters.
In contrast within the main day 11 CXCR4-using cluster three day 1 CXCR4-
using variants can be observed to be located near the centre of the tree.
Considering that there were a maximum of 11 day 1 CXCR4-using variants
in total this is a substantial number of day 1 CXCR4-using variants to be
located within the day 11 CXCR4-using cluster. This indicates that, following
drug treatment, the day 11 CXCR4-using variants may have arisen from
CXCR4-using variants that were present at very low frequency within the
population at an earlier time point.
Discussion
We have demonstrated the use of a novel protocol (Fig. 7.4) for the accurate
handling of the large amounts of short pyrosequenced data generated by the
Roche (454) GS FLX system [34, 35]. Using our protocol, in conjunction
with three standard phenotyping methods on 3384 full length V3 reads, 11
(charge rule), 7 (PSSM) and 33 (geno2pheno) CXCR4-using variants were
detected within the day 1 dataset that had been previously screened for the
absence of the CXCR4-using viruses. When 5986 partial and complete V3
reads were tested using the charge rule the number of CXCR4-using
variants detected increased to 18.
For the day 1 dataset the 11 (from the 3,384 complete V3 sequences) and
18 (from the 5,986 complete and partial sequences) CXCR4-using variants
that were detected using the charge rule are unlikely to be due to
substitution sequencing error. When the generic 0.1% pyrosequencing error
rate (for substitutions) [36] is applied to the nucleotides determining the
residues present at sites 11 and 24, assuming two non-synonymous
positions in each codon, the maximum number of expected false predictions
is less than 4 and 6 respectively. The ratio of positively charged codons to
negatively charged codons when determining this error rate is taken to be
8/53 as there are 8 codons resulting in a histidine, lysine or arginine. Site 25
was observed not to contribute to the phenotype switch within the day 1
dataset and so was omitted from the calculation.
When additional sites 7, 8, 22 and 30 were used for an extended charge rule
test (Table 7.3), on 10,204 full and partial V3 fragments, the number of reads
predicted to be of the CXCR4-using variants rose to 33 while the potential
error rate increases to <28 due to the addition of the extra sites. These extra
sites, where additional positive charge was observed within the CXCR4-
using variants (Figure 7.1), do not appear to have a major influence on
increasing the number of CXCR4-using variants detected. This suggests
that they may have changed amino acid residues after the phenotype switch
occurred and may not be of primary importance during the switch itself.
Detailed analysis of V3 loop data is thus required in order to determine the
exact evolutionary relationships between the sites within the region and the
role that each site plays in determining phenotype. Knowing that sites 11, 24
and 25 have the strongest influence [19, 20] what co-evolutionary
relationships exist between these sites and other sites within the region? For
example, could positive charge at other sites be solely responsible for co-
receptor switching or could more positively charged residues occur within the
region to increase the affinity for the CXCR4 co-receptor after the initial
switch has occurred based on the charge present at sites 11, 24 and 25? Our
results would indicate the importance of sites outside of 11, 24 and 25 could
be limited in relation to the initial phenotype switch.
Within the day 1 and day 11 datasets indels accounted for 1.3345 and
1.953% of the data respectively. This is the percentage of the total number
of bases sequenced on each day. For the day 1 dataset 64% of the indels
were insertions while the remaining 36% were deletions.
For the day 11 dataset these numbers were 62% and 38% respectively.
This ratio reflects those observed by Wang et al. on a smaller dataset [36]
where insertions were observed to be more frequent than the deletions. The
overall error rate calculated by Wang et al. was 0.98% of the dataset. The
difference between this error rate and our indel rates is presumably due to
the larger size of our data set.
The percentage of codon (triplet) insertions within the day 1 dataset was
observed to be 0.0054% (Table 7.1). For the charge rule three codons are
used to determine the CXCR4-using viruses. The chance that a complete
codon insertion will occur at any one of these sites within an individual read,
thus potentially disrupting the charge rule result, is 0.0054%. Within an
individual read the chance of having an altered false prediction based on a
codon insertion at any of the three sites is thus 0.0162%. With 5,986 reads
being tested the number of expected erroneous predictions based on such
insertions is 0.9697. This is a very small increase in the error rate and when
this is combined with the expected error due to mismatches (discussed above)
the number of CXCR4-using variants detected on day 1 is still above the
expected error, confirming the presence of CXCR4-using variants within the
day 1 dataset.
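The arithmetic behind the expected number of erroneous predictions due to codon insertions can be reproduced directly. The sketch below uses only the values quoted in the text; the function name is hypothetical.

```python
def expected_codon_insertion_errors(reads, rate_percent, sites=3):
    """Expected number of reads with a codon insertion at a charge rule site."""
    per_site = rate_percent / 100.0  # chance per site, as a fraction
    per_read = sites * per_site      # chance at any of the tested sites
    return per_read, reads * per_read


per_read, expected = expected_codon_insertion_errors(5986, 0.0054)
print(f"{per_read:.4%}")   # 0.0162%
print(round(expected, 4))  # 0.9697
```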
An analysis of the day 1 and day 11 phylogeny combined (Fig. 7.8) reveals
that at the centre of the main day 11 CXCR4-using cluster three day 1
CXCR4-using variants (red) can be identified. These are located closer to
the root of the cluster than the day 11 CCR5-using variants (blue) that can
be identified within the cluster. It is most probable that this cluster has
emerged through the selection of day 1 CXCR4-using variants following
maraviroc therapy as no day 1 CCR5-using variants are located within the
vicinity of the day 11 CXCR4-using variants.
The mean amino acid entropy values for the various regions within the
gp160 gene are displayed in Figure 7.7. The entropy values for the global
subtype B data are higher than those of the day 1 and day 11 intra patient
data. More of HIV-1’s viable sequence space is being explored at an inter
host level due to the genetics of different hosts rather than within a single
individual host at a single time point in the immune history [47]. It is
interesting to note in each of the datasets the higher entropies at the
exposed surface loops of the envelope protein when compared to the more
protected internal regions.
Both the day 1 and day 11 datasets provided complete coverage across the
gp160 region even after sequences of low identity were removed (Fig. 7.6).
When the read lengths were plotted against the pairwise alignment score
with the template it was observed that for reads of 250 nucleotides or fewer there
was no correlation between the length and the alignment score (Fig. 7.5,
Panels A and C). However reads larger than 250 nucleotides were often
associated with a poorer score. This emphasises that the current
pyrosequencing technology is not reliable for producing reads of longer
length [35]. This is possibly due to a number of cumulative factors
contributing to noise within individual wells during sequencing [36, 38].
The short nature of pyrosequencing data imposes some implicit problems,
such as template construction, storage, alignment, translation and isolation
of individual sites that our protocol overcomes. For the first time we have
provided a high resolution snapshot of intra patient data pre and post
treatment with the CCR5 inhibiting compound maraviroc. We have also
provided the ability to repeat this analysis on any dataset. Using this
snapshot we have presented evidence supporting the theory that the rapid
emergence of the CXCR4-using viruses following treatment of Patient A with
maraviroc was due to the pre treatment presence of CXCR4-using virus.
This presence was previously undetectable. Thus usage of maraviroc is
indeed a feasible option in the development of novel treatment regimes for
HIV when used in the absence of CXCR4-using viruses [1].
Future Developments
The protocol developed for managing the data output from the 454
sequencer played a central role in the analysis. It is intuitive, user friendly
and straightforward to implement. Its assumptions based on the treatment of
errors were discussed in the previous section. Although the effects of these
errors are small in relation to how the protocol conservatively dealt with
them, as discussed, the amount of data available makes it important to be as
precise as possible. Distinctions between pyrosequencing error and
potential naturally occurring error must be defined. The differences in these
error rates are not well understood [36, 38] and will have an effect (although
minimized within our protocol) on the accuracy of using pyrosequenced data
in relation to detecting low frequency variants within populations.
The observed error rates (Table 7.1 and [36]) confirm that preprocessing of
the 454 data is an important requirement in order to increase the accuracy
and reliability of results in relation to the detection of low frequency variants.
A proposed modified protocol that is based closely on Figure 7.4 is
presented in Figure 7.9. The modified version accounts for insertions within
the reads by incorporating a stage where the consensus template is
extended based on the frequencies of inserted complete codons within the
reads. Segments will be realigned to this extended consensus template
reducing the need to lose viable insertions that do not disrupt the reading
frame. Translated reads will be aligned to the translated template sequence
in order to get a more precise positioning of amino acid residues that are not
dependent on the original nucleotide alignment. The amino acid alignment
will then be used in order to obtain residues at individual sites or across
regions for phenotype testing as well as to refine the original nucleotide
alignments. The refined nucleotide alignments can be used in order to
obtain an exact measure of the intra host mutation rates within the HIV
lifecycle. Tree re-construction will occur as before but will use the realigned
reads so that more phylogenetically relevant information will be used to
determine the nature of clustering within the reads.
Figure 7.9 Updated Protocol
An extended version of the protocol presented in Figure 7.4. Grey boxes highlight the
parts to be updated as described in the text.
Acknowledgements
We would like to thank Richard Harrigan from the British Columbia Centre of
Excellence in HIV/AIDS, 454 Life Sciences for supplying the 454 sequence
data, the investigators, study-site staff, the Pfizer maraviroc development
team and the patients who participated in the maraviroc studies. In addition
we thank Alex Thielen for help with geno2pheno. This project was funded by
the BBSRC and Pfizer Global R&D.
References
1. Westby, M., et al., Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol, 2006. 80(10): p. 4909-20.
2. Koot, M., et al., HIV-1 biological phenotype in long-term infected individuals evaluated with an MT-2 cocultivation assay. Aids, 1992. 6(1): p. 49-54.
3. Moore, J.P., et al., The CCR5 and CXCR4 coreceptors--central to understanding the transmission and pathogenesis of human immunodeficiency virus type 1 infection. AIDS Res Hum Retroviruses, 2004. 20(1): p. 111-26.
4. Shankarappa, R., et al., Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol, 1999. 73(12): p. 10489-502.
5. Connor, R.I., et al., Change in coreceptor use correlates with disease progression in HIV-1--infected individuals. J Exp Med, 1997. 185(4): p. 621-8.
6. Connor, R.I. and D.D. Ho, Human immunodeficiency virus type 1 variants with increased replicative capacity develop during the asymptomatic stage before disease progression. J Virol, 1994. 68(7): p. 4400-8.
7. Bjorndal, A., et al., Coreceptor usage of primary human immunodeficiency virus type 1 isolates varies according to biological phenotype. J Virol, 1997. 71(10): p. 7478-87.
8. Koot, M., et al., Conversion rate towards a syncytium-inducing (SI) phenotype during different stages of human immunodeficiency virus type 1 infection and prognostic value of SI phenotype for survival after AIDS diagnosis. J Infect Dis, 1999. 179(1): p. 254-8.
9. Meng, G., et al., Primary intestinal epithelial cells selectively transfer R5 HIV-1 to CCR5+ cells. Nat Med, 2002. 8(2): p. 150-6.
10. Reece, J.C., et al., HIV-1 selection by epidermal dendritic cells during transmission across human skin. J Exp Med, 1998. 187(10): p. 1623-31.
11. Agace, W.W., et al., Constitutive expression of stromal derived factor-1 by mucosal epithelia and its role in HIV transmission and propagation. Curr Biol, 2000. 10(6): p. 325-8.
12. Rodrigo, A.G., Dynamics of syncytium-inducing and non-syncytium-inducing type 1 human immunodeficiency viruses during primary infection. AIDS Res Hum Retroviruses, 1997. 13(17): p. 1447-51.
13. Arien, K.K., et al., Replicative fitness of CCR5-using and CXCR4-using human immunodeficiency virus type 1 biological clones. Virology, 2006. 347(1): p. 65-74.
14. Whitcomb, J.M., et al., Development and characterization of a novel single-cycle recombinant-virus assay to determine human immunodeficiency virus type 1 coreceptor tropism. Antimicrob Agents Chemother, 2007. 51(2): p. 566-75.
15. Hwang, S.S., et al., Identification of the envelope V3 loop as the primary determinant of cell tropism in HIV-1. Science, 1991. 253(5015): p. 71-4.
16. Shankarappa, R., et al., Evolution of human immunodeficiency virus type 1 envelope sequences in infected individuals with differing disease progression profiles. Virology, 1998. 241(2): p. 251-9.
17. Shioda, T., J.A. Levy, and C. Cheng-Mayer, Small amino acid changes in the V3 hypervariable region of gp120 can affect the T-cell-line and macrophage tropism of human immunodeficiency virus type 1. Proc Natl Acad Sci U S A, 1992. 89(20): p. 9434-8.
18. Pollakis, G., et al., Phenotypic and genotypic comparisons of CCR5- and CXCR4-tropic human immunodeficiency virus type 1 biological clones isolated from subtype C-infected individuals. J Virol, 2004. 78(6): p. 2841-52.
19. Cardozo, T., et al., Structural basis for coreceptor selectivity by the HIV type 1 V3 loop. AIDS Res Hum Retroviruses, 2007. 23(3): p. 415-26.
20. Rosen, O., et al., Molecular switch for alternative conformations of the HIV-1 V3 region: implications for phenotype conversion. Proc Natl Acad Sci U S A, 2006. 103(38): p. 13950-5.
21. Jensen, M.A., et al., Improved coreceptor usage prediction and genotypic monitoring of R5-to-X4 transition by motif analysis of human immunodeficiency virus type 1 env V3 loop sequences. J Virol, 2003. 77(24): p. 13376-88.
22. Sing, T., N. Beerenwinkel, R. Kaiser, D. Hoffmann, M. Däumer, and T. Lengauer, Geno2pheno[coreceptor]: A tool for predicting coreceptor usage from genotype and for monitoring coreceptor-associated sequence alterations. 2005.
23. Lengauer, T., et al., Bioinformatics prediction of HIV coreceptor usage. Nat Biotechnol, 2007. 25(12): p. 1407-10.
24. Dorr, P., et al., Maraviroc (UK-427,857), a potent, orally bioavailable, and selective small-molecule inhibitor of chemokine receptor CCR5 with broad-spectrum anti-human immunodeficiency virus type 1 activity. Antimicrob Agents Chemother, 2005. 49(11): p. 4721-32.
25. Fatkenheuer, G., et al., Efficacy of short-term monotherapy with maraviroc, a new CCR5 antagonist, in patients infected with HIV-1. Nat Med, 2005. 11(11): p. 1170-2.
26. O'Brien, T.R., et al., HIV-1 infection in a man homozygous for CCR5 delta 32. Lancet, 1997. 349(9060): p. 1219.
27. Dean, M., et al., Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science, 1996. 273(5283): p. 1856-62.
28. Eugen-Olsen, J., et al., Heterozygosity for a deletion in the CKR-5 gene leads to prolonged AIDS-free survival and slower CD4 T-cell decline in a cohort of HIV-seropositive individuals. Aids, 1997. 11(3): p. 305-10.
29. Huang, Y., et al., The role of a mutant CCR5 allele in HIV-1 transmission and disease progression. Nat Med, 1996. 2(11): p. 1240-3.
30. Meyer, L., et al., Early protective effect of CCR-5 delta 32 heterozygosity on HIV-1 disease progression: relationship with viral load. The SEROCO Study Group. Aids, 1997. 11(11): p. F73-8.
31. Kapoor, A., et al., Sequencing-based detection of low-frequency human immunodeficiency virus type 1 drug-resistant mutants by an RNA/DNA heteroduplex generator-tracking assay. J Virol, 2004. 78(13): p. 7112-23.
32. Palmer, S., et al., Persistence of nevirapine-resistant HIV-1 in women after single-dose nevirapine therapy for prevention of maternal-to-fetal HIV-1 transmission. Proc Natl Acad Sci U S A, 2006. 103(18): p. 7094-9.
33. Palmer, S., et al., Multiple, linked human immunodeficiency virus type 1 drug resistance mutations in treatment-experienced patients are missed by standard genotype analysis. J Clin Microbiol, 2005. 43(1): p. 406-13.
34. Ronaghi, M., M. Uhlen, and P. Nyren, A sequencing method based on real-time pyrophosphate. Science, 1998. 281(5375): p. 363, 365.
35. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.
36. Wang, C., et al., Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res, 2007. 17(8): p. 1195-201.
37. Pop, M. and S.L. Salzberg, Bioinformatics challenges of new sequencing technology. Trends Genet, 2008. 24(3): p. 142-9.
38. Brockman, W., et al., Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res, 2008.
39. Quinlan, A.R., et al., Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods, 2008. 5(2): p. 179-81.
40. Hoffmann, C., et al., DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations. Nucleic Acids Res, 2007. 35(13): p. e91.
41. O'Meara, D., et al., Monitoring resistance to human immunodeficiency virus type 1 protease inhibitors by pyrosequencing. J Clin Microbiol, 2001. 39(2): p. 464-73.
42. Trombetti, G.A., et al., Data handling strategies for high throughput pyrosequencers. BMC Bioinformatics, 2007. 8 Suppl 1: p. S22.
43. Edgar, R.C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 2004. 32(5): p. 1792-7.
44. Smith, T.F. and M.S. Waterman, Identification of common molecular subsequences. J Mol Biol, 1981. 147(1): p. 195-7.
45. Guindon, S. and O. Gascuel, A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol, 2003. 52(5): p. 696-704.
46. Lewis, M., et al., Evaluation of an Ultra-Deep Sequencing Method to Identify Minority Sequence Variants in the HIV-1 env Gene from Clinical Samples, in 14th Conference on Retroviruses and Opportunistic Infections. 2007: Los Angeles, USA.
47. Grenfell, B.T., et al., Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 2004. 303(5656): p. 327-32.
Chapter 8: Final Discussion
The extensive diversity generated within HIV-1’s phylogeny is daunting in
relation to the production of a successful therapeutic or preventative vaccine
[1-3]. Recently, after more than 33 million deaths since the introduction of
the virus into the human population, the scale of this task has been
highlighted by the failure of the once promising Merck STEP vaccine clinical
trials [4, 5]. This, along with other failures [3, 6], suggests that some
significant decisions about the direction of research in relation to HIV-1 must
be made. For example: in the face of such diversity, is it actually possible to
design a successful vaccine, or are researchers misplacing their efforts? If
the latter is the case, should there be more focus on non-vaccine-based
strategies for reducing viral load within individual hosts, such as co-receptor
blockade? The chapters of this thesis suggest that it is not yet time to
abandon hope of developing a therapeutic or preventative vaccine, although
novel algorithmic approaches to its development are required, as the
traditional approaches to vaccine design will not be effective in dealing
with the diversity being generated by HIV-1.
Let us begin this concluding chapter by returning to the clear quantifiable
differences that exist between the global group M subtypes and the poorly
defined group O clusters [7]. In chapter 4, in agreement with Roques et al.
[8], it was proposed that it is inappropriate to draw direct parallels between
these two groups in relation to the diversity present within the clusters
observed on each of their phylogenetic tree topologies. The weak clustering
found within the group O phylogenetic topology has emerged on a localized
scale and is not significant in relation to randomly generated trees (Fig. 4.3).
As a result, group O clusters contain little useful sequence information in
relation to designing cluster-specific vaccines. Usage of individual strains,
consensus sequences [9-11] or even novel algorithmic approaches (chapter 6) in
attempts to develop a cluster-specific vaccine for the group would most likely
fail and would be a wasteful diversion of resources. In contrast, the
well-defined global group M subtypes have emerged as a result of a more
complicated epidemiological history (Fig. 4.5) involving the passaging of the
virus in association with different localized risk groups following a severe
founder effect [12, 13]. However, even in the case of these well-defined
subtypes, the use of subtype-specific vaccines, although initially promising
[11], fails as a result of the extensive diversity present. Traditional
approaches to vaccine design [9-11] are simply not going to work.
Novel algorithmic approaches, moving away from a dependency on tree
topology, are required in order to generate artificial sequence constructs that
maximize the coverage of the epitope diversity present within subsets of the
sequence variation constituting the global pandemic [14, 15]. Such
constructs could then potentially be used as a starting point for the design of
a polyvalent vaccine. To date, early attempts at algorithmic approaches,
although promising, have largely ignored the fact that not all variation
within sampled strains is significant (Fig. 6.2, Panel A). With the
removal of “unimportant” variation, improved algorithms can be developed
(Fig. 6.5) to generate artificial sequence constructs (Fig. 8.1) that
maximize the epitope coverage across the meaningful diversity within
individual subtypes in relation to specific geographic locations or within
specific risk groups (Fig. 6.4).
Such approaches could be improved by accounting for additional features of
protein structure and function, such as functional sites [16], the requirement
to maintain binding interfaces [17], and intra- and inter-molecular interactions
[18], as well as by the inclusion of known antigenic epitopes [19]. The latter
is especially important to consider as by maximizing nine-mer coverage
across the alignment, without consideration of “good” epitope inclusion, the
algorithm is implicitly selecting for the inclusion of more conserved nine-mers
within the construct sequences (Fig. 8.1). Such nine-mers themselves may
not be the ideal antigenic targets, as intuitively their conservation implies
that they may be under less immune pressure than the less conserved nine-mers.
It is therefore important to ensure the inclusion of known epitopes
along with the inclusion of high frequency nine-mers.
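
The trade-off between nine-mer coverage and epitope inclusion can be made concrete with a short Python sketch. It is a simplified illustration rather than the algorithm of chapter 6: the function names, the toy sequences and the placeholder epitope strings are all assumptions, and coverage is defined here simply as the fraction of nine-mer occurrences in the input sequences that also occur in at least one construct.

```python
from collections import Counter


def kmers(seq, k=9):
    """Return all overlapping k-mers of a sequence, ignoring alignment gaps."""
    seq = seq.replace("-", "")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def coverage(constructs, sequences, k=9):
    """Fraction of k-mer occurrences in `sequences` found in any construct."""
    covered = set()
    for construct in constructs:
        covered.update(kmers(construct, k))
    counts = Counter(km for seq in sequences for km in kmers(seq, k))
    total = sum(counts.values())
    hits = sum(n for km, n in counts.items() if km in covered)
    return hits / total if total else 0.0


def epitope_included(constructs, epitopes):
    """Report, for each known epitope, whether some construct contains it."""
    ungapped = [c.replace("-", "") for c in constructs]
    return {e: any(e in c for c in ungapped) for e in epitopes}


# Toy input: two variants differing at one residue (illustrative only).
SEQUENCES = ["MGARASVLSGG", "MGARASILSGG"]
CONSTRUCTS = ["MGARASVLSGG"]
# Placeholder epitope strings, not taken from any epitope database.
EPITOPES = ["ASVLSGG", "ASILSGG"]

score = coverage(CONSTRUCTS, SEQUENCES)
included = epitope_included(CONSTRUCTS, EPITOPES)
```

A construct set can score well on coverage while still omitting a known epitope, which is why the two checks are kept separate in this sketch.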
Figure 8.1 Novel Approach to Generating Optimized Constructs
Examples of sequence constructs generated in chapter 6 across the p17 region for HIV-1
group M sequences (orange box). The plots were taken from figure 6.4 panels A and C.
The numbers below the sequences in the orange box represent the local covering score of
selected nine-mers as described in chapter 6.
These improvements can be incorporated, with some minor modifications to
our algorithm (Fig. 6.5), into the approach that we have taken within chapter
6 during the sequence construction process after unnecessary diversity has
been removed from the input datasets. This is the topic of future work. The
sooner the effects of these improvements are investigated the sooner we will
know if efforts to produce a therapeutic vaccine are viable. If such novel
approaches fail, like previous more traditional attempts [3-6], the future for
vaccine design will start to look bleak.
Even if polyvalent cocktails of artificial sequence constructs, providing
adequate epitope coverage for a given subtype within specific geographic
regions or specific risk groups, could eventually be created using an
algorithmic approach, the problems caused by the extensive diversity present
within HIV-1’s phylogeny would still be far from over. This is because the subtype
classification system has been devised through the sampling of strains
globally [12, 13]. Presently strains from the DRC region are classified
according to the global subtype that they fall closest to (Chapter 4). The
distinctness of these “globally”-defined “pre-DRC data” subtypes is due to
individual strains being exported from the DRC region followed by
subsequent diversification within localized host populations (Fig. 4.5). They
do not accurately represent the diversity that is present within the centre of
the pandemic (Fig. 4.4). Classifying strains from the DRC region according
to the current global classification system greatly under-represents the
extensive diversity within the region [12, 13, 20].
As a result, a globally defined, subtype-specific cocktail of artificially
constructed sequences could not guarantee protection within the centre of
the pandemic against the strains classified according to the subtype that the
cocktail was designed for. Furthermore, within specific geographical regions
or global risk groups, a single subtype-specific cocktail could not guarantee
adequate protection against newly emerging divergent strains from the
epicenter that have been loosely classified according to the current global
subtype classification system. Recombination between the current
subtypes, where geographic regions or risk groups come into contact with
each other and so allow dual infection to occur, would also
lead to problems for such subtype-specific polyvalent cocktails due to the
generation of new divergent strains. A dynamic approach to vaccine
development is thus required that would involve the constant monitoring of
the diversity present within subtypes, risk groups and geographic regions
followed by the continuous updating of the sequences included within the
polyvalent cocktail covering these specific risk groups and regions.
Despite these problems the potential ability to cover large portions of the
global population away from the epicenter with a preventative or therapeutic
vaccine would be a breakthrough in relation to tackling the global pandemic.
But with the epicenter left without a preventative or therapeutic vaccine, even
within a globally vaccinated population there will always be the possibility of
re-emerging strains forming new resistant clusters (Fig. 4.5). Unfortunately,
the extent of diversity present within the epicenter (Fig. 4.4) makes the
development of a polyvalent vaccine for the region a highly remote
possibility. Thus, although it may eventually be possible to control the virus
globally and reduce the number of infections, the threat of re-emerging strains
will remain a long-term problem for the foreseeable future.
As briefly mentioned above, a major contributory factor to this problem is
recombination, which contributes significantly to the diversity present
within HIV-1’s phylogeny at both an intra- and inter-host level (Chapter 5).
However, not all recombinant breakpoints are equivalent in relation to the
persistence of the virus within the global pandemic. Across much of the viral
genome (Fig. 5.4) breakpoint positioning can be explained by a simple
mechanistic process that is largely described by the model presented in
Chapter 5. Regions that deviate from the underlying mechanistic breakpoint
distribution are clearly of greater significance in terms of the global
pandemic. Primarily, the deviation at either end of the envelope gene
identifies it as a region that is selectively transferred from one genetic
background into another as an integral cassette.
There is a need to focus on identifying the functional significance of
recombination in vivo rather than on the blind mapping of breakpoints. Our
results in Chapter 5 emphasize that detailed mapping of individual HIV-1
recombinant structures should be considered in the context of a probabilistic
expectation generated by the process of template switching during reverse
transcription. Individual recombination breakpoints, analogous to point
mutations, will have varying consequences for viral persistence in infected
individuals and populations. Our findings provide the first clear indication of
how recombinant forms predominantly influence the viral population in the
ongoing AIDS pandemic.
So far within this concluding chapter the focus has largely been on HIV-1’s
diversity at the inter-host level. At the intra-patient level the
generation of diversity is also immense [21, 22]. However, during the early
stages of infection there is a large dominance of the CCR5-using phenotype
[23]. Reasons for this are unknown but may include selection at the point of
transmission [24-26], a higher cytopathicity of CXCR4-using variants
resulting in host cells with a shorter life span [27] as well as differing fitness
levels of the two variants at different stages of disease progression [28]. The
consequence of the CCR5-using phenotype dominance is that it provides a
novel drug target early on within the host.
The recently developed small-molecule drug maraviroc attempts to
control viremia by blocking the CCR5 co-receptor, thus making it unavailable
for HIV-1 cell entry [29, 30]. This effectively simulates the CCR5Δ32
phenotype [31-34] observed within some individuals, which results in either a
delayed progression to AIDS (heterozygous) or a strong resistance to HIV
(homozygous). However, with the observation that the presence of the
CXCR4-using phenotype pre-treatment predicts treatment failure [35], it has
become vitally important to the producers of maraviroc, Pfizer, to develop a
cost-effective genotypic system for the detection of such variants that exist
within the population in order to determine the effectiveness of the drug
within an individual. Pyrosequencing technology makes the cheap and
accurate detection of these low frequency variants possible [36-39] for
individual patients. Until recently, the analysis of large amounts of short-read
pyrosequencing data had proven a monumental task [40].
In chapter 7 we developed and presented a protocol (Fig. 7.4) that simplifies
the handling of pyrosequencing data. We used the protocol in conjunction with
three known HIV-1 phenotype tests in order to detect the pre-treatment
presence of low frequency CXCR4-using variants within an individual drug
failure patient. Phylogenetic analysis (Fig. 7.8) and phenotype test results
(Table 7.2) strongly supported the conclusion, in agreement with Westby et al.
[35], that these variants initially emerged pre-treatment and that the drug was
most probably not responsible for their emergence. Later on during treatment
the low frequency variants were selected for, and so viremia was not reduced
because of the positive selection of the CXCR4-using phenotype. In a
well managed environment, with sensitive low frequency variant detection
tools such as that presented in figure 7.4, maraviroc remains a good
candidate for the control of viremia within some infected individuals. With
such novel compounds and new approaches to vaccine design it is possible
to remain optimistic about the future development of preventative and
treatment strategies for HIV.
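
To make the idea of low frequency variant detection concrete, the following Python sketch applies the widely used 11/25 rule, in which a basic residue (R or K) at V3 position 11 or 25 is taken to suggest CXCR4 use, to a set of translated V3 reads and reports the fraction flagged. This is an illustration only: it is not claimed to be one of the three phenotype tests used in chapter 7, the function names and toy reads are assumptions, and real reads would first need to be aligned so that positions 11 and 25 are correctly identified.

```python
BASIC_RESIDUES = frozenset("RK")


def flags_cxcr4(v3, positions=(11, 25)):
    """11/25 rule: True if a basic residue sits at any listed 1-based V3 position."""
    return any(len(v3) >= p and v3[p - 1] in BASIC_RESIDUES for p in positions)


def flagged_fraction(v3_reads):
    """Fraction of V3 amino acid reads flagged as putatively CXCR4-using."""
    if not v3_reads:
        return 0.0
    return sum(flags_cxcr4(read) for read in v3_reads) / len(v3_reads)


# Toy reads (illustrative only): a 35-residue V3 loop with S at position 11
# and E at position 25, and a variant carrying R at position 11.
R5_LIKE = "CTRPNNNTRKSIHIGPGRAFYTTGEIIGDIRQAHC"
X4_LIKE = "CTRPNNNTRKRIHIGPGRAFYTTGEIIGDIRQAHC"

reads = [R5_LIKE] * 98 + [X4_LIKE] * 2
fraction = flagged_fraction(reads)
print(f"{fraction:.2%} of reads flagged as putatively CXCR4-using")
```

In practice the flagged fraction would need to be compared against the sequencing error rate before a minority variant is called.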
In relation to the protocol presented in figure 7.4 future work will involve
implementing this protocol as a software package that can be used routinely
to process such data. Such a package will make the advantages of
pyrosequencing readily available to large portions of the HIV-1 research
community.
Conclusion
With our current understanding of the diversity present within HIV-1 there
appears to be potential for progress in relation to the development of a
therapeutic or preventative vaccine in the not too distant future. If this does
fail, however, there is still the possibility of developing successful treatment
regimes with novel compounds such as maraviroc. The former would be a
more affordable option in relation to the many countries where expensive
and tiresome drug regimes would not be practical. The latter would require
the development of more sophisticated tools for the careful monitoring of low
frequency viral variants within individual hosts. Despite this optimism, even if a
successful vaccine or treatment regime were to be produced, HIV-1 is going
to remain a long-term problem within the human population. This is primarily
a result of the diversity generated at the epicenter of the pandemic. The
best we can hope for in the near future is the better management of the
pandemic in relation to the geographic areas outside of this region.
References
1. Garber, D.A., G. Silvestri, and M.B. Feinberg, Prospects for an AIDS vaccine: three big questions, no easy answers. Lancet Infect Dis, 2004. 4(7): p. 397-413.
2. Sheppard, N. and Q. Sattentau, The prospects for vaccines against HIV-1: more than a field of long-term nonprogression? Expert Rev Mol Med, 2005. 7(2): p. 1-21.
3. Cohen, J., Planned tests in Thailand spark debate. Science, 1997. 276(5316): p. 1197.
4. HIV vaccine failure prompts Merck to halt trial. Nature, 2007. 449(7161): p. 390.
5. Cohen, J., AIDS research. Promising AIDS vaccine's failure leaves field reeling. Science, 2007. 318(5847): p. 28-9.
6. McCarthy, M., HIV vaccine fails in phase 3 trial. Lancet, 2003. 361(9359): p. 755-6.
7. Archer, J. and D.L. Robertson, Understanding the diversification of HIV-1 groups M and O. Aids, 2007. 21(13): p. 1693-700.
8. Roques, P., et al., Phylogenetic analysis of 49 newly derived HIV-1 group O strains: high viral diversity but no group M-like subtype structure. Virology, 2002. 302(2): p. 259-73.
9. Gaschen, B., et al., Diversity considerations in HIV-1 vaccine selection. Science, 2002. 296(5577): p. 2354-60.
10. Nickle, D.C., et al., Consensus and ancestral state HIV vaccines. Science, 2003. 299(5612): p. 1515-8; author reply 1515-8.
11. Weaver, E.A., et al., Cross-subtype T-cell immune responses induced by a human immunodeficiency virus type 1 group m consensus env immunogen. J Virol, 2006. 80(14): p. 6745-56.
12. Rambaut, A., et al., Human immunodeficiency virus. Phylogeny and the origin of HIV-1. Nature, 2001. 410(6832): p. 1047-8.
13. Worobey, M., The Origins and Diversification of HIV. Global HIV/AIDS Medicine, 2007: p. 13–21.
14. Fischer, W., et al., Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nat Med, 2007. 13(1): p. 100-6.
15. Nickle, D.C., et al., Coping with viral diversity in HIV vaccine design. PLoS Comput Biol, 2007. 3(4): p. e75.
16. Fiorentini, S., et al., Functions of the HIV-1 matrix protein p17. New Microbiol, 2006. 29(1): p. 1-10.
17. Chelliah, V., et al., Distinguishing structural and functional restraints in evolution in order to identify interaction sites. J Mol Biol, 2004. 342(5): p. 1487-504.
18. Overington, J., et al., Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction. Proc R Soc Lond B Biol Sci, 1990. 241(1301): p. 132-45.
19. Frahm N, L.C., Brander C, Identification of HIV-Derived, HLA Class I Restricted CTL Epitopes: Insights into TCR Repertoire, CTL Escape and Viral Fitness. HIV Sequence Compendium, ed. F.B. Thomas Leitner, Hahn B, Marx P, McCutchan F, Mellors J, Wolinsky S, Korber B. 2006/2007: Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM.
20. Vidal, N., et al., Unprecedented degree of human immunodeficiency virus type 1 (HIV-1) group M genetic diversity in the Democratic Republic of Congo suggests that the HIV-1 pandemic originated in Central Africa. J Virol, 2000. 74(22): p. 10498-507.
21. Korber, B., et al., Evolutionary and immunological implications of contemporary HIV-1 variation. Br Med Bull, 2001. 58: p. 19-42.
22. Taylor, J.E. and B.T. Korber, HIV-1 intra-subtype superinfection rates: estimates using a structured coalescent with recombination. Infect Genet Evol, 2005. 5(1): p. 85-95.
23. Connor, R.I., et al., Change in coreceptor use correlates with disease progression in HIV-1--infected individuals. J Exp Med, 1997. 185(4): p. 621-8.
24. Agace, W.W., et al., Constitutive expression of stromal derived factor-1 by mucosal epithelia and its role in HIV transmission and propagation. Curr Biol, 2000. 10(6): p. 325-8.
25. Meng, G., et al., Primary intestinal epithelial cells selectively transfer R5 HIV-1 to CCR5+ cells. Nat Med, 2002. 8(2): p. 150-6.
26. Reece, J.C., et al., HIV-1 selection by epidermal dendritic cells during transmission across human skin. J Exp Med, 1998. 187(10): p. 1623-31.
27. Rodrigo, A.G., Dynamics of syncytium-inducing and non-syncytium-inducing type 1 human immunodeficiency viruses during primary infection. AIDS Res Hum Retroviruses, 1997. 13(17): p. 1447-51.
28. Arien, K.K., et al., Replicative fitness of CCR5-using and CXCR4-using human immunodeficiency virus type 1 biological clones. Virology, 2006. 347(1): p. 65-74.
29. Dorr, P., et al., Maraviroc (UK-427,857), a potent, orally bioavailable, and selective small-molecule inhibitor of chemokine receptor CCR5 with broad-spectrum anti-human immunodeficiency virus type 1 activity. Antimicrob Agents Chemother, 2005. 49(11): p. 4721-32.
30. Fatkenheuer, G., et al., Efficacy of short-term monotherapy with maraviroc, a new CCR5 antagonist, in patients infected with HIV-1. Nat Med, 2005. 11(11): p. 1170-2.
31. Dean, M., et al., Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study. Science, 1996. 273(5283): p. 1856-62.
32. Eugen-Olsen, J., et al., Heterozygosity for a deletion in the CKR-5 gene leads to prolonged AIDS-free survival and slower CD4 T-cell decline in a cohort of HIV-seropositive individuals. Aids, 1997. 11(3): p. 305-10.
33. Huang, Y., et al., The role of a mutant CCR5 allele in HIV-1 transmission and disease progression. Nat Med, 1996. 2(11): p. 1240-3.
34. O'Brien, T.R., et al., HIV-1 infection in a man homozygous for CCR5 delta 32. Lancet, 1997. 349(9060): p. 1219.
35. Westby, M., et al., Emergence of CXCR4-using human immunodeficiency virus type 1 (HIV-1) variants in a minority of HIV-1-infected patients following treatment with the CCR5 antagonist maraviroc is from a pretreatment CXCR4-using virus reservoir. J Virol, 2006. 80(10): p. 4909-20.
36. Huse, S.M., et al., Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol, 2007. 8(7): p. R143.
37. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80.
38. Ronaghi, M., M. Uhlen, and P. Nyren, A sequencing method based on real-time pyrophosphate. Science, 1998. 281(5375): p. 363, 365.
39. Wang, C., et al., Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance. Genome Res, 2007. 17(8): p. 1195-201.
40. Trombetti, G.A., et al., Data handling strategies for high throughput pyrosequencers. BMC Bioinformatics, 2007. 8 Suppl 1: p. S22.
Appendix I: HIV-1 group M Gag and Pol Trees
Supplementary Figure 1 Phylogenetic History of HIV-1 Group M
Panel A displays a tree that was re-constructed using global-group M gag sequences.
Panel B displays a tree that was re-constructed using global-group M pol sequences. The
‘*’ indicates bootstrap support greater than 90%. The scale bar corresponds to nucleotide
substitutions per site. In panels A and B the bold lettering corresponds to the subtype
designations from the LANL HIV Sequence Database.
Appendix II: HIV-1 group O Gag and Pol Trees
Supplementary Figure 2 Phylogenetic History of HIV-1 Group O
Panel A displays a tree that was re-constructed using group O gag sequences. Panel B
displays a tree that was re-constructed using group O pol sequences. The ‘*’ indicates
bootstrap support greater than 90%. The scale bar corresponds to nucleotide substitutions
per site. The bold numbers represent the clusters that correspond to previously proposed
clusters as described in chapter 4.
Appendix III: Calculating the Reduction in Breakpoint Occurrence
Supplementary Figure 3 Reduction in Breakpoint Occurrence
The triangles represent the frequency of breakpoints occurring within breakpoint zones of
size 5 or less. The rectangles represent the frequency of breakpoints occurring within zones
of greater than size 5. The slope of the line of best fit through the data within the smaller
breakpoint zones is 0.0153 while for the larger zones the slope of the line of best fit is
0.0415. The α parameter of equations 5 and 6 from chapter 5 is the ratio of the former to the
latter (0.37).
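
As a quick check of the arithmetic, the ratio can be reproduced from the two slopes quoted in the caption. The variable names below are illustrative only.

```python
# Slopes of the lines of best fit quoted in the caption above.
slope_small_zones = 0.0153  # breakpoint zones of size 5 or less
slope_large_zones = 0.0415  # breakpoint zones of size greater than 5

# Ratio of the smaller-zone slope to the larger-zone slope.
alpha = slope_small_zones / slope_large_zones
print(f"alpha = {alpha:.2f}")
```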
Appendix IV: Amino Acid Occurrence within the p17
Supplementary Table 1 Probability of Amino Acid Occurrence within the p17 gene
Appendix V: TP and FP rates for the Reduction Model
Supplementary Table 2 TP and FP rates for the Model Described in Figure 6.1